  1. Oct 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the manuscript by Su et al., the authors present a massively parallel reporter assay (MPRA) measuring the stability of in vitro transcribed mRNAs carrying wild-type or mutant 5' or 3' UTRs transfected into two different human cell lines. The goal presented at the beginning of the manuscript was to screen for effects of disease-associated point mutations on the stability of the reporter RNAs carrying partial human 5' or 3' UTRs. However, the majority of the manuscript is dedicated to identifying sequence components underlying the differential stability of reporter constructs. This shows that TA dinucleotides are the most predictive feature of RNA stability in both cell lines and both UTRs.

      The effect of AU rich elements (AREs) on RNA stability is well established in multiple systems, and the present study confirms this general trend but points out variability in the consequence of seemingly similar motifs on RNA stability. For example, the authors report that a long stretch of Us has extreme opposite effects on RNA stability depending on whether it is preceded by an A (strongly destabilizing) or followed by an A (strongly stabilizing). While the authors' interpretation of a context dependence of the effect is certainly well-founded, it seems counterintuitive that the preceding or following A would be the (only) determining factor. This points to a generally reductionist approach taken by the authors in the analysis of the data and in their attempt to dissect the contribution of "AU rich sequences" to RNA stability, with a general tendency to reduce the size and complexity of the features (e.g. to dinucleotides). While this certainly increases the statistical power of the analysis due to the number of occurrences of these motifs, it limits the interpretability of the results. How do TA dinucleotides per se contribute to destabilizing the RNA, both in 5' and 3' UTRs, but (according to limited data presented) not in coding sequences? What is the mechanism? RBPs binding to TA dinucleotide containing sequences are suggested to "mask" the destabilizing effect, thereby leading to a more stable RNA. Gain of TA dinucleotides is reported to have a destabilizing effect, but again no hypothesis is provided as to the underlying molecular mechanism.

      In addition to reducing the motif length to dinucleotides, the notion of "context dependence" is used in a very narrow sense; especially when focusing on simple and short motifs, a more extensive analysis of the interdependence of these features (beyond the existing analysis of the relationship between TA-diNTs and GC content) could potentially reveal more of the context dependence underlying the seemingly opposite behavior of very similar motifs.

      (We have used UA instead of TA, as per the reviewer's suggestion)

      The contribution of coding region sequence to RNA stability has been extensively discussed (For example: doi.org/10.1016/j.molcel.2022.03.032; doi.org/10.1186/s13059-020-02251-5; doi.org/10.15252/embr.201948220; doi.org/10.1371/journal.pone.0228730; doi.org/10.7554/eLife.45396). While UA content at the third codon position (wobble position) has been implicated as a pro-degradation signal, codon optimality has emerged as the most prominent determinant for RNA stability. This indicates that the role of coding regions in RNA stability differs from that of UTRs due to the involvement of translation elongation. We did not intend to suggest that UA-dinucleotides in UTRs and coding regions have the same effect. 

      To ensure the representativeness of the features entered into the LASSO model, we pre-selected those with an occurrence greater than 10% among all UTRs. As a result, while motifs with very low occurrences were excluded from the analysis, there is no evidence to indicate a preference for dinucleotides by the LASSO model.

      We hypothesize that UA dinucleotides may recruit endonucleases of the RNase A family, whose catalytic pockets exhibit a strong bias for UA dinucleotides (doi.org/10.1016/j.febslet.2010.04.018). Structures or bound proteins that block this recognition might stabilize RNAs. To gain further insight into motif interactions, we conducted a linear regression analysis of the interactions between UA and each of the other 15 dinucleotides. The model used for each dinucleotide was:

      $$\text{half-life}_i = \beta_0 + \beta_{UA}\,UA_i + \beta_{DiNT}\,DiNT_i + \beta_{GC\%}\,GC\%_i + \beta_{UA \times GC\%}\,(UA_i \times GC\%_i) + \beta_{UA \times DiNT}\,(UA_i \times DiNT_i) + \epsilon_i$$

      where all 𝛽 terms represent the regression coefficients; UA<sub>i</sub>, DiNT<sub>i</sub>, and GC%<sub>i</sub> represent the number of UA dinucleotides, the number of other dinucleotides (other than UA), and the GC content of the i<sup>th</sup> UTR, respectively; and 𝜖<sub>i</sub> denotes the error term. For each dinucleotide, we tested the significance of 𝛽<sub>UAxGC%</sub> and 𝛽<sub>UAxDiNT</sub>, and compared their p-values using a quantile-quantile (QQ) plot. Author response image 1 shows that the interaction effect of UA dinucleotides with GC% is much more significant than interactions with the other 15 dinucleotides, as indicated by the inflated QQ plot of p-values. This suggests that GC content is a more critical contextual factor influencing UA dinucleotides' impact on RNA stability.

      Author response image 1.

      The present MPRA measures the effect of UTR sequences in one specific reporter context and using one experimental approach (following the decay of in vitro transcribed and transfected RNAs). While this approach certainly has its merits compared to other approaches, it also comes with some caveats: RNA is delivered naked, without bound RBPs and no nuclear history, e.g. of splicing (no EJCs), editing and modifications. One way to assess the generalizability of the results as well as the context dependence of the effects is to perform the same analysis on existing datasets of RNA stability measurements obtained through other methods (e.g. transcription inhibition). Are TA dinucleotides universally the most predictive feature of RNA half-lives?

      Our system studies the stability control of RNA synthesized in vitro and delivered into human cells. While we did not intend to generalize our conclusions to endogenous RNAs, our approach contributes to the understanding of in vitro synthesized RNA used for cellular expression, such as in vaccines. It is known that endogenous RNAs undergo very different regulation. The most prominent factors controlling endogenous RNA stability are the density of splice junctions and the length of UTRs (doi.org/10.1186/s13059-022-02811-x; doi.org/10.1186/s12915-021-00949-x). To decipher the sequence regulation, these factors are controlled in our experiments. Therefore, we do not expect the dinucleotide features found by our approach to be generalized as the most predictive feature of RNA half-life in vivo. 

      The authors conclude their study with a meta-analysis of genes with increased TA dinucleotides in 5' and 3'UTRs, showing that specific functional groups are overrepresented among these genes. In addition, they provide evidence for an effect of disease-associated UTR mutations on endogenous RNA stability. While these elements link back to the original motivation of the study (screening for effects of point mutations in 5' and 3' UTRs), they provide only a limited amount of additional insights.

      We utilized the Taiwan Biobank to investigate whether mutations significantly affecting RNA stability also impact human biochemical measurements. Our findings indicate that these mutations indeed have a significant effect on various biochemical indices. This highlights the importance of our study, as it bridges basic science with potential applications in precision medicine. By linking specific UTR mutations with measurable changes in biochemical indices, our research underscores the potential for these findings to inform targeted medical interventions in the future.

      In summary, this manuscript presents an interesting addition to the long-standing attempts at dissecting the sequence basis of RNA stability in human cells. The analysis is in general very comprehensive and sound; however, at times the goal of the authors to find novelty and specificity in the data overshadows some analyses. One example is the case where the authors try to show that TA-dinucleotides and GC content are decoupled and not merely two sides of the same coin.

      They claim that the effect of TA dinucleotides is different between high- and low-GC content contexts but do not control for the fact that low GC-content regions naturally will contain more TA dinucleotides and therefore the effect sizes and the resulting correlation between TA-diNT rate and stability will be stronger (Fig. 5A). A more thorough analysis and greater caution in some of the claims could further improve the credibility of the conclusions.

      Low GC content implies a higher UA content but does not directly equate to a high UA-dinucleotide ratio. For instance, the sequence AUUGAACCUU has a lower GC content (0.3) than UAUAGGCCGC (0.6), yet it also has a lower UA-dinucleotide ratio (0 vs. 0.22). To address this concern more rigorously, we performed a stratified analysis based on UA-diNT rate. As shown in our Fig. S7C, even after stratifying by UA-dinucleotide ratio (upper panel: high UA-dinucleotide ratio; lower panel: low UA-dinucleotide ratio), we still observe that the destabilizing effect of UA is stronger in the low GC content group.
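      The two example sequences above can be checked with a short calculation. This is a minimal sketch, assuming that the UA-dinucleotide ratio is the number of UA dinucleotides divided by the number of overlapping dinucleotide positions (sequence length minus one); the function names are hypothetical and are not taken from the manuscript.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def ua_dinucleotide_ratio(seq: str) -> float:
    """Fraction of overlapping dinucleotide positions that read 'UA'.

    Assumption: the denominator is the number of dinucleotide positions
    (len(seq) - 1); DNA-style 'T' is treated as 'U'.
    """
    seq = seq.upper().replace("T", "U")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return pairs.count("UA") / len(pairs)


for utr in ("AUUGAACCUU", "UAUAGGCCGC"):
    print(utr, round(gc_content(utr), 2), round(ua_dinucleotide_ratio(utr), 2))
```

      Under this definition the first sequence has the lower GC content and also the lower UA-dinucleotide ratio, which is the point of the example: base composition alone does not fix the dinucleotide ratio.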

      Reviewer #2 (Public Review):

      Summary of goals:

      Untranslated regions are key cis-regulatory elements that control mRNA stability, translation, and translocation. Through interactions with small RNAs and RNA binding proteins, UTRs form complex transcriptional circuitry that allows cells to fine-tune gene expression. Functional annotation of UTR variants has been very limited, and improvements could offer insights into disease relevant regulatory mechanisms. The goals were to advance our understanding of the determinants of UTR regulatory elements and characterize the effects of a set of "disease-relevant" UTR variants.

      Strengths:

      The use of a massively parallel reporter assay allowed for analysis of a substantial set (6,555 pairs) of 5' and 3' UTR fragments compiled from known disease associated variants. Two cell types were used.

      The findings confirm previous work about the importance of AREs, which helps show validity and adds some detailed comparisons of specific AU-rich motif effects in these two cell types.

      Using a Lasso regression, TA-dinucleotide content is identified as a strong regulator of RNA stability in a context dependent manner based on GC content and presence of RNA binding protein binding motifs. The findings have potential importance, drawing attention to a UTR feature that is not well characterized.

      The use of complementary datasets, including from half-life analyses of RNAs and from random sequence library MPRAs, is a useful addition and supports several important findings. The finding that TA dinucleotides have explanatory power separate from (and in some cases interacting with) GC content is valuable.

      The functional enrichment analysis suggests some new ideas about how UTRs may contribute to regulation of certain classes of genes.

      Weaknesses:

      It is difficult to understand how the calculations for half-life were performed. The sequencing approach measures the relative frequency of each sequence at each time point (less stable sequences become relatively less frequent after time 0, whereas more stable sequences become relatively more frequent after time 0). Since there is no discussion of whether the abundance of the transfected RNA population is referenced to some external standard (e.g., housekeeping RNAs), it is not clear how absolute (rather than relative) half-lives were determined.

      We estimated decay constant λ and half-life (t<sub>1/2</sub>) by the following equations:

      $$C_{i(t)} = C_{i(t=0)}\, e^{-\lambda t}$$

      $$t_{1/2} = \frac{\ln 2}{\lambda}$$

      where C<sub>i(t)</sub> and C<sub>i(t=0)</sub> are read count values of the ith replicate at time points 𝑡 and 0 (see also Methods). The absolute abundance was not required for the half-life calculation.
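      The calculation described above can be illustrated with a small example. This is only a sketch under stated assumptions: it assumes first-order decay, fits ln(read count) against time by ordinary least squares, and uses synthetic counts; the function name `fit_decay` and the numbers are hypothetical and do not reproduce the authors' pipeline.

```python
import math


def fit_decay(times, counts):
    """Least-squares fit of ln(count) against time.

    Returns (decay constant lambda, half-life). Assumes first-order decay,
    so the slope of ln C(t) versus t equals -lambda.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    lam = -sxy / sxx
    return lam, math.log(2) / lam


# Synthetic counts for one sequence with a 90-minute half-life (illustrative only).
times = [0, 30, 75, 120]
true_lam = math.log(2) / 90
counts = [1000 * math.exp(-true_lam * t) for t in times]

lam, half_life = fit_decay(times, counts)
print(round(half_life, 1))

# Multiplying every count of this sequence by the same factor shifts ln C(t)
# by a constant and leaves the slope, and hence the half-life, unchanged.
_, half_life_scaled = fit_decay(times, [0.01 * c for c in counts])
```

      The last two lines show only that a constant rescaling of one sequence's counts does not change its fitted slope; they do not address normalization between time points.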

      Fig. S1A and B are used to assess reproducibility. They show that read counts at a given time point correlate well across replicate experiments. However, this is not a good way to assess how reproducible or accurate the measurements of t1/2 are. (The major source of variability in read counts in these plots - especially at early time points - is likely the starting abundance of each RNA sequence, not stability.) This creates concerns about how well the method is measuring t1/2. Also creating concern is the observation that many RNAs are associated with half-lives that are much longer than the time points analyzed in the study. For example, if Figure S1 and Table S1 are read correctly, the median t1/2 for the 5' UTR library in HEK cells appears to be >700 minutes. Given that RNA was collected at 30, 75, and 120 minutes, accurate measurements of RNAs with such long half-lives would seem to be very difficult.

      We estimated the half-life based on the following equations:

      $$C_{i(t)} = C_{i(t=0)}\, e^{-\lambda t}$$

      $$t_{1/2} = \frac{\ln 2}{\lambda}$$

      where C<sub>i(t)</sub> and C<sub>i(t=0)</sub> are read count values of the ith replicate at time points 𝑡 and 0 (see also Methods). The calculation of the half-life involves first determining the decay constant 𝜆, which represents a constant rate of decay. Since 𝜆 is a constant, it is possible to accurately calculate it without needing data over the entire decay range. Our experimental design considers this by selecting appropriate time points to ensure a reliable estimation of 𝜆, and thus, the half-life. To determine the most suitable time points, we conducted preliminary experiments using RT-PCR.

      These experiments indicated that 30, 75, and 120 minutes provided an effective range for capturing the decay dynamics of the transcripts.

      There is no direct comparison of t1/2 between the two cell types studied for the full set of sequences studied. This would be helpful in understanding whether the regulatory effects of UTRs are generally similar across cell lines (as has been shown in some previous studies) or whether there are fundamental differences. The distribution of t1/2's is clearly quite different in the two cell lines, but it is important to know if this reflects generally slow RNA turnover in HEK cells or whether there are a large number of sequence-specific effects on stability between cell lines. A related issue is that it is not clear whether the relatively small number of significant variant effects detected in HEK cells versus SH-SY5Y cells is attributable to real biological differences between cell types or to technical issues (many fewer read counts and much longer half lives in HEK cells).

      For both cell lines, we selected oligonucleotides with R<sup>2</sup> > 0.5 and mean squared error (MSE) < 1 for analysis when estimating the decay constant (λ), and hence the half-life, by linear regression. This selection criterion was implemented to minimize the effect of experimental noise. After quality control, we selected common UTRs and compared the RNA half-lives of the two cell lines using a scatter plot. Author response image 2 shows that RNA half-lives are quite different between the cell lines, with a moderate similarity observed in the 5' UTRs (R = 0.21), while the correlation in the 3' UTRs is non-significant.
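      The quality filter described above (R<sup>2</sup> > 0.5 and MSE < 1) can be sketched as follows. The thresholds come from the text; everything else is an assumption made for illustration: the fit is taken to be ordinary least squares of ln(read count) on time, MSE is taken as the residual sum of squares divided by the number of time points, and the function names and example counts are hypothetical.

```python
import math


def fit_with_quality(times, counts):
    """OLS fit of ln(count) on time; returns (lambda, R^2, MSE).

    Assumptions: residuals are measured on the ln scale and
    MSE = residual sum of squares / number of time points.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(times, logs))
    ss_tot = sum((y - y_mean) ** 2 for y in logs)
    return -slope, 1 - ss_res / ss_tot, ss_res / n


def passes_qc(r_squared, mse, r2_min=0.5, mse_max=1.0):
    """Apply the thresholds quoted in the response."""
    return r_squared > r2_min and mse < mse_max


times = [0, 30, 75, 120]
clean = [1000, 800, 560, 400]    # close to a single exponential
noisy = [1000, 300, 1200, 500]   # no consistent decay trend

_, r2_clean, mse_clean = fit_with_quality(times, clean)
_, r2_noisy, mse_noisy = fit_with_quality(times, noisy)
print(passes_qc(r2_clean, mse_clean), passes_qc(r2_noisy, mse_noisy))
```

      In this toy example the first series is retained and the second is discarded because its fit explains almost none of the variance.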

      Author response image 2.

      Despite the low correlation of mRNA half-life between the two cell lines, UA-dinucleotide and UA-rich sequences consistently emerge as the most significant destabilizing features, suggesting a shared regulatory mechanism across diverse cellular environments.

      The general assertion is made in many places that TA dinucleotides are the most prominent destabilizing element in UTRs (e.g., in the title, the abstract, Fig. 4 legend, and on p. 12). This appears to be true for only one of the two cell lines tested based on Fig. 3.

      UA-dinucleotides and other UA-rich sequences exhibit similar effects on RNA stability, as illustrated in Fig. S5A-C. In both cell lines, UA-dinucleotide and WWWWWW sequences were representatives of the same stability-affecting cluster. While the impact of UA-dinucleotides can be generalized, we have rephrased some statements to avoid any potential misunderstanding. For example:

      Abstract: “...We found that UA dinucleotides and UA-rich motifs are the most prominent destabilizing element.”

      p.10: “UA dinucleotides and UA-rich motifs are the most common and effective RNA destabilizing factor” 

      Figure 4: “The UTR UA dinucleotides and UA-rich motifs are the most common and influential RNA destabilizing factor.”

      Appraisal and impact:

      The work adds to existing studies that previously identified sequence features, including AREs and other RNA binding protein motifs, that regulate stability and puts a new emphasis on the role of "TA" (better "UA") dinucleotides. It is not clear how potential problems with the RNA stability measurements discussed above might influence the overall conclusions, which may limit the impact unless these can be addressed.

      It is difficult to understand whether the importance of TA dinucleotides is best explained by their occurrence in a related set of longer RBP binding motifs (see Fig 5J, these motifs may be encompassed by the "WWWWWW cluster") or whether some other explanation applies. Further discussion of this would be helpful. Does the LASSO method tend to collapse a more diverse set of longer motifs that are each relatively rare compared to the dinucleotide? It remains unclear whether TA dinucleotides are associated with less stability independent of the presence of the known larger WWWWWWW motif. As noted above, the importance of TA dinucleotides in the HEK experiments appears to be less than is implied in the text.

      To ensure the representativeness of the features entered into the LASSO model, we pre-selected those with an occurrence greater than 10% among all UTRs. There is no evidence to support a preference for dinucleotides by LASSO. To address whether the destabilizing effect of UA dinucleotides is part of the broader WWWWWW motif, we divided UA dinucleotides into two groups: those within the WWWWWW motif and those outside of it. Specifically, we divided UTRs into two categories: 'at least one UA within a WWWWWW motif' and 'no UA within a WWWWWW motif,' and visualized the results using a boxplot. As shown in Author response image 3, the destabilizing trend still remains for UA dinucleotides outside of the WWWWWW motif, although the effect appears to be more pronounced when UA is within the WWWWWW motif. This suggests that while UA dinucleotides have a destabilizing effect independently, their impact is amplified when they are part of the broader WWWWWW motif.
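      The grouping described above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: W is taken as the IUPAC code for A or U, a UA dinucleotide is counted as "within a WWWWWW motif" when it lies inside a run of at least six consecutive W bases, and the function names are hypothetical.

```python
import re

# Maximal runs of W (A or U) long enough to contain at least one WWWWWW window.
W_RUN = re.compile(r"[AU]{6,}")


def count_ua_by_context(utr: str):
    """Return (UA count inside W runs of length >= 6, UA count elsewhere)."""
    seq = utr.upper().replace("T", "U")  # accept DNA-style input
    total = seq.count("UA")
    inside = sum(match.group().count("UA") for match in W_RUN.finditer(seq))
    return inside, total - inside


def category(utr: str) -> str:
    """Assign a UTR to one of the two groups used in the boxplot."""
    inside, _ = count_ua_by_context(utr)
    if inside:
        return "at least one UA within a WWWWWW motif"
    return "no UA within a WWWWWW motif"


for utr in ("GCUAUAUAAGC", "GCUAGGCUAGC", "UAGGAUUUUAAC"):
    print(utr, count_ua_by_context(utr), category(utr))
```

      Because both U and A are W bases, a UA dinucleotide that touches a qualifying run always lies entirely inside it, so counting within each maximal run is sufficient.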

      Author response image 3.

      The inclusion of more than a single cell type is an acknowledgement of the importance of evaluating cell type-specific effects. The work suggests a number of cell type-specific differences, but due to technical issues (especially with the HEK data, as outlined above) and the use of only two cell lines, it is difficult to understand cell type effects from the work.

      The inclusion of both 3' and 5' UTR sequences distinguishes this work from most prior studies in the field. Contrasting the effects of these regions on stability is of interest, although the role of these UTRs (especially the 5' UTR) in translational regulation is not assessed here.

      We examined the role of UTR and UTR variants in translation regulation using polysome profiling. By both univariate analysis and an elastic regression model, we identified motifs of short repeated sequences, including SRSF2 binding sites, as mutation hotspots that lead to aberrant translation. Furthermore, these polysome-shifting mutations had a considerable impact on RNA secondary structures, particularly in upstream AUG-containing 5’ UTRs. Integrating these features, our model achieved high accuracy (AUROC > 0.8) in predicting polysome-shifting mutations in the test dataset. Additionally, metagene analysis indicated that pathogenic variants were enriched at the upstream open reading frame (uORF) translation start site, suggesting changes in uORF usage underlie the translation deficiencies caused by these mutations. Illustrating this, we demonstrated that a pathogenic mutation in the IRF6 5’ UTR suppresses translation of the primary open reading frame by creating a uORF. Remarkably, site-directed ADAR editing of the mutant mRNA rescued this translation deficiency. Because the regulation of translation and stability does not converge, we illustrate these two mechanisms in two separate manuscripts (this one and doi.org/10.1101/2024.04.11.589132).

      Reviewer #3 (Public Review):

      Summary:

      In their manuscript titled "Multiplexed Assays of Human Disease‐relevant Mutations Reveal UTR Dinucleotide Composition as a Major Determinant of RNA Stability" the authors aim to investigate the effect of sequence variations in 3'UTR and 5'UTRs on the stability of mRNAs in two different human cell lines.

      To do so, the authors use a massively parallel reporter assay (MPRA). They transfect cells with a set of mRNA reporters that contain sequence variants in their 3' or 5' UTRs, which were previously reported in human diseases. They follow their clearance from cells over time relative to the matching non-variant sequence. To analyze their results, they define a set of factors (RBP and miRNA binding sites, sequence features, secondary structure etc.) and test their association with differences in mRNA stability. For features with a significant association, they use clustering to select a subset of factors for LASSO regression and identify factors that affect mRNA stability.

      They conclude that the TA dinucleotide content of UTRs is the strongest destabilizing sequence feature. Within that context, elevated GC content and protein binding can protect susceptible mRNAs from degradation. They also show that TA dinucleotide content of UTRs affects native mRNA stability, and that it is associated with specific functional groups. Finally, they link disease associated sequence variants with differences in mRNA stability of reporters.

      Strengths:

      This work introduces a different MPRA approach to analyze the effect of genetic variants. While previous works in tissue culture use DNA transfections that require normalization for transcription efficiency, here the mRNA is directly introduced into cells at fixed amounts, allowing a more direct view of the mRNA regulation.

      The authors also introduce a unique analysis approach, which takes into account multiple factors that might affect mRNA stability. This approach allows them to identify general sequence features that affect mRNA stability beyond specific genetic variants, and to reach important insights into mRNA stability regulation. Indeed, while the conclusions on the genetic variants identified in this work are interesting, the main strength of the work involves the general effects of sequence features rather than specific variants.

      The authors provide adequate support for their claims, and validate their analysis using both their reporter data and native genes. For the main feature identified, TA di-nucleotides, they perform follow-up experiments with modified reporters that further strengthen their claims, and also validate the effect on native cellular transcripts (beyond reporters), demonstrating its validity also within native scenarios.

      The work provides a broad analysis of mRNA stability, across two mRNA regulatory segments (3'UTR and 5'UTR) and is performed in two separate cell-types. Comparison between two different cell-types is adequate, and the results demonstrate, as expected, the dependence of mRNA stability on the cellular context. Analysis of 3'UTR and 5'UTR regulatory effects also shows interesting differences and similarities between these two regulatory regions.

      Weaknesses:

      (1) The authors fail to acknowledge several possible confounding factors of their MPRA approach in the discussion.

      First, while transfection of mRNA directly into cells allows to avoid the need to normalize for differences in transcription, the introduction of naked mRNA molecules is different than native cellular mRNAs and could introduce biases due to differences in mRNA modifications, protein associations etc. that may occur co-transcriptionally.

      Second, along those lines, the authors also use in-vitro polyadenylation. The length of the polyA tail of the transfected transcripts could potentially be very different than that of native mRNAs and also affect stability.

      The transcripts used in our study were polyadenylated in vitro with approximately 100 nucleotides (Fig. S1C), similar to the polyA tail lengths typically observed in vivo (dx.doi.org/10.1016/j.molcel.2014.02.007). Additionally, these transcripts were capped to emulate essential mRNA characteristics and to minimize immune responses in recipient cells. This design allows us to study RNA decay for in vitro-synthesized RNA delivered into human cells, akin to RNA vaccines, but it does not necessarily extend to endogenous RNAs. As mentioned, endogenous RNAs undergo nuclear processing and are decorated by numerous trans factors, resulting in distinct regulatory mechanisms. We have therefore provided further discussion of these differences and their implications in the revised manuscript: “However, while our approach effectively assesses the stability of synthesized RNA in human cells, it may not fully capture the decay dynamics of nuclear-synthesized RNA, which can be influenced by endogenous modifications and trans-acting RNA binding factors.” (p. 18)

      (2) The analysis approach used in this work for identifying regulatory features in UTRs was not previously used. As such, lack of in-depth details of the methodology, and possibly also more general validation of the approach, is a drawback in convincing the reader in the validity of this approach and its results.

      In particular, a main point that is not addressed is how the authors decide on the set of "factors" used in their analysis? As choosing different sets of factors might affect the results of the analysis. 

      In our study, we used the Variance Inflation Factor (VIF) as a basis for selecting variables. This well-established method is widely used to detect variables with high collinearity, thus ensuring the robustness and reliability of our analysis. By identifying and excluding highly collinear variables, we aimed to minimize multicollinearity and improve the accuracy of our regression models. For more detailed information on the use of VIF in regression analysis, please refer to Akinwande, M., Dikko, H., and Samson, A. (2015). Variance Inflation Factor: As a Condition for the Inclusion of Suppressor Variable(s) in Regression Analysis. Open Journal of Statistics, 5, 754-767. doi: 10.4236/ojs.2015.57075. We have included the method details in the revised manuscript (p. 28): “… to avoid multicollinearity caused by similar features that perturb feature selection, all features were clustered using single-linkage hierarchical clustering with the distance metric defined as one minus the absolute value of the Spearman correlation coefficient. We cut the tree at a specific height, and the feature that had the greatest influence on RNA stability, which was examined using a simple linear regression model, was selected to be the representative of each cluster. Then we calculated the variance inflation factor (VIF) value of the representative features. The VIFs were obtained by the following linear model and equations:

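      $$\hat{x}_{ij} = \alpha_{0} + \sum_{k \neq j} \alpha_{k}\, x_{ik}$$

      $$R_j^2 = 1 - \frac{\sum_i (x_{ij} - \hat{x}_{ij})^2}{\sum_i (x_{ij} - \bar{x}_j)^2}, \qquad \mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

      where $\hat{x}_{ij}$ and $x_{ik}$ are the estimated value of the jth feature and the value of the kth feature of the ith UTR (note that the kth feature is a feature other than the jth feature), $\alpha_0$ and $\alpha_k$ are the intercept and the regression coefficients of the linear model that regressed the jth feature on the other remaining features, and $\bar{x}_j$ is the mean level of the jth feature of all UTRs.”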
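      The VIF calculation described above (regress each feature on the remaining features, then take 1 / (1 − R²)) can be sketched with the standard library alone. The helper names `solve` and `vif`, the normal-equation solver, and the toy feature table are all assumptions made for illustration; they are not the authors' implementation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def vif(features, j):
    """VIF of feature j; `features` holds one row of feature values per UTR."""
    y = [row[j] for row in features]
    # Design matrix: intercept plus every feature except the jth.
    x = [[1.0] + [v for k, v in enumerate(row) if k != j] for row in features]
    p = len(x[0])
    xtx = [[sum(r[a] * r[b] for r in x) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(x, y)) for a in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in x]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return 1 / (1 - r_squared)


# Toy table: two features whose Pearson correlation is 0.8.
correlated = [[1, 1], [2, 3], [3, 2], [4, 4]]
# Toy table: two uncorrelated features.
orthogonal = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
print(round(vif(correlated, 0), 3), round(vif(orthogonal, 0), 3))
```

      With two features the VIF reduces to 1 / (1 − r²), where r is their Pearson correlation, which gives a quick sanity check on the toy tables. Perfectly collinear features make R² equal to 1 and would need to be handled separately.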

      For example, the choice to use 7-mer sequences within the factors set is not explained, particularly when almost all motifs that are eventually identified (Figure 3B-E) are shorter.

      The known RBP motifs are primarily 6-mer. To explore the possibility of discovering novel motifs that could significantly impact our model, we started with 7-mer sequences. However, our analysis revealed that including these additional variables did not improve the explanatory power of the model; instead, it reduced it. Consequently, our final model focuses on motifs shorter than 7-mer. We explained the motif selections in the revised manuscript (p. 9): “Given our discovery that the effect of AREs is heavily dependent on sequence content, we decided to further explore the effects of other sequence elements, i.e., beyond known regulatory motifs, in more detail. Since most reported RBP motifs are 6-mers, we initiated a search for novel motifs by analyzing the presence of all 7-mers in our massively parallel reporter assay (MPRA) library, correlating their occurrence with mRNA half-life.”

      In addition, the authors do not perform validations to demonstrate the validity of their approach on simulated data or well-established control datasets. Such analysis would be helpful to further convince the reader in the usefulness and robustness of the analysis.

      We acknowledge the importance of validating our approach on simulated data or well-established control datasets to demonstrate its robustness and reliability. However, to the best of our knowledge, there are currently no well-established control datasets available that perfectly correspond to our specific study context. Despite this, we will continue to search for any relevant datasets that could be utilized for this purpose in future work. This effort will help to further reinforce the confidence in our methodology and its findings.

      (3) The analysis and regression models built in this work are not thoroughly investigated relative to native genes within cells. The effect of sequence "factors" on native cellular transcripts' stability is not investigated beyond TA di-nucleotides, and it is unclear to what degree do other predicted factors also affect native transcripts.

      Our system studies the stability control of RNA synthesized in vitro and delivered into human cells. While we validated the UTR UA-dinucleotide effect in vivo, we did not intend to conclude that this is the most influential regulation for endogenous RNAs. It is known that endogenous RNAs undergo very different regulation. The most prominent factors controlling endogenous RNA stability are the density of splice junctions and the length of UTRs (doi.org/10.1186/s13059-022-02811-x; doi.org/10.1186/s12915-021-00949-x). To decipher the sequence regulation, we controlled for these factors in our experiments. Therefore, we acknowledge that several endogenous features, which were excluded by our approach, may serve as predictive features of RNA half-life in vivo. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Specific comments:  

      Some references are missing, e.g., for the sentence:

      Please see the response below.

      "Similarly, point mutation of the GFPT1 3' UTR results in congenital myasthenic syndrome." (p5)

      The reference has been added to the text:

      Dusl, M., Senderek, J., Muller, J. S., Vogel, J. G., Pertl, A., Stucka, R., Lochmuller, H., David, R., & Abicht, A. (2015). A 3'-UTR mutation creates a microRNA target site in the GFPT1 gene of patients with congenital myasthenic syndrome. Human Molecular Genetics, 24(12), 3418–3426. https://doi.org/10.1093/hmg/ddv090

      "...but there have been no systematic assessments of the explicit effects of variants of both UTRs on stability regulation." (not true in the current phrasing; e.g. PMIDs 32719458, 36156153, 34849835)

      These references have been added to the text. However, we have to point out that these studies do not focus on the effects of the disease-relevant variants. To clarify, we modified the sentence to "... systematic assessments of the explicit effects of disease-relevant variants in both UTRs on stability regulation are still absent."

      "Multiple approaches have revealed AREs as exerting a destabilizing effect on RNA stability (Barreau et al., 2005)." (p8)

      The reference has been added to the text:

      Barreau, C., Paillard, L., & Osborne, H. B. (2005). AU-rich elements and associated factors: are there unifying principles? Nucleic Acids Research, 33(22), 7138-7150. https://doi.org/10.1093/nar/gki1012 

      "This effect is specific, as such ratios in the coding region are inconsequential." (p12)

      This refers to our findings of Fig. 4G and Supplemental Fig. S5F.

      What are the sequences at the 5' and 3'UTR without insertion of a library? 5'UTR library (especially in SH) has much longer half-life compared to 3'utr library (Fig S1D).

      The 3’ UTR library carries no designed 5’ UTR, only the Kozak sequence derived from the pEGFPC1 vector. This may partially underlie the shorter half-life of the 3’ UTR library.

      Fig2A: What are the units? "half-life (log)" Do the numbers correspond to log10(min)?

      It represents ln (min). To clarify, we now use ‘ln t<sub>1/2</sub> (min)’ in all figures.

      Fig 2 and 3: This was done only on the wild-type sequences? Or all tested sequences together, wt and mut?

      It was done only on the wild-type sequences. To clarify, we modified the text to “we examined the effect of AREs on RNA stability of the ref alleles according to specific sequence content….(p.8)” and “We considered as many factors as possible to explain the half-life of our ref UTR libraries,…. (p.9)”. ‘ref’ stands for reference.

      "Furthermore, to avoid collinearity confounding our model, e.g., the effects of very similar factors (such as 'AA' and 'AAA' sequences), we clustered the factors according to their properties, and then only one representative factor from within a cluster (i.e., the one with the highest correlation to half-life within a cluster) was subjected to LASSO regression": Given the observed context dependence, e.g. in the case of poly-U stretches: Isn't this clustering leading to similar/identical motifs with different context being grouped together (such as polyU preceded by an A (strongly destabilizing, according to Fig 2B) or followed by one (strongly stabilizing, according to Fig 2B)), resulting in ignoring the context or using one potential outcome while a motif from the same cluster can have the opposite effect?

      Thank you very much for pointing this out. To determine whether considering different contextual effects within each feature cluster would enhance model performance, we modified our feature selection by choosing both the feature with the largest positive effect and the feature with the largest negative effect on RNA half-life in Step III of Figure 3A. We then split the data into a 2:1 training and testing set and repeated this process 100 times. Model performance was evaluated using mean absolute error (MAE), root mean squared error (RMSE), and adjusted R-squared. From Author response image 4, we observed no significant improvement in model performance using this new approach. Notably, in the SH-SY5Y 5' UTR model, our original method even outperformed the modified one, with statistically lower MAE and RMSE and a higher adjusted R-squared. Therefore, we believe our current approach remains appropriate.

      Author response image 4.

      "Overall, motifs that are at least two nucleotides long proved critical for RNA stability, supporting the sequence specificity of the decay process." Unclear why this supports the "sequence specificity"

      No monomer was selected as an explanatory factor. Instead, specific sequence combinations and their order are important for the regulation. These findings suggest sequence-specific recognition in the decay process.

      Fig3: The same features were used in both cell lines? If yes: Since they were selected for their highest correlation with half-life, how was a common set chosen? If no: problematic to compare.

      Thank you for your question regarding feature selection across cell lines. Initially, the features were collected uniformly for both cell lines. However, subsequent feature selection steps were cell-type specific, focusing on identifying features with the greatest impact on RNA half-life in each context. This approach allows us to still compare model performance and discuss the similarities and differences in selected features across cell types. By maintaining a consistent starting point, we ensure that any observed differences reflect cell-specific regulatory dynamics.

      uORFs were not used as features?

      Thank you for pointing this out. At the beginning of our study, we investigated the impact of Kozak sequence strength (categorized as weak, moderate, strong, or optimal) on RNA half-life. However, we found that this feature performed poorly in predicting RNA stability, and as a result, we decided not to include upstream open reading frames (uORFs) or Kozak sequences in our subsequent analyses.

      Experimental reproducibility: Only correlations between replicates for the same time point is shown, but no comparison between time points or between decay rates. How reproducible were the paired differences between mut/wt?

      The decay rate was calculated by modeling the slope of a linear regression of all time points. Therefore, there is only one decay rate associated with a genotype. To rule out inconsistent data, we excluded any regression with a mean square error greater than 1, as this indicates a poor fit of the data points. 
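
      The fitting and filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes first-order decay, that the regression is run on the natural log of normalized abundance against time in minutes, and that the mean square error is computed on that log scale over all time points. The function name `fit_decay` and the parameter `max_mse` are hypothetical.

```python
import math

def fit_decay(times_min, abundances, max_mse=1.0):
    """Fit ln(abundance) = intercept - k * t by ordinary least squares.

    Returns (k, half_life_min, mse), or None when the fit is rejected
    because its mean square error exceeds max_mse (assumed threshold use).
    """
    ys = [math.log(a) for a in abundances]
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_min, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # Mean square error of the residuals on the log scale (assumption).
    mse = sum((y - (intercept + slope * t)) ** 2
              for t, y in zip(times_min, ys)) / n
    if mse > max_mse:
        return None
    k = -slope  # decay rate per minute
    half_life = math.log(2) / k if k > 0 else float("inf")
    return k, half_life, mse
```

      Under these assumptions, each genotype yields a single decay rate, and ln(half_life) corresponds to the ‘ln t<sub>1/2</sub> (min)’ scale used in the figures.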

      Fig 7C/p17: This does not establish a "causal relationship" as the authors claim.

      We agree with the reviewer’s suggestion. We have modified the text on p.17 to “to establish a correlation between UTR variants and health outcomes,…..”

      In the discussion, the authors claim that TA-diNTs are not only an opposite of the GC percentage and base this on Fig 5A.

      Fig 5A: The range of TA-diNTs is naturally much higher in the low GC group. To make the high and low GC content comparable (as the authors aim to do), the correlation should be assessed for the same range of TA dint in both cases.

      To address this concern more rigorously, we performed a stratified analysis based on UA-diNT rate. As shown in our Fig. S7C, even after stratifying by UA-dinucleotide ratio (upper panel: high UA-dinucleotide ratio; lower panel: low UA-dinucleotide ratio), we still observe that the destabilizing effect of UA is stronger in the low GC content group.

      Supplemental Figure S7. Interplay of GC content and TA dinucleotide on stability regulation, related to Figure 5. (C) Stratifications of both TA dinucleotide ratio and GC content showed that the destabilizing effect of TA dinucleotide is the most prominent under conditions of low TA dinucleotide ratio and low GC content. The same trend was observed for 5’ UTR (left) and 3’ UTR (right).

      The injection of in vitro transcribed and polyA/capped RNA certainly has advantages over other methods, but delivering naked mRNA without nuclear history might also lead to artifacts. The caveats of the approach should be discussed more extensively.

      We appreciate the suggestion and have hence added the following in the Discussion (p.18): “However, while our approach effectively assesses the stability of synthesized RNA in human cells, it may not fully capture the decay dynamics of nuclear-synthesized RNA, which can be influenced by endogenous modifications and trans-acting RNA binding factors.”

      "We unexpectedly identified many crucial regulatory features in 5' UTRs." Why was this unexpected?

      We initially thought the 3’ UTR would play a major role in stability regulation. To avoid confusion, we have removed the word ‘unexpected’ from the text (p. 20): "We identified many crucial regulatory features in 5' UTRs."

      "...a massively parallel reporter assay in which coding regions and human 5'/3' UTRs with disease-relevant mutations were generated in vitro and then directly transfected into human cell lines to assess their decay patterns by next‐generation sequencing": also coding regions?

      Thanks for the question. Indeed, the coding region was not synthesized together with the UTR library. Therefore, we modified the text of p. 6 to “…we developed a massively parallel reporter assay in which human 5’/3’ UTRs with disease-relevant mutations were generated in vitro, ligated with the enhanced green fluorescence protein (EGFP) coding region, and then directly transfected into human cell lines to assess their decay patterns by next-generation sequencing.”

      Reviewer #2 (Recommendations For The Authors):

      Nomenclature: When discussing RNA sequences, "U" should be used in place of "T" (e.g., "UA dinucleotide").

      We have replaced “T” with “U” in RNA sequences throughout the text and figures.

      Abstract: "We examined the RNA degradation patterns mediated by the UTR library in multiple cell lines" - It would be clearer to state that two cell lines (rather than multiple) were used.

      We appreciate the suggestion. We have modified the abstract as suggested: “We examined the RNA degradation patterns mediated by the UTR library in two cell lines…"

      The manuscript refers to "wild-type (WT) and mutant (mt) alleles." (p. 7 and elsewhere). It would be better to use "reference" instead of "wild type" given that these are human populations.

      We appreciate the suggestion. All instances of ‘wild-type’ or ‘WT’ in the text and figures have been replaced with ‘reference’ or ‘ref’.

      In the introduction, it is stated that traditional MPRAs "cannot differentiate the effect of the UTRs on transcription, stability and, in some cases, even protein production, greatly limiting scientific interpretation." This is confusing, since these assays can and have been used in association with both RNA decay measurements and measurements of reporter protein levels that allow assessment of effects on stability and protein production (including in the cited references).

      We reason that the RNA steady-state level (e.g., sequencing the overall RNA normalized to DNA) or protein steady-state level (e.g., detecting the fluorescence signal) does not precisely reveal the decay kinetics of the RNA. Steady-state level is a result of production and decay, both of which UTRs contribute to. Similarly, the protein level is not a perfect estimate of the RNA decay.

      To clarify, we have modified the introduction (p. 5) to “Nevertheless, because the steady-state level is a result of production and decay, these approaches cannot differentiate the effect of the UTRs on transcription, stability and, in some cases, even protein production, greatly limiting scientific interpretation.” 

      Adding raw and normalized read count data from individual experiments (e.g., to Table S1) would make it more likely for others to use this dataset to address additional questions.

      All raw and processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE217518 (reviewer token snspaakujtsdpcv).

      The manuscript would benefit from further clarification about model selection. Additional details regarding how the features were clustered, and the actual clusters themselves should be included.

      It should be discussed why Lasso was chosen vs Ridge or Elastic Net, in the context of handling multicollinearity. Often, data is subsetted for training and validation, and model performance metrics are presented.

      Thank you for pointing out the need for further clarification on model selection. The features were clustered using single-linkage hierarchical clustering with the distance metric defined as one minus the absolute value of the Spearman correlation coefficient (this information has been added to the manuscript on p. 28: “…to avoid multicollinearity caused by similar features that perturb feature selection, all features were clustered using single-linkage hierarchical clustering with the distance metric defined as one minus the absolute value of the Spearman correlation coefficient.”). The resulting feature clusters are available in Supplemental Table S3. 
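
      The clustering described above can be sketched in plain Python. This is an illustrative sketch only: the distance (one minus the absolute Spearman correlation) and the single-linkage criterion come from the text, whereas the cut-off `max_distance`, the function names, and the use of the absolute correlation when picking a representative are assumptions. Cutting a single-linkage tree at a fixed distance is equivalent to taking connected components of the graph that links every feature pair within that distance, which is what the union-find below does.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    # Note: raises ZeroDivisionError for a constant feature.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def single_linkage_clusters(features, max_distance):
    """features: dict of name -> list of values (one value per UTR)."""
    names = list(features)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = 1 - abs(spearman(features[a], features[b]))
            if distance <= max_distance:  # hypothetical cut-off
                parent[find(a)] = find(b)
    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(c) for c in clusters.values())

def pick_representatives(clusters, features, half_life):
    """One feature per cluster: strongest |Spearman| with half-life (assumed)."""
    return [max(c, key=lambda n: abs(spearman(features[n], half_life)))
            for c in clusters]
```

      Only the representatives returned by the last step would then be passed to the regression, which keeps near-duplicate features such as 'AA' and 'AAA' from competing during selection.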

      Regarding model selection, we chose LASSO over ridge and elastic net primarily for feature selection, as ridge does not perform feature selection. Elastic net is essentially a hybrid of ridge regression and LASSO regularization, but we opted for LASSO for its simplicity and effectiveness in selecting a sparse set of important features.

      We also performed a 2:1 training and testing set analysis and have included these details in the manuscript. Model performance metrics, including correlation coefficient between observed and predicted values in the testing set, mean absolute error (MAE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and R-squared, are provided in new Supplemental Table S4.
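
      For readers who want to reproduce this kind of evaluation, the split and the error metrics named above can be sketched as follows. The helper names and the seeded shuffle are assumptions made for illustration; the metric definitions are the standard ones, with MAPE expressed in percent and R-squared computed as one minus the ratio of residual to total sum of squares.

```python
import math
import random

def split_two_to_one(items, seed):
    """Shuffle and split into roughly 2/3 training and 1/3 testing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

def regression_metrics(observed, predicted):
    """MAE, RMSE, MAPE (percent) and R-squared on a held-out set."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        # MAPE is undefined when an observed value is zero.
        "MAPE": 100 * sum(abs(e / o) for e, o in zip(errors, observed)) / n,
        "R2": 1 - ss_res / ss_tot,
    }
```

      Repeating the split with different seeds and averaging the metrics gives the kind of repeated hold-out comparison mentioned in the response to Reviewer #1.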

      Recommend reviewing and correcting verb tenses in the methods section.

      We appreciate the reviewer’s suggestion. We have corrected verb tenses in the methods section, which includes “The UTRs were defined by NCBI RefSeq and ENCODE V27. (p.21)”, “The variant was placed in the middle of the sequence….(p.22)”, and “eCLIP signals with value < 1 or p value > 0.05 were removed. (p.26)”

      Please add information about which cell type(s) are being used in each of the figure legends (e.g., in Figs. 2B and 5).

      We appreciate the reviewer’s suggestion. We have added the cell type information in the figure legends: “Figure 2…. (B) The ten most influential AREs in terms of RNA stability in SH-SY5Y cells.” And “Figure 5…..(A) MPRA data of SH-SY5Y cells stratified according to the GC content (GC%) of UTRs.”

      Recommend review of axis labels and consistency in formatting the log(half-lives) and including the base of the log and the time unit (minutes). Even better, converting axis labels from log minutes to minutes would make this easier to understand.

      Thank you for the suggestion regarding axis labels and consistency. We have unified the half-life label to ‘ln t<sub>1/2</sub> (min)’ in all figures. We chose not to convert the axis from logarithmic minutes to minutes because the original scale is highly skewed, which would hinder clear data visualization.

      The discussion refers to Figure 1D but Figure 1 only has A-C

      Thank you for pointing out this mistake. ‘Fig. 1D’ has been changed to ‘Fig. 1B’ in the text (p. 7 and p. 20).

      The analyses in Fig. 2 are interpreted as demonstrating that AREs destabilize RNAs. These analyses are examining associations, so it would be more appropriate to say that AREs are associated with destabilization (since it is formally possible that other sequences that are present in these UTR fragment cause destabilization). A similar issue arises on p. 10: "TA dinucleotides alone can negatively regulate RNA stability, with a Pearson's correlation coefficient of ‐0.287 for 5' UTRs and ‐0.377 for 3' UTRs (Fig. 4A,C)." This is an association and does not establish causation. Again on p. 17: "We identified several SNPs in UTRs that induce aberrant RNA expression and/or protein expression (Supplemental Table S7)." These may be causal but may simply be in LD with other variants that are causal.

      We agree that the association observed is not proven to be causal. Therefore, we modified the text as suggested: 

      “AUUUA/AUUA-containing AREs are associated with RNA destabilization.” (p. 8)

      “UA dinucleotides alone present a negative correlation with RNA stability, with a Pearson’s correlation coefficient of -0.287 for 5’ UTRs and -0.377 for 3’ UTRs.”  (p.10)

      “We identified several SNPs in UTRs that correlated with aberrant RNA expression and/or protein expression.”  (p. 17)

      Figure 4C is important in that it examines whether variant sequences that differ in a manner that changes the number of dinucleotide repeats affect stability. Please show the number (not just the percentage) of sequences in each category.

      Thank you for your insightful comment. We believe the figure you referred to is Figure 4E. We have updated the figure to include the number of sequences in each category.

      Figure 6A and B: The horizontal axes appear to be misaligned since the dotted vertical lines do not cross at 0?

      The dotted vertical lines represent the genomic background of the UA-diNT ratio. To clarify it, we have modified the legend to: “Figure 6……(A) The top ten biological processes for which the 5’ UTR UA-dinucleotide ratio most significantly deviated from the genomic background (dashed line).”

      It may be helpful to state what the dashed and solid lines represent on Figure 6 E/F. Please correct spelling of "Biological" in 6E.

      As per the reviewer’s suggestions, we have modified the legend of Figure 6 to: “………..(E) Biological processes for RNAs in which the UA-dinucleotide ratios of both 5’ and 3’ UTRs are significantly different from the genomic background (dashed lines). (F) Molecular functions for RNAs in which the UA-dinucleotide ratios of both 5’ and 3’ UTRs are significantly different from the genomic background (dashed lines). The thin solid lines represent the standard deviation of the UA-dinucleotide ratio within the gene group.” 

      In addition, the spelling of “Biological” in Fig. 6E has been corrected.

      Reviewer #3 (Recommendations For The Authors):

      I have 3 points that I think could improve science and its presentation within the manuscript.

      (1) Most importantly, how well do LASSO regression models predict the stability of native transcripts? Such analysis can also be useful for comparison between two different cell-types. How well does the regression model learned (on reporters) within one cell-type predict mRNA stability (of reporters and native genes) in this cell-type and in the other cell-type? Similarly, models can also help to analyze the effects of 5'UTR and 3'UTR sequences on mRNA stability. In particular, how well does the regression model of each separate regulatory sequence (3'UTR or 5'UTR) is able to predict the stability of native genes in the cell? Can the predictions be improved by combining both 3'UTR and 5'UTR sequence features within the regression models?

      The decay model for native transcripts has been established in prior research (doi.org/10.1186/s13059-022-02811-x; doi.org/10.1186/s12915-021-00949-x), which indicates that exon junction density and transcript length are the primary determinants of RNA stability. Based on these findings, we designed the MPRA with fixed length and without splicing to focus on the contribution of primary sequences. We validated the destabilizing effect of UA dinucleotide on endogenous RNAs (Fig. 4G and Supplemental Fig. S5F) but do not recommend using our model to fully explain or predict the stability of native transcripts.

      To assess the model's cross-cell type predictive performance for RNA half-life, we employed the Regression Error Characteristic (REC) curve (Bi & Bennett, 2003). Similar to the receiver operating characteristic (ROC) curve, the REC curve illustrates the trade-off between error tolerance and accuracy, with better performance indicated by curves trending toward the upper left. We also computed the Area Over the Curve (AOC) as a performance metric, where lower values indicate better predictive ability. From Author response image 5, the REC curves reveal that cross-cell type prediction performance is suboptimal. The y-axis represents prediction accuracy, while the x-axis denotes error tolerance for the natural logarithm of RNA half-life (ln(𝑡<sub>1/2</sub>), in minutes).
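
      A REC curve and its area over the curve can be computed with a few lines of Python. The sketch below is illustrative: it treats the curve as a step function of the absolute error and integrates up to the largest observed error, which is an assumption because the tolerance range used for Author response image 5 is not stated. The function names are hypothetical.

```python
def rec_curve(observed, predicted):
    """Return (tolerance, accuracy) points of the REC step function.

    Accuracy at a tolerance is the fraction of samples whose absolute
    error does not exceed that tolerance.
    """
    abs_errors = sorted(abs(o - p) for o, p in zip(observed, predicted))
    n = len(abs_errors)
    points = [(0.0, 0.0)]
    for i, error in enumerate(abs_errors, start=1):
        points.append((error, i / n))
    return points

def area_over_curve(points):
    """Integrate (1 - accuracy) over tolerance, left-continuous steps."""
    area = 0.0
    for (t0, acc0), (t1, _) in zip(points, points[1:]):
        area += (t1 - t0) * (1.0 - acc0)
    return area
```

      With this step-function convention the AOC equals the mean absolute error, so a lower AOC for a model trained on one cell type and tested on the other would indicate better transfer on the ln(half-life) scale.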

      Author response image 5.

      In response to the suggestion of combining 5' and 3' UTR sequence features in the regression model, we believe this approach may not be ideal. As shown in Figure S1D, the distribution of RNA half-lives between 5' and 3' UTRs is significantly different, reflecting their distinct regulatory roles. Additionally, the base composition differs, with 5' UTRs having a higher GC content compared to 3' UTRs. Combining these datasets would likely make the origin of the sequence (5' or 3' UTR) the most predictive feature, thereby reducing the model's interpretability. Furthermore, our MPRA results, derived from separate 5’ or 3’ UTR library, do not support a combined model, further suggesting this approach may not be suitable with our data.

      The conclusions regarding genetic variants are interesting, yet the main strength of the work involves identifying general sequence features that affect mRNA stability rather than specific variants. I wonder if the authors have considered to shift the focus of the motivation part to reflect that?

      We appreciate the reviewer’s suggestion. We have revised the abstract and introduction to emphasize general UTR regulation. Here is the revised abstract:

      UTRs contain crucial regulatory elements for RNA stability, translation and localization, so their integrity is indispensable for gene expression. Approximately 3.7% of genetic variants associated with diseases occur in UTRs, yet a comprehensive understanding of UTR variant functions remains limited due to inefficient experimental and computational assessment methods. To systematically evaluate the effects of UTR variants on RNA stability, we established a massively parallel reporter assay on 6,555 UTR variants reported in human disease databases. We examined the RNA degradation patterns mediated by the UTR library in two cell lines, and then applied LASSO regression to model the influential regulators of RNA stability. We found that UA dinucleotides and UA-rich motifs are the most prominent destabilizing element. Gain of UA dinucleotide outlined mutant UTRs with reduced stability. Studies on endogenous transcripts indicate that high UA-dinucleotide ratios in UTRs promote RNA degradation. Conversely, elevated GC content and protein binding on UA dinucleotides protect high-UA RNA from degradation. Further analysis reveals polarized roles of UA-dinucleotide-binding proteins in RNA protection and degradation. Furthermore, the UA-dinucleotide ratio of both UTRs is a common characteristic of genes in innate immune response pathways, implying a coordinated stability regulation through UTRs at the transcriptomic level. We also demonstrate that stability-altering UTRs are associated with changes in biobank-based health indices, underscoring the importance of precise UTR regulation for wellness. Our study highlights the importance of RNA stability regulation through UTR primary sequences, paving the way for further exploration of their implications in gene networks and precision medicine.

      Plots presenting correlations (e.g., Figure 4A, 4C) are more informative when plotted as density plots (i.e., using colorscale to show density of the dots at each part of the plot).

      We greatly appreciate the reviewer's insightful suggestion regarding the use of density plots for presenting correlations. We have modified Figures 4A and 4C in the revised manuscript to implement density plotting. The updated figures now utilize a colorscale that highlights areas of high and low data density.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Qin et al. set out to investigate the role of mechanosensory feedback during swallowing and identify neural circuits that generate ingestion rhythms. They use Drosophila melanogaster swallowing as a model system, focusing their study on the neural mechanisms that control cibarium filling and emptying in vivo. They find that pump frequency is decreased in mutants of three mechanotransduction genes (nompC, piezo, and Tmc), and conclude that mechanosensation mainly contributes to the emptying phase of swallowing. Furthermore, they find that double mutants of nompC and Tmc have more pronounced cibarium pumping defects than either single mutants or Tmc/piezo double mutants. They discover that the expression patterns of nompC and Tmc overlap in two classes of neurons, md-C and md-L neurons. The dendrites of md-C neurons wrap the cibarium and project their axons to the subesophageal zone of the brain. Silencing neurons that express both nompC and Tmc leads to severe ingestion defects, with decreased cibarium emptying. Optogenetic activation of the same population of neurons inhibited filling of the cibarium and accelerated cibarium emptying. In the brain, the axons of nompC∩Tmc cell types respond during ingestion of sugar but do not respond when the entire fly head is passively exposed to sucrose. Finally, the authors show that nompC∩Tmc cell types arborize close to the dendrites of motor neurons that are required for swallowing, and that swallowing motor neurons respond to the activation of the entire Tmc-GAL4 pattern.

      Strengths:

      • The authors rigorously quantify ingestion behavior to convincingly demonstrate the importance of mechanosensory genes in the control of swallowing rhythms and cibarium filling and emptying

      • The authors demonstrate that a small population of neurons that express both nompC and Tmc oppositely regulate cibarium emptying and filling when inhibited or activated, respectively

      • They provide evidence that the action of multiple mechanotransduction genes may converge in common cell types

      Thank you for your insightful and detailed assessment of our work. Your constructive feedback will help to improve our manuscript.

      Weaknesses:

      • A major weakness of the paper is that the authors use reagents that are expressed in both md-C and md-L but describe the results as though only md-C is manipulated. Severing the labellum will not prevent optogenetic activation of md-L from triggering neural responses downstream of md-L. Optogenetic activation is strong enough to trigger action potentials in the remaining axons. Therefore, Qin et al. do not present convincing evidence that the defects they see in pumping can be specifically attributed to md-C.

      Thank you for your comments. This is an important point that we did not adequately address in the original preprint. We have obtained imaging and behavioral results that strongly suggest md-C, rather than md-L, are essential for swallowing behavior.

      36 hours after the ablation of the labellum, the signals of md-L were hardly observable when GFP expression was driven by the intersection between Tmc-GAL4 & nompC-QF (see Figure 3—figure supplement 1A). This observation indicates that the axons of md-L had likely degenerated by 36 hours and were unlikely to influence swallowing. Moreover, the projection pattern of Tmc-GAL4 & nompC-QF>>GFP exhibited no significant changes in the brain after labellum ablation.

      Furthermore, even 36 hours after labellum ablation, flies exhibited responses to light stimulation (see Figure 3—figure supplement 1B-C, Video 5) when ReaChR was expressed in md-C. We thus reasoned that md-C, but not md-L, plays a crucial role in the swallowing process.

      • GRASP is known to be non-specific and prone to false positives when neurons are in close proximity but not synaptically connected. A positive GRASP signal supports but does not confirm direct synaptic connectivity between md-C/md-L axons and MN11/MN12.

      In this study, we employed the nSyb-GRASP, wherein the GRASP is expressed at the presynaptic terminals by fusion with the synaptic marker nSyb. This method demonstrates an enhanced specificity compared to the original GRASP approach.

      Additionally, we utilized +/ UAS-nSyb-spGFP1-10, lexAop-CD4-spGFP11 ; + / MN-LexA fruit flies as a negative control to mitigate potential false signals originating from the tool itself (Author response image 1, scale bar = 50 μm). Besides the Tmc-Gal4, Tub(FRT. Gal80) / UAS-nSyb-spGFP1-10, lexAop-CD4-spGFP11 ; nompC-QF, QUAS-FLP / MN-LexA fruit flies discussed in this manuscript, we also included Tmc-Gal4, Tub(FRT. Gal80) / lexAop-nSyb-spGFP1-10, UAS-CD4-spGFP11 ; nompC-QF, QUAS-FLP / MN-LexA fruit flies as a reverse control (Author response image 2). Unexpectedly, similar positive signals were observed, indicating that positive signals may emerge due to close proximity between neurons even with nSyb-GRASP.

      Author response image 1.

      It should be noted that the existence of synaptic projections from motor neurons (MN) to md-C cannot be definitively confirmed at this juncture. At present, we can only posit the potential for synaptic connections between md-C and motor neurons. A firmer conclusion may be attainable with comprehensive whole-brain connectome data in future studies.

      Author response image 2.

      • As seen in Figure 2—figure supplement 1, the expression pattern of Tmc-GAL4 is broader than md-C alone. Therefore, the functional connectivity the authors observe between Tmc expressing neurons and MN11 and 12 cannot be traced to md-C alone

      It is true that the expression pattern of Tmc-GAL4 is broader than that of md-C alone. Our experiments, including those flies expressing TNT in Tmc+ neurons, demonstrated difficulties in emptying (Figure 2A, 2D). Notably, we encountered challenges in finding fly stocks bearing UAS>FRT-STOP-P2X2. Consequently, we opted to utilize Tmc-GAL4 to drive UAS-P2X2 instead. We believe that the results further support our hypothesis on the role of md-C in the observed behavioral change in emptying.

      Overall, this work convincingly shows that swallowing and swallowing rhythms are dependent on several mechanosensory genes. Qin et al. also characterize a candidate neuron, md-C, that is likely to provide mechanosensory feedback to pumping motor neurons, but the results they present here are not sufficient to assign this function to md-C alone. This work will have a positive impact on the field by demonstrating the importance of mechanosensory feedback to swallowing rhythms and providing a potential entry point for future investigation of the identity and mechanisms of swallowing central pattern generators.

      Reviewer #2 (Public Review):

      In this manuscript, the authors describe the role of cibarial mechanosensory neurons in fly ingestion. They demonstrate that pumping of the cibarium is subtly disrupted in mutants for piezo, TMC, and nomp-C. Evidence is presented that these three genes are co-expressed in a set of cibarial mechanosensory neurons named md-C. Silencing of md-C neurons results in disrupted cibarial emptying, while activation promotes faster pumping and/or difficulty filling. GRASP and chemogenetic activation of the md-C neurons is used to argue that they may be directly connected to motor neurons that control cibarial emptying.

      The manuscript makes several convincing and useful contributions. First, identifying the md-C neurons and demonstrating their essential role for cibarium emptying provides reagents for further studying this circuit and also demonstrates the importance of mechanosensation in driving pumping rhythms in the pharynx. Second, the suggestion that these mechanosensory neurons are directly connected to motor neurons controlling pumping stands in contrast to other sensory circuits identified in fly feeding and is an interesting idea that can be more rigorously tested in the future.

      At the same time, there are several shortcomings that limit the scope of the paper and the confidence in some claims. These include:

      a) the MN-LexA lines used for GRASP experiments are not characterized in any other way to demonstrate specificity. These were generated for this study using Phack methods, and their expression should be shown to be specific for MN11 and MN12 in order to interpret the GRASP experiments.

      Thanks for the suggestion. We have checked the expression pattern of MN-LexA, which is similar to that of the MN-GAL4 used in previous work (Manzo et al., PNAS., 2012, PMID:22474379). Here is the expression pattern:

      Author response image 3.

      b) There is also insufficient detail for the P2X2 experiment to evaluate its results. Is this an in vivo or ex vivo prep? Is ATP added to the brain, or ingested? If it is ingested, how is ATP coming into contact with md-C neuron if it is not a chemosensory neuron and therefore not exposed to the contents of the cibarium?

      The P2X2 experimental preparation was done ex vivo. We immersed the fly in the imaging buffer, as described in the Methods section under Functional Imaging. Following dissection and identification of the subesophageal zone (SEZ) area under fluorescence microscopy, we introduced ATP slowly into the buffer, positioned at a distance from the brain.

      c) In Figure 3C, the authors claim that ablating the labellum will remove the optogenetic stimulation of the md-L neuron (mechanosensory neuron of the labellum), but this manipulation would presumably leave an intact md-L axon that would still be capable of being optogenetically activated by Chrimson.

      Please refer to the corresponding answers for reviewer 1 and Figure 3—figure supplement 1.

      d) Average GCaMP traces are not shown for md-C during ingestion, and therefore it is impossible to gauge the dynamics of md-C neuron activation during swallowing. Seeing activation with a similar frequency to pumping would support the suggested role for these neurons, although GCaMP6s may be too slow for these purposes.

      Profiling the dynamics of md-C neuron activation during swallowing is crucial for unraveling the operational model of md-C and validating our proposed hypothesis. Unfortunately, our assay faces challenges in detecting the probable 6 Hz fluorescence changes with GCaMP6s.

      In general, we observed an increase in fluorescence signals during swallowing, but movement of the live flies during swallowing affected the imaging recording, so we could not obtain a reliable calcium-imaging trace for md-C neurons. To enhance the robustness of our findings, patching the md-C neurons would be a more convincing approach. However, as illustrated in Figure 2, the somata of md-C neurons are situated in the cibarium rather than the brain, and patching the md-C neuron somata in flies during ingestion is difficult.

      e) The negative result in Figure 4K that is meant to rule out taste stimulation of md-C is not useful without a positive control for pharyngeal taste neuron activation in this same preparation.

      We followed the methods used in previous work (Chen et al., Cell Rep., 2019, PMID:31644916), which we believe confirm that md-C neurons do not respond to sugars.

      In addition to the experimental limitations described above, the manuscript could be organized in a way that is easier to read (for example, not jumping back and forth in figure order).

      Thank you for your suggestion; the manuscript has been reorganized.

      Reviewer #3 (Public Review):

      Swallowing is an essential daily activity for survival, and pharyngo-laryngeal sensory function is critical for safe swallowing. In Drosophila, it has been reported that the mechanical properties of food (e.g., viscosity) can modulate swallowing. However, how mechanical expansion of the pharynx or fluid content is sensed and how it controls swallowing has remained elusive. Qin et al. showed that a group of pharyngeal mechanosensory neurons, as well as mechanosensory channels (nompC, Tmc, and Piezo), respond to these mechanical forces for regulation of swallowing in Drosophila melanogaster.

      Strengths:

      There are many reports on the effect of chemical properties of foods on feeding in fruit flies, but only a limited number of studies have reported how physical properties of food affect feeding, especially via pharyngeal mechanosensory neurons. First, they found that mechanosensory mutants, including nompC, Tmc, and Piezo, showed impaired swallowing, mainly in the emptying process. Next, they identified that cibarium multidendritic mechanosensory neurons (md-C) are responsible for controlling swallowing by regulating motor neurons (MN) 12 and 11, which control filling and emptying, respectively.

      Weaknesses:

      While the involvement of md-C and mechanosensory channels in controlling swallowing is convincing, it is not yet clear which stimuli activate md-C. Can it be an expansion of cibarium or food viscosity, or both? In addition, if rhythmic and coordinated contraction of muscles 11 and 12 is essential for swallowing, how can simultaneous activation of MN 11 and 12 by md-C achieve this? Finally, previous reports showed that food viscosity mainly affects the filling rather than the emptying process, which seems different from their finding.

      We have confirmed that swallowing sucrose water solution activated md-C neurons, while sucrose water solution alone could not (Figure 4J-K). We hypothesized that the viscosity of the food might influence this expansion process.

      While we were unable to delineate the activation dynamics of md-C neurons, our proposal posits that these neurons could be activated in a single pump cycle, sequentially stimulating MN12 and MN11. Another possibility is that the activation of md-C neurons acts as a switch, altering the oscillation pattern of the swallowing central pattern generator (CPG) from a resting state to a working state.

      In the experiments with w1118 flies fed with MC (methylcellulose) water, we observed that viscosity predominantly affects the filling process rather than the emptying process, consistent with previous findings. This raises an intriguing question. Our investigation into the mutation of mechanosensitive ion channels revealed a significant impact on the emptying process. We believe this is due to the loss of mechanosensation affecting the vibration of swallowing circuits, thereby influencing both the emptying and filling processes. In contrast, viscosity appears to make it more challenging for the fly to fill the cibarium with food, primarily attributable to the inherent properties of the food itself.

      Reviewer #4 (Public Review):

      A combination of optogenetic behavioral experiments and functional imaging is employed to identify the role of mechanosensory neurons in food swallowing in adult Drosophila. While some of the findings are intriguing and the overall goal of mapping a sensory to motor circuit for this rhythmic movement is admirable, the data presented could be improved.

      The circuit proposed (and supported by GRASP contact data) shows these multi-dendritic neurons connecting to pharyngeal motor neurons. This is pretty direct - there is no evidence that they affect the hypothetical central pattern generator - just the execution of its rhythm. The optogenetic activation and inhibition experiments are constitutive, not patterned light, and they seem to disrupt the timing of pumping, not impose a new one. A slight slowing of the rhythm is not consistent with the proposed function.

      Motor neurons implicated in patterned motions can be considered effectors of Central Pattern Generators (CPGs) (Marder et al., Curr Biol., 2001, PMID: 11728329; Hurkey et al., Nature., 2023, PMID:37225999). Given our observation of the connection between md-C neurons and motor neurons, it is reasonable to speculate that md-C neurons influence CPGs. Compared to the patterned light (0.1s light on and 0.1s light off) used in our optogenetic experiments, we noted no significant changes in their responses to continuous light stimulation. We think that optogenetic methods may lead to overstimulation of md-C neurons, failing to accurately mimic the expansion of the cibarium during feeding.

      Dysfunction in mechanosensitive ion channels or mechanosensory neurons not only disrupts the timing of pumping but also results in decreased intake efficiency (Figure 1E). The water-swallowing rhythm is generally stable in flies, and swallowing is a vital process that may involve redundant ion channels to ensure its stability.

      The mechanosensory channel mutants nompC, piezo, and TMC have a range of defects. The role of these channels in swallowing may not be sufficiently specific to support the interpretation presented. Their other defects are not described here and their overall locomotor function is not measured. If the flies have trouble consuming sufficient food throughout their development, how healthy are they at the time of assay? The level of starvation or water deprivation can affect different properties of feeding - meal size and frequency. There is no description of how starvation state was standardized or measured in these experiments.

      Defects in mechanosensory channel mutants nompC, piezo, and TMC, have been extensively investigated (Hehlert et al., Trends Neurosci., 2021, PMID:332570000). Mutations in these channels exhibit multifaceted effects, as illustrated in our RNAi experiments (see Figure 2E). Deprivation of water and food was performed in empty fly vials. It's important to note that the duration of starvation determines the fly's willingness to feed but not the pump frequency (Manzo et al., PNAS., 2012, PMID:22474379).

      In most cases, female flies were deprived of water and food in empty vials for 24 hours, because after that time most flies were willing to drink water. The deprivation time was 12 hours for flies with nompC or Tmc mutated and for flies with Kir2.1 expressed in md-C neurons, as some of these flies cannot survive 24 h of deprivation.

      The brain is likely to move considerably during swallow, so the GCaMP signal change may be a motion artifact. Sometimes this can be calculated by comparing GCaMP signal to that of a co-expressed fluorescent protein, but there is no mention that this is done here. Therefore, the GCaMP data cannot be interpreted.

      We did not co-express a fluorescent protein with GCaMP for md-C. The head of the fly was mounted onto a glass slide, and we did not observe significant signal changes before feeding.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Abstract: I disagree that swallow is the first step of ingestion. The first paragraph also mentions the final checkpoint before food ingestion. Perhaps sufficient to say that swallow is a critical step of ingestion.

      Indeed, it is not rigorous enough to say “first step”. This has been replaced by “early step”.

      Introduction:

      Line 59: "Silence" should be "Silencing"

      This has been replaced.

      Results:

      Lines 91-92: I am not clear about what this means. 20% of nompC and 20% of wild-type flies exhibit incomplete filling? So nompC is not different from wild-type?

      Sorry for the mistake. Viscous foods led to incomplete emptying (not incomplete filling), as displayed in Video 4. The swallowing behavior differs between nompC mutants and wild-type flies, as illustrated in Figure 1C, Figure 1—figure supplement 1A-C and video 1&5.

      We found that when fed with 1% MC water solution (Figure 1—figure supplement 1E-H), Tmc or piezo mutants displayed incomplete emptying, which could account for a large proportion of the time spent swallowing, whereas only 20% of nompC flies and 20% of wild-type flies sporadically exhibited incomplete emptying, which is significantly different. Although the percentage of flies displaying incomplete pumps is similar between nompC mutants and wild-type flies, the behavior is clearly different in Videos 1 and 5.

      Line 94: Should read: “while for foods with certain viscosity, the pump of Tmc or piezo mutants might"

      What evidence is there for weakened muscle motion? The phenotypes of all three mutants is quite similar, so concluding that they have roles in initiation versus swallowing strength is not well supported -this would be better moved to the discussion since it is speculative.

      Muscles are responsible for pumping the bolus from the mouth to the crop. In the case of Tmc or piezo mutants, as evidenced by incomplete emptying for viscous foods (see Video 4), we speculate that the loss of sensory stimuli leads to inadequate muscle contraction. The phenotypes observed in Tmc and piezo mutants are similar to each other, yet distinct from those of the wild-type or nompC mutant, as shown in Videos 1 and 4. The phrase "due to weakened muscle motion" has been removed for clarity.

      Line 146: If md-L neurons are also labeled by this intersection, then you are not able to know whether the axons seen in the brain are from md-L or md-C neurons. Line 148: cutting the labellum is not sufficient to ablate md-L neurons. The projections will still enter the brain and can be activated with optogenetics, even after severing the processes that reside in the labellum.

      Please refer to the responses to Reviewer #1 (Public Review): “A major weakness of the paper…” and Figure 4.

      Line 162: If the fly head alone is in saline, do you know that the sucrose enters the esophagus? The more relevant question here is whether the md-C neurons respond to mechanical force. If you could artificially inflate the cibarium with air and see the md-C neurons respond that would be a more convincing result. So far you only know that these are activated during ingestion, but have not shown that they are activated specifically by filling or emptying. In addition, you are not only imaging md-C (md-L is also labeled). This caveat should be mentioned.

      We followed the methods outlined in the previous work (Chen et al., Cell Rep., 2019, PMID:31644916), which suggested that md-C neurons do not respond to sugars. While we aimed to mechanically stimulate md-C neurons, detecting signal changes during different steps of swallowing is challenging. This aspect could be further investigated in subsequent research with the application of adequate patch recording or two-photon microscopy (TPM).

      Figure 3: It is not clear what the pie charts in Figure 3 A refer to. What are the three different rows, and what does blue versus red indicate?

      Figure 3A illustrates three distinct states driven by CsChrimson light stimulation of md-C neurons, with the proportions of flies exhibiting each state. During light activation, flies may display difficulty in filling, incomplete filling, or a normal range of pumping. The blue and red bars represent the proportions of flies showing the corresponding state, as indicated by the black line.

      Figure 4: Where are the example traces for J? The comparison in K should be average dF/F before ingestion compared with average dF/F during ingestion. Comparing the in vitro response to sucrose to the in vivo response during ingestion is not a useful comparison.

      Please refer to the answers for reviewer #2 question d).

      Reviewer #2 (Recommendations For The Authors):

      Suggested experiments that would address some of my concerns listed in the public review include:

      a) high resolution SEZ images of MN-LexA lines crossed to LexAop-GFP to demonstrate their specificity

      b) more detail on the P2X2 experiment. It is hard to make suggestions beyond that without first seeing the details.

      c) presenting average GCaMP traces for all calcium imaging results

      d) to rule out taste stimulation of md-C (Figure 4K) I would suggest performing more extensive calcium imaging experiments with different stimuli. For example, sugar, water, and increasing concentrations of a neutral osmolyte (e.g. PEG) to suppress the water response. I think that this is more feasible than trying to get an in vitro taste prep to be convincing.

      Please refer to the responses for public review of reviewer #2.

      Reviewer #3 (Recommendations For The Authors):

      Below I list my suggestions as well as criticisms.

      (1) It would be excellent if the authors could demonstrate whether varying levels of food viscosity affect md-C activation.

      That is a good point, and could be studied in future work.

      (2) It is not clear whether an intersectional approach using TMC-GAL4 and nompC-QF abolishes labelling of the labellar multidendritic neurons. If this is the case, please show labellar multidendritic neurons in TMC-GAL4 only flies and flies using the intersectional approach. Along with this question, I am concerned that labellum-removed flies could be used for feeding assay.

      Intersectional labelling using TMC-GAL4 and nompC-QF did not abolish labelling of the labellar multidendritic neurons (Author response image 4). Labellum-removed flies can be used for feeding assays (Figure 3—figure supplement 1B-C, video 5), but swallowing behavior is affected if the LSO or cibarium of the fly is damaged, so the labellum must be removed very carefully.

      Author response image 4.

      (3) Please provide the detailed methods for GRASP and include proper control.

      Please refer to the responses for public review of reviewer #1.

      (4) The authors hypothesized that md-C sequentially activates MN11 and 12. Is the time gap between applying ATP on md-C and activation of MN11 or MN12 different?

      Please refer to the responses for public review of reviewer #3. The time gap between applying ATP on md-C and activation of MN11 or MN12 did not show significant differences, and we think the reason is that the ex vivo conditions could not completely mimic the in vivo process.

      I found the manuscript includes many errors, which need to be corrected.

      (1) The reference formatting needs to be rechecked, for example, lines 37, 42, and 43.

      (2) Line 44-46: There is some misunderstanding. The role of pharyngeal mechanosensory neurons is not known compared with chemosensory neurons.

      (3) Line 49: Please specify which type of quality of food. Chemical or physical?

      (4) Line 80 and Figure 1B-D Authors need to put filling and emptying time data in the main figure rather than in the supplementary figure. Otherwise, please cite the relevant figures in the text(S1A-C).

      (5) Line 84-85; Is "the mutant animals" indicating only nompC? Please specify it.

      (6) Figure 1a: It is hard to determine the difference between the series of images. And also label filling and emptying under the time.

      (7) S1E-H: It is unclear what "Time proportion of incomplete pump" means. Please define it.

      (8) Please reorganize the figures to follow the order of the text, for example, figures 2 and 4

      (9) Figure 4A. There is mislabelling in Figure 4A. It is supposed to be phalloidin not nc82.

      (10) Figure 4K: It does not match the figure legend and main text.

      (11) Figure 4D and G: Please indicate ATP application time point.

      Thank you for the corrections; all the points mentioned have been revised.

      Reviewer #4 (Recommendations For The Authors):

      The figures need improvement. 1A has tiny circles showing pharynx and any differences are unclear.

      The expression pattern of some of these drivers (Supplement) seems quite broad. The tmc nompC intersection image in Figure 1F is nice but the cibarium images are hard to interpret: does this one show muscle expression? What are "brain" motor neurons? Where are the labellar multi-dendritic neurons?

      The Tmc nompC intersection image shows no expression in muscles. The somata of motor neurons 12 and 11 are situated in the SEZ area of the brain, while the somata of md-C neurons are in the cibarium. An image of md-L neurons is provided in our response to Reviewer #3 (Recommendations For The Authors).

      Why do the assays alternate between swallowing food and swallowing water?

      Thank you for your suggestion; Figure 1A has been enlarged. The Tmc nompC intersection image in Figure 2F displays the position of md-C neurons from a ventral perspective, and muscles were not labelled. We stained muscles in the cibarium with phalloidin, and the image is shown in Figure 4A; we did not find overlap between md-C neurons and muscles. An image of md-L neurons is provided as Author response image 4.

      In the majority of our experiments, we employed water to test swallowing behavior, while we used methylcellulose water solution to test swallowing behavior of mechanoreceptor mutants, and sucrose solution for flies with md-C neurons expressing GCaMP since they hardly drank water when their head capsules were open.

      How starved or water-deprived were the flies?

      One day prior to the behavioral assays, flies were transferred to empty vials (without water or food) for 24 hours of deprivation. Flies that could not survive 24 h of deprivation were deprived for 12 h.

      How exactly was the pumping frequency (shown in Fig 1B) measured? There is no description in the methods at all. If the pump frequency is scored by changes in blue food intensity (arbitrary units?), this seems very subjective and maybe image angle dependent. What was camera frame rate? Can it capture this pumping speed adequately? Given the wealth of more quantitative methods for measuring food intake (eg. CAFE, flyPAD), it seems that better data could be obtained.

      How was the total volume of the cibarium measured? What do the pie charts in Figure 3A represent?

      The pump frequency was computed as the number of pumps divided by the time scale, following the methodology outlined in Manzo et al., 2012. Swallowing curves were plotted using the inverse of the blue food intensity in the cibarium. In this representation, ascending lines signify filling, while descending lines indicate emptying (see Figure 2D, 3B). We maintain objectivity in our approach since, during the recording of swallowing behavior, the fly was fixed, and we exclusively used data for analysis when the Region of Interest (ROI) was in the cibarium. This ensures that the intensity values accurately reflect the filling and emptying processes. Furthermore, we conducted manual frame-by-frame checks of pump frequency, and the results align with those generated by the time series analyzer V3 of ImageJ.

      To assess the total volume of ingestion, we followed the CAFE method, using a graduated glass capillary. We then calculated the ingestion rate (nL/s) by dividing the total volume of ingestion by the feeding time.
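
      The two calculations described above (pump frequency as the number of pumps divided by the recording time, and ingestion rate as volume divided by feeding time) can be sketched as follows. This is a minimal illustration, not the analysis pipeline used for the paper: the function names are hypothetical, "inverse" is assumed to mean subtraction from the maximum intensity, and counting a pump as one upward crossing of a midpoint threshold is an assumption made only for this example.

```python
def invert_intensity(intensity):
    """Turn raw intensity in the cibarium ROI into a curve that rises during filling.

    Assumption: "inverse" is implemented as (max - value), not 1/value.
    """
    peak = max(intensity)
    return [peak - value for value in intensity]


def count_pumps(curve, threshold=None):
    """Count pumps as upward crossings of a threshold (one crossing per cycle)."""
    if threshold is None:
        # Assumption: use the midpoint between the extremes of the curve.
        threshold = (max(curve) + min(curve)) / 2
    return sum(1 for prev, cur in zip(curve, curve[1:]) if prev <= threshold < cur)


def pump_frequency_hz(curve, frame_rate_hz, threshold=None):
    """Number of pumps divided by the duration of the recording in seconds."""
    duration_s = len(curve) / frame_rate_hz
    return count_pumps(curve, threshold) / duration_s


def ingestion_rate_nl_per_s(total_volume_nl, feeding_time_s):
    """Total ingested volume (nL) divided by feeding time (s)."""
    return total_volume_nl / feeding_time_s


if __name__ == "__main__":
    # Synthetic recording: 10 identical pump cycles, 6 frames each, at 30 frames/s.
    raw = [3, 2, 1, 0, 1, 2] * 10
    curve = invert_intensity(raw)
    print(pump_frequency_hz(curve, frame_rate_hz=30))  # 10 pumps in 2 s
    print(ingestion_rate_nl_per_s(300.0, 60.0))
```

      A threshold-crossing count of this kind is sensitive to noise near the threshold, so a manual frame-by-frame check, as described above, remains a useful cross-validation.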

      The changes seem small, in spite of the claim of statistical significance.

      The observed stability in pump frequency within a given genotype underscores the importance of even seemingly small changes, which are statistically significant. We speculate that the stability in swallowing frequency suggests the existence of a redundant mechanism to ensure the robustness of the process. Disruption of one channel might be partially compensated for by others, highlighting the vital nature of the swallowing mechanism.

      How is this change in pump frequency consistent with defects in one aspect of the cycle - either ingestion (activation) or expulsion (inhibition)?

      Please refer to Figures 2 and 3. Both the filling and emptying processes were affected, while inhibition mainly influences emptying time (Figure 1—figure supplement 1).

      for the authors:

      Line 48: extensively

      Line 62 - undiscovered.

      Line 107, 463: multi

      Line 124: What is "dysphagia?" This is an unusual word and should be defined.

      Line 446: severe

      Line 466: in the cibarium or not?

      Thank you for the corrections; all the places mentioned have been revised.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Thank you for organizing the reviews for our manuscript, “Behavioral entrainment to rhythmic auditory stimulation can be modulated by tACS depending on the electrical stimulation field properties,” and for the positive eLife assessment. We also thank the reviewers for their constructive comments. We have addressed every comment, which has helped to improve the transparency and readability of the manuscript. The main changes to the manuscript are summarized as follows:

      1. Surrogate distributions were created for each participant and session to estimate the effect of tACS-phase lag on behavioral entrainment to the sound that could have occurred by chance or because of our analysis method (R1). The actual tACS-amplitude effects were normalized relative to the surrogate distribution, and statistical analysis was performed on the normalized (z-score) values. This analysis did not change our main outcome: that tACS modulates behavioral entrainment to the sound depending on the phase lag between the auditory and the electrical signals. This analysis has now been incorporated into the Results section and in Fig. 3c-d.

      2. Two additional supplemental figures were created to include the single-participant data related to Fig. 3b and 3e (R2).

      3. Additional editing of the manuscript has been performed to improve the readability.
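
      The surrogate normalization summarized in point 1 can be illustrated with a short sketch. It is a hedged example under stated assumptions, not the analysis code used for the manuscript: it assumes one performance value per tACS phase-lag bin (six equally spaced bins), builds surrogates by shuffling the bin labels, and takes the amplitude of the best-fitting cosine as the tACS-amplitude effect. The function names and the number of surrogates are hypothetical.

```python
import math
import random
import statistics


def cosine_amplitude(values):
    """Amplitude of the least-squares cosine fit for equally spaced phase bins.

    For equally spaced bins this equals the magnitude of the first Fourier
    component of the binned values.
    """
    n = len(values)
    a = 2 / n * sum(v * math.cos(2 * math.pi * k / n) for k, v in enumerate(values))
    b = 2 / n * sum(v * math.sin(2 * math.pi * k / n) for k, v in enumerate(values))
    return math.hypot(a, b)


def surrogate_z_score(values, n_surrogates=1000, seed=0):
    """Z-score the observed amplitude against a shuffled-bin null distribution."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    observed = cosine_amplitude(values)
    null = []
    for _ in range(n_surrogates):
        shuffled = list(values)
        rng.shuffle(shuffled)  # assumption: surrogates come from shuffling bin labels
        null.append(cosine_amplitude(shuffled))
    sd = statistics.stdev(null)
    if sd == 0:
        return 0.0  # flat data: every surrogate is identical
    return (observed - statistics.fmean(null)) / sd


if __name__ == "__main__":
    # A perfectly sinusoidal dependence on phase lag (amplitude 1).
    bins = [1.0, 0.5, -0.5, -1.0, -0.5, 0.5]
    print(cosine_amplitude(bins))
    print(surrogate_z_score(bins))
```

      Normalizing in this way expresses each participant's effect relative to what the fitting procedure alone would produce, so a z-score near zero indicates an amplitude that is no larger than expected by chance.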

      Below, you will find a point-by-point response to the reviewers’ comments.

      Reviewer #1 (Public Review):

      We are grateful for the reviewer’s positive assessment of the potential impact of our study. The reviewer’s primary concerns were 1) the tACS lag effects reported in the manuscript might be noise because of the realignment procedure, and 2) no multiple comparisons correction was conducted in the model comparison procedure.

      In response to point 1), we have reanalyzed the data in exactly the manner prescribed by the reviewer. Our effects remain, and the new control analysis strengthens the manuscript. Regarding point 2), the model selection procedure was not based on evaluating the statistical significance of any model or predictor. Instead, the single model that best fit the data was selected as the model with the lowest Akaike’s information criterion (AIC), and its superiority relative to the second-best model was corroborated using the likelihood ratio test. Only the best model was evaluated for significance and analyzed in terms of its predictors and interactions. The test of this model is an omnibus test and does not require multiple-comparison correction unless there are post hoc decompositions. For similar approaches, see (Kasten et al., 2019).
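
      The selection logic described above (lowest AIC, then a likelihood ratio test against the runner-up) can be sketched generically. The example below is illustrative only: the model names, log-likelihoods, and parameter counts are made-up placeholders, and the likelihood ratio test is valid only under the assumption that the two models being compared are nested.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    z = x / 2
    if df % 2:  # odd df: start from Q(1/2, z) = erfc(sqrt(z))
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:  # even df: start from Q(1, z) = exp(-z)
        a, q = 1.0, math.exp(-z)
    while a < df / 2:
        # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q


def select_by_aic(models):
    """Return the name of the lowest-AIC model and all AIC scores.

    `models` maps a name to (log_likelihood, n_params).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    return min(scores, key=scores.get), scores


def likelihood_ratio_test(ll_reduced, k_reduced, ll_full, k_full):
    """LRT for nested models: statistic 2*(llf - llr), df = difference in parameters."""
    stat = 2 * (ll_full - ll_reduced)
    return stat, chi2_sf(stat, k_full - k_reduced)


if __name__ == "__main__":
    # Hypothetical fits: (log-likelihood, number of parameters).
    models = {"reduced": (-520.0, 3), "full": (-510.0, 5)}
    best, scores = select_by_aic(models)
    print(best, scores)
    print(likelihood_ratio_test(*models["reduced"], *models["full"]))
```

      Because only the single selected model is then tested for significance, the number of candidate models does not enter as a family of tests in this scheme.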

      Below, we have responded to each comment specifically or referred to this general comment.

      Summary of what the authors were trying to achieve.

      This paper studies the possible effects of tACS on the detection of silence gaps in an FM-modulated noise stimulus. Both FM modulation of the sound and the tACS are at 2Hz, and the phase of the two is varied to determine possible interactions between the auditory and electric stimulation. Additionally, two different electrode montages are used to determine if variation in electric field distribution across the brain may be related to the effects of tACS on behavioral performance in individual subjects.

      Major strengths and weaknesses of the methods and results.

      The study appears to be well-powered to detect modulation of behavioral performance with N=42 subjects. There is a clear and reproducible modulation of behavioral effects with the phase of the FM sound modulation. The study was also well designed, combining fMRI, current flow modeling, montage optimization targeting, and behavioral analysis. A particular merit of this study is to have repeated the sessions for most subjects in order to test repeat-reliability, which is so often missing in human experiments. The results and methods are generally well-described and well-conceived. The portion of the analysis related to behavior alone is excellent. The analysis of the tACS results is also generally well described, candidly highlighting how variable results are across subjects and sessions. The figures are all of high quality and clear. One weakness of the experimental design is that no effort was made to control for sensation effects. tACS at 2Hz causes prominent skin sensations which could have interacted with auditory perception and thus, detection performance.

      The reviewer is right that we did not control for the sensation effects in our paradigm. We asked the participants to rate the strength of the perceived stimulation after each run. However, this information was used only to assess the safety and tolerability of the stimulation protocol. Nevertheless, we did not consider controlling for skin sensations necessary given the within-participant nature of our design (all participants experienced all six tACS–audio phase lag conditions, which were identical in their potential to cause physical sensations; the only difference between conditions was related to the timing of the auditory stimulus). That is, while the reviewer is right that 2-Hz tACS can indeed induce skin sensation under the electrodes, in this study, we report the effects that depend on the tACS-phase lag relative to the FM-stimulus. Note that the starting phase of the FM-stimulus was randomized across trials within each block (all six tACS audio lags were presented in each block of stimulation). We have no reason to expect the skin sensation to change with the tACS-audio lag from trial to trial, and therefore do not consider this to be a confound in our design. We have added some sentences with this information to the Discussion section:

      Pages 16-17, lines 497-504: “Note that we did not control for the skin sensation induced by 2-Hz tACS in this experiment. Participants rated the strength of the perceived stimulation after each run. However, this information was used only to assess the safety and tolerability of the stimulation protocol. It is in principle possible that skin sensation would depend on tACS phase itself. However, in this study, we report effects that depend on the relationship between tACS-phase and FM-stimulus phase, which changed from trial to trial as the starting phase of the FM-stimulus was randomized across trials. We have no reason to expect the skin sensation to change with the tACS-audio lag and therefore do not consider this to be a confound in our data.”

      Appraisal of whether the authors achieved their aims, and whether the results support their conclusions.

      Unfortunately, the main effects described for tACS are encumbered by a lack of clarity in the analysis. It does appear that the tACS effects reported here could be an artifact of the analysis approach. Without further clarification, the main findings on the tACS effects may not be supported by the data.

      Likely impact of the work on the field, and the utility of the methods and data to the community.

      The central claim is that tACS modulates behavioral detection performance across the 0.5s cycle of stimulation. However, neither the phase nor the strength of this effect reproduces across subjects or sessions. Some of these individual variations may be explainable by individual current distribution. If these results hold, they could be of interest to investigators in the tACS field.

      The additional context you think would help readers interpret or understand the significance of the work.

      The following are more detailed comments on specific sections of the paper, including details on the concerns with the statistical analysis of the tACS effects.

      The introduction is well-balanced, discussing the promise and limitations of previous results with tACS. The objectives are well-defined.

      The analysis surrounding behavioral performance and its dependence on the phase of the FM modulation (Figure 3) is masterfully executed and explained. It appears that it reproduces previous studies and points to a very robust behavioral task that may be of use in other studies.

      Again, we would like to thank the reviewer for the positive assessment of the potential impact of our work and for the thoughtful comments regarding the methodology. For readability in our responses, we have numbered the comments below.

      1. There is a definition of tACS(+) vs tACS(-) based on the relative phase of tACS that may be problematic for the subsequent analysis of Figures 4 and 5. It seems that phase 0 is adjusted to each subject/session. For argument's sake, let's assume the curves in Fig. 3E are random fluctuations. Then aligning them to best-fitting cosine will trivially generate a FM-amplitude fluctuation with cosine shape as shown in Fig. 4a. Selecting the positive and negative phase of that will trivially be larger and smaller than a sham, respectively, as shown in Fig 4b. If this is correct, and the authors would like to keep this way of showing results, then one would need to demonstrate that this difference is larger than expected by chance. Perhaps one could randomize the 6 phase bins in each subject/session and execute the same process (fit a cosine to curves 3e, realign as in 4a, and summarize as in 4b). That will give a distribution under the Null, which may be used to determine if the contrast currently shown in 4b is indeed statistically significant.

      We agree with the reviewer’s concerns regarding the possible bias induced by the realignment procedure used to estimate tACS effects. Certainly, when adjusting phase 0 to each participant/session’s best tACS phase (peak in the fitting cosine), selecting the positive phase of the realigned data will be trivially larger than sham (Fig. 4a). This is why the realigned zero-phase and opposite phase (trough) bins were excluded from the analysis in Fig. 4b. Therefore, tACS(+) vs. tACS(-) do not represent behavioral entrainment at the peak positive and negative tACS lags, as both bins were already removed from the analysis. tACS(+) and tACS(-) are the averages of two adjacent bins from the positive and negative tACS lags, respectively (Zoefel et al., 2019). Such an analysis relies on the idea that if the effect of tACS is sinusoidal, presenting the auditory stimulus at the positive half cycle should be different than when the auditory stimulus lags the electrical signal by the other half. If the effect of tACS was just random noise fluctuations, there is no reason to assume that such fluctuations would be sinusoidal; therefore, any bias in estimating the effect of tACS should be removed when excluding the peak to which the individual data were realigned. Similar analytical procedures have been used previously in the literature (Riecke et al., 2015; Riecke et al., 2018). We have modified the colors in Fig. 4a and 4c (former 4b) and added a new panel to the figure (new 4b) to make the realignment procedure, including the exclusion of the realigned peak and trough data, more visually obvious.
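
      The realignment step described above can be sketched in a few lines. This is a minimal illustration and not the analysis code used in the study: it assumes six equally spaced tACS-lag bins, fits the cosine by ordinary least squares, and realigns to the bin nearest the fitted peak. The function names `peak_bin` and `tacs_plus_minus` are hypothetical.

```python
import numpy as np


def peak_bin(fm_amplitude):
    """Return the index of the lag bin closest to the peak of a fitted one-cycle cosine."""
    n = len(fm_amplitude)
    lag = 2 * np.pi * np.arange(n) / n
    # Linear least squares on [1, cos, sin] is equivalent to fitting offset + R*cos(lag - phase).
    design = np.column_stack([np.ones(n), np.cos(lag), np.sin(lag)])
    _, a, b = np.linalg.lstsq(design, fm_amplitude, rcond=None)[0]
    phase = np.arctan2(b, a) % (2 * np.pi)  # a*cos(x) + b*sin(x) peaks at x = atan2(b, a)
    return int(np.round(phase / (2 * np.pi) * n)) % n


def tacs_plus_minus(fm_amplitude):
    """Realign to the fitted peak, drop the peak and trough bins, average their neighbours."""
    fm_amplitude = np.asarray(fm_amplitude, dtype=float)
    n = len(fm_amplitude)  # assumed even and at least 6, so the four flanking bins are distinct
    aligned = np.roll(fm_amplitude, -peak_bin(fm_amplitude))  # fitted peak moves to index 0
    plus = aligned[[1, n - 1]].mean()                  # bins flanking the excluded peak
    minus = aligned[[n // 2 - 1, n // 2 + 1]].mean()   # bins flanking the excluded trough
    return plus, minus


# Toy input: a noiseless cosine whose peak sits in bin 2.
example = np.cos(2 * np.pi * (np.arange(6) - 2) / 6)
print(tacs_plus_minus(example))
```

      Because the peak and trough bins are discarded before averaging, tACS(+) and tACS(-) are built only from bins that were not used as the realignment target.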

      Moreover, we very much like the reviewer’s suggestion to normalize the magnitude of the tACS effect using a permutation strategy. We performed additional analyses to normalize our tACS effect in Fig. 4c by the probability of obtaining the effect by chance. For each subject and session, tACS-phase lags were randomized across trials for a total of 1000 iterations. For each iteration, the gaps were binned by the FM-stimulus phase and tACS-lag. For each tACS-lag, the amplitude of behavioral entrainment to the FM-stimulus was estimated (FM-amplitude), as shown in Fig. 3. Similar to the original data, a second cosine fit was estimated for the FM-amplitude by tACS-lag. Optimal tACS-phase was estimated from the cosine fit and FM-amplitude values were realigned. Again, the realigned phase 0 and trough were removed from the analysis, and their adjacent bins were averaged to obtain the FM-amplitude at tACS(+) and tACS(−), as shown in Fig. 4c. We then computed the difference between 1) tACS(+) and sham, 2) tACS(-) and sham, and 3) tACS(+) and tACS (-), for the original data and the permuted datasets. This procedure was performed for each participant and session to estimate the size of the tACS effect for the original and surrogate data. The original tACS effects were transformed to z-scores using surrogate distributions, providing us with an estimate of the size of the real effect relative to chance. We then computed one-sample t-tests to compare whether the effects of tACS were statistically significant. In fact, this analysis showed that the tACS effects were still statistically significant. This analysis has been added to the Results and Methods sections and is included in Figure 4d.

      Page 10, lines 282-297: “In order to further investigate whether the observed tACS effect was significantly larger than chance and not an artifact of our analysis procedure (33), we created 1000 surrogate datasets per participant and session by permuting the tACS lag designation across trials. The same binning procedure, realignment, and cosine fits were applied to each surrogate dataset as for the original data. This yielded a surrogate distribution of tACS(+) and tACS(-) values for each participant and session. These values were averaged across sessions since the original analysis did not show a main effect of session. We then computed the difference between tACS(+) and sham, tACS(-) and sham, and tACS(+) and tACS(-), separately for the original and surrogate datasets. The obtained differences for the original data were then z-scored using the mean and standard deviation of the surrogate distribution. Note that in this case we used data of all 42 participants who had at least one valid session (37 participants with both sessions). Three one-sample t-tests were conducted to investigate whether the size of the tACS effect obtained in the original data was significantly larger than that obtained by chance (Fig. 4d). This analysis showed that all z-scores were significantly higher than zero (all t(41) > 2.36, p < 0.05, all p-values corrected for multiple comparisons using the Holm-Bonferroni method).”

      Page 31, lines 962-972: “To further control that the observed tACS effects were not an artifact of the analysis procedure, the difference between the tACS conditions (sham, tACS(+), and tACS(-)) were normalized using a permutation approach. For each participant and session, 1000 surrogate datasets were created by permuting the tACS lag designation across trials. The same binning procedure, realignment, and cosine fits were applied to each surrogate dataset as for the original data (see above). FM-amplitude at sham, tACS(+) and tACS(-) were averaged across sessions since the original analysis did not show a main effect of session. Difference between tACS conditions were estimated for the original and surrogate datasets and the resulting values from the original data were z-scored using the mean and standard deviation from the surrogate distributions. One-sample t-tests were conducted to test the statistical significance of the z-scores. P-values were corrected for multiple comparisons using the Holm-Bonferroni method.”
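
      The permutation normalization and the multiple-comparison correction described above can be sketched as follows. This is an illustration under stated assumptions and not the code used in the study: `statistic` is a placeholder for the full pipeline (binning, cosine fits, realignment, and the tACS(+)/tACS(-)/sham contrasts), and the toy `mean_difference` function, the function names, and the simulated data are all hypothetical.

```python
import numpy as np


def permutation_z(values, labels, statistic, n_perm=1000, seed=0):
    """z-score an observed statistic against a null built by permuting labels across trials."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    labels = np.asarray(labels)
    observed = statistic(values, labels)
    # Each surrogate reruns the same statistic on shuffled trial labels.
    surrogate = np.array(
        [statistic(values, rng.permutation(labels)) for _ in range(n_perm)]
    )
    return (observed - surrogate.mean()) / surrogate.std(ddof=1)


def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = (m - np.arange(m)) * p[order]  # smallest p is multiplied by m, the next by m-1, ...
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)  # enforce monotonicity, cap at 1
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


def mean_difference(values, labels):
    """Toy stand-in statistic: difference between the means of two label groups."""
    return values[labels == 1].mean() - values[labels == 0].mean()


# Simulated trials with a genuine label effect (hypothetical data).
demo_rng = np.random.default_rng(1)
demo_labels = np.repeat([0, 1], 100)
demo_values = 2.0 * demo_labels + demo_rng.normal(size=200)
print(permutation_z(demo_values, demo_labels, mean_difference, n_perm=200))
print(holm_adjust([0.01, 0.04, 0.03]))
```

      Passing the statistic in as a function keeps the surrogate loop identical for the original and permuted data, which is the property the control analysis relies on.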

      2. Results of Fig. 5a and 5b seem consistent with the concern raised above about the results of Fig. 4. It appears we are looking at an artifact of the realignment procedure, on otherwise random noise. In fact, the drop in "tACS-amplitude" in Fig. 5c is entirely consistent with a random noise effect.

      Please see our response to the comment above.

      3. "To better understand what factors might be influencing inter-session variability in tACS effects, we estimated multiple linear models ..." This post hoc analysis does not seem to have been corrected for multiple comparisons of these "multiple linear models". It is not clear how many different things were tried. The fact that one of them has a p-value of 0.007 for some factors with amplitude-difference, but these factors did not play a role in the amplitude-phase, suggests again that we are not looking at a lawful behavior in these data.

      We suspect that the reviewer did not have access to the supplemental materials where all tables (relevant here is Table S3) are provided. This post hoc analysis was performed as an exploratory analysis to better understand the factors that could influence the inter-session variability of tACS effects. In Table S3, we provide the formula for each of the seven models tested, including their Akaike information criterion corrected for small samples (AICc), R2, F, and p-values. As described in the methods section, the winning model was selected as the model with the smallest AICc. A similar procedure has been previously used in the literature (Kasten et al., 2019). Moreover, to ensure that our winning model was better at explaining the data than the second-best unrestricted model, we used the likelihood ratio test. After choosing the winning model and before reporting the significance of the predictors, we examined the significance of the model in and of itself, taking into account its R2 as well as F- and p-values relative to a constant model. Thus, only one model is being evaluated in terms of statistical significance. Therefore, to our understanding, there are no multiple comparisons to correct for. We added the information regarding the selection procedure, hoping this will make the analysis clearer.

      See page 12, lines 354-360: “This model was selected because it had the smallest Akaike’s information criterion (corrected for small samples), AICc. Moreover, the likelihood ratio test showed no evidence for choosing the more complex unrestricted model (stat = 2.411, p = 0.121). Following the same selection criteria, the winning model predicting inter-session variability in tACS-phase, included only the factor gender (Table S4). However, this model was not significant in and of itself when compared to a constant model (F-statistic vs. constant model: 3.05, p = 0.09, R2 = 0.082).”
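
      The selection logic described above (smallest AICc, followed by a likelihood ratio test against the more complex model) can be sketched as follows. This is a hedged illustration and not the authors' code: it assumes ordinary least-squares models with Gaussian errors, counts only the regression coefficients as parameters, and assumes the two nested models differ by a single parameter, so the test statistic is compared with a chi-square distribution with one degree of freedom. The simulated data and all function names are hypothetical.

```python
import math

import numpy as np


def ols_rss(design, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    residual = y - design @ beta
    return float(residual @ residual)


def aicc(rss, n_obs, n_params):
    """Small-sample corrected AIC for a least-squares model (up to an additive constant)."""
    aic = n_obs * math.log(rss / n_obs) + 2 * n_params
    return aic + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)


def lrt_p_value_one_df(stat):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Simulated example: x2 is irrelevant, so the restricted model should usually win on AICc.
rng = np.random.default_rng(0)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + rng.normal(scale=0.5, size=n)
restricted = np.column_stack([np.ones(n), x1])
unrestricted = np.column_stack([np.ones(n), x1, x2])
for name, design in [("restricted", restricted), ("unrestricted", unrestricted)]:
    print(name, aicc(ols_rss(design, y), n, design.shape[1]))

# p-value for the likelihood ratio statistic quoted in the text, assuming one degree of freedom.
print(lrt_p_value_one_df(2.411))
```

      Under the one-degree-of-freedom assumption, the quoted statistic of 2.411 corresponds to a p-value of roughly 0.12, in line with the value reported above.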

      4. "So far, our results demonstrate that FM-stimulus driven behavioral modulation of gap detection (FM-amplitude) was significantly affected by the phase lag between the FM-stimulus and the tACS signal (Audio-tACS lag) ..." There appears to be nothing in the preceding section (Figures 4 and 5) to show that the modulation seen in 3e is not just noise. Maybe something can be said about 3b on an individual subject/session basis that makes these results statistically significant on their own. Maybe these modulations are strong and statistically significant, but just not reproducible across subjects and sessions?

      Please see our response to the first comment regarding the validity of our analysis for demonstrating the significant effect of tACS lag on behavioral entrainment to the FM-stimulus (FM-amplitude), and the new control analysis. After performing the permutation tests to make sure the reported effects are not noise, our statistical analysis still shows that tACS-lag significantly modulates behavioral entrainment to the sound (FM-amplitude). Thus, the reviewer is right to say “these modulations are strong and statistically significant, just not reproducible across subjects and sessions”. In this regard, we consider our evaluation of the session-to-session reliability of tACS effects to be of high relevance for the field, as this is often overlooked in the literature.

      5. "Inter-individual variability in the simulated E-field predicts tACS effects" Authors here are attempting to predict a property of the subjects that was just shown to not be a reliable property of the subject. Authors are picking 9 possible features for this, testing 33 possible models with N=34 data points. With these circumstances, it is not hard to find something that correlates by chance. And some of the models tested had interaction terms, possibly further increasing the number of comparisons. The results reported in this section do not seem to be robust, unless all this was corrected for multiple comparisons, and it was not made clear?

      We thank the reviewer very much for this comment. While the reviewer is right that in these models, we are trying to predict an individual property (tACS-amplitude) that was not test–retest reliable across sessions, we still consider this to be a valid analysis. Here, we take the tACS-amplitude averaged across sessions, trying to predict the probability of a participant to be significantly modulated by tACS, in general, regardless of day-to-day variability. Regarding the number of multiple regression models, how we chose the winning model and the appropriateness/need of multiple-comparisons correction in this case, please see our explanation under “Reviewer 1 (Public review)” and our response to comment 3.

      6. "Can we reduce inter-individual variability in tACS effects ..." This section seems even more speculative and with mixed results.

      We agree with the reviewer that this section is a bit speculative. We are trying to plant some seeds for future research that can help move the field forward in the quest for better stimulation protocols. We have added a sentence at the end of the section to explicitly say that more evidence is needed in this regard.

      Page 14, lines 428-429: “At this stage, more evidence is needed to prove the superiority of individually optimized tACS montages for reducing inter-individual variability in tACS effects.”

      Given the concerns with the statistical analysis above, there are concerns about the following statements in the summary of the Discussion:

      7. "2) does modulate the amplitude of the FM-stimulus induced behavioral modulation (FM-amplitude)"

      This seems to be based on Figure 4, which leaves one with significant concerns.

      Please see our response to comment 1. We hope the reviewer is satisfied with our additional analysis, which confirms that the tACS effect reported here is not noise.

      8. "4) individual variability in tACS effect size was partially explained by two interactions: between the normal component of the E-field and the field focality, and between the normal component of the E-field and the distance between the peak of the electric field and the functional target ROIs."

      The complexity of this statement alone may be a good indication that this could be the result of false discovery due to multiple comparisons.

      We respectfully disagree with the reviewer’s opinion that this is a complex statement. We think that these interaction effects are very intuitive as we explain in the results and discussion sections. These significant interactions show that for tACS to be effective, it matters that current gets to the right place and not to irrelevant brain regions. We believe this finding is of great importance for the field, since most studies on the topic still focus mostly on predicting tACS effects from the absolute field strength and neglect other properties of the electric field.

      For the same reasons as stated above, the following statements in the Abstract do not appear to have adequate support in the data:

      "We observed that tACS modulated the strength of behavioral entrainment to the FM sound in a phase-lag specific manner. ... Inter-individual variability of tACS effects was best explained by the strength of the inward electric field, depending on the field focality and proximity to the target brain region. Spatially optimizing the electrode montage reduced inter-individual variability compared to a standard montage group."

      Please see our responses to all previous comments.

      In particular, the evidence in support of the last sentence is unclear. The only finding that seems related is that "the variance test was significant only for tACS(-) in session 2". This is a very narrow result to be able to make such a general statement in the Abstract. But perhaps this can be made clearer.

      We changed this sentence in the abstract to:

      Page 2, lines 41-43: “Although additional evidence is necessary, our results also provided suggestive insights that spatially optimizing the electrode montage could be a promising tool to reduce inter-individual variability of tACS effects.”

      Reviewer #3 (Public Review):

      In "Behavioral entrainment to rhythmic auditory stimulation can be modulated by tACS depending on the electrical stimulation field properties" Cabral-Calderin and collaborators aimed to document 1) the possible advantages of personalized tACS montage over standard montage on modulating behavior; 2) the inter-individual and inter-session reliability of tACS effects on behavioral entrainment and, 3) the importance of the induced electric field properties on the inter-individual variability of tACS.

      To do so, in two different sessions, they investigated how the detection of silent gaps occurring at random phases of a 2Hz- amplitude modulated sound could be enhanced with 2Hz tACS, delivered at different phase lags. In addition, they evaluated the advantage of using spatially optimized tACS montages (information-based procedure - using anatomy and functional MRI to define the target ROI and simulation to compare to a standard montage applied to all participants) on behavioral entrainment. They first show that the optimized and the standard montages have similar spatial overlap to the target ROI. While the optimized montage induced a more focal field compared to the standard montage, the latter induced the strongest electric field. Second, they show that tACS does not modify the optimal phase for gap detection (phase of the frequency-modulated sound) but modulates the strength of behavioral entrainment to the frequency-modulated sound in a phase-lag specific manner. However, and surprisingly, they report that the optimal tACS lag, and the magnitude of the phasic tACS effect were highly variable across sessions. Finally, they report that the inter-individual variability of tACS effects can be explained by the strength of the inward electric field as a function of the field focality and on how well it reached the target ROI.

      The article is interesting and well-written, and the methods and approaches are state-of-the-art.

      Strengths:

      • The information-based approach used by the authors is very strong, notably with the definition of subject-specific targets using a fMRI localizer and the simulation of electric field strength using 3 different tACS montages (only 2 montages used for the behavioral experiment).

      • The inter-session and inter-individual variability are well documented and discussed. This article will probably guide future studies in the field.

      Weaknesses:

      • The addition of simultaneous EEG recording would have been beneficial to understand the relationship between tACS entrainment and the entrainment to rhythmic auditory stimulation.

      We are grateful for the reviewer’s positive assessment of our work and for the recommendations. We agree with the reviewer that adding simultaneous EEG or MEG to our design would have been beneficial to understand tACS effects. However, as the reviewer may be aware, such a combination also poses additional challenges due to the strong artifacts induced by tACS in the EEG signals, which are at the frequency of interest and several orders of magnitude larger than the signal of interest. Unfortunately, an adequate setup for simultaneous tACS-EEG was not available at the time of the study. Nevertheless, since we are using a paradigm that we have repeatedly studied in the past and have shown to entrain neural activity and modulate behavior rhythmically, we are confident our results are of interest on their own. For readability of our answers, we have numbered the comments below.

      1. It would have been interesting to develop the fact that tACS did not "overwrite" neural entrainment to the auditory stimulus. The authors try to explain this effect by mentioning that "tACS is most effective at modulating oscillatory activity at the intended frequency when its power is not too high" or "tACS imposes its own rhythm on spiking activity when tACS strength is stronger than the endogenous oscillations but it decreases rhythmic spiking when tACS strength is weaker than the endogenous oscillations". However, it is relevant to note that the oscillations in their study are by definition "not endogenous" and one can interpret their results as a clear superiority of sensory entrainment over tACS entrainment. This potential superiority should be discussed, documented, and developed.

      We thank the reviewer very much for this remark. We completely agree that our results could be interpreted as a clear superiority of sensory entrainment over tACS entrainment. We have now incorporated this possibility in the discussion.

      Page 16, line 472-478: “Alternatively, our results could simply be interpreted as a clear superiority of the auditory stimulus for entrainment. In other words, sensory entrainment might just be stronger than tACS entrainment in this case where the stimulus rhythm was strong and salient. It would be interesting to further test whether this superiority of sensory entrainment applies to all sensory modalities or if there is a particular advantage for auditory stimuli when they compete with electrical stimulation. However, answering this question was beyond the scope of our study and needs further investigations with more appropriate paradigms.”

      2. The authors propose that "by applying tACS at the right lag relative to auditory rhythms, we can aid how the brain synchronizes to the sounds and in turn modulate behavior." This should be developed as the authors showed that the tACS lags are highly variable across sessions. According to their results, the optimal lag will vary for each tACS session and subtle changes in the montage could affect the effects.

      We thank the reviewer for this remark. We believe that the right procedure in this case would be to use closed-loop protocols in which the optimal tACS-lag is estimated online, as we discuss in the summary and future directions sub-section. We tried to make this clearer in the same sentence that the reviewer mentioned.

      Page 17, line 506-508: “Since optimal tACS phase was variable across participants and sessions, this approach would require closed-loop protocols where the optimal tACS lag is estimated online (see next section).”

      3. In a related vein, it would be very useful to show the data presented in Figure 3 (panels b,d,e) for all participants to allow the reader to evaluate the quality of the data (this can be added as a supplementary figure).

      Thank you very much for the suggestion. We have added two new supplemental figures (Fig S1 and S2) to show individual data for Fig. 3b and 3e. Note that Fig. 3d already shows the individual data as each circle represents optimal FM-phase for a single participant.

      Reviewer #1 (Recommendations For The Authors):

      Minor comments:

      "was optimized in SimNIBS to focus the electric field as precisely as possible at the target ROI" It appears that some form of constrained optimization was used. It would be good to clarify which method was used, including a reference.

      Indeed, SimNIBS implements a constrained optimization approach based on pre-calculated lead fields. We have added the corresponding reference. All parameters used for the optimization are reported in the methods (see sub-section Electric field simulations and montage optimization). Regarding further specifics, the readers are invited to check the MATLAB code that was used for the optimization which is made available at: https://osf.io/3yutb

      "Thus, each montage has its pros and cons, and the choice of montage will depend on which of these dependent measures is prioritized." Well put. It would be interesting to know if authors considered optimizing for intensity on target. That would give the strongest predicted intensity on target, which seems like an important desideratum. Individualizing for something focal, as expected, did not give the strongest intensity. In fact, the method struggled to achieve the desired intensity of 0.1V/m in some subjects. It would be interesting to have a discussion about why this particular optimization method was selected.

      The specific optimization method used in this study was somewhat arbitrary, as there is no standard in the field. It was validated in prior studies, where it was also demonstrated that it performs favorably compared to alternative methods (Saturnino et al., 2019; Saturnino et al., 2021). The underlying physics of the head volume conductor generally limits the maximally achievable focality, and requires a tradeoff between focality and the desired intensity in the target. This tradeoff depends on the maximal amount of current that can be injected into the electrodes due to safety limits (4 mA in total in our case). Further constraints of the optimization in our application were the simultaneous targeting of two areas, and achieving field directions in the targets roughly parallel to those of auditory dipoles. Given the combination of these constraints, as the reviewer noticed, we could not even achieve the desired intensity of .1V/m in some subjects. As we wanted to stimulate both auditory cortices equally, our priority was to have the E-fields as similar as possible between hemispheres. Future studies optimizing for only one target would be easier to optimize for target intensity (assuming the same maximal total current injection). Alternatively, relaxing the constraint on direction and optimizing only for field intensity would help to increase the field intensities in the targets, but would lead to differing field directions in the two targets. As an example, see Rev. Fig.1 below. We extensively discuss some of these points in the discussion section: “Are individually optimized tACS montage better?” (Pages 21-22).

      Additionally, we added a few sentences in the Results and Methods giving more details about the optimization approach.

      Page 5, lines 115-116: “Using individual finite element method (FEM) head models (see Methods) and the lead field-based constrained optimization approach implemented in SimNIBS (31)”

      Page 27, lines 819-822: “The optimization pipeline employed the approach described in (31) and was performed in two steps. First, a lead field matrix was created per individual using the 10-10 EEG virtual cap provided in SimNIBS and performing electric field simulations based on the default tissue conductivities listed below.”

      Author response image 1.

      E-field distributions for one example participant. Brain maps show the results from the same optimization procedure described in the main manuscript but with no constraint for the current direction (top) or constraining the current direction (bottom). Note that the desired intensity of .1 V/m can be achieved when the current direction is not constrained.

      The terminology of "high-definition HD" used here is unconventional and may confuse some readers. The paper cited for ring electrodes (18) does not refer to it as HD. A quick search for high-definition HD yields mostly papers using many small electrodes, not ring electrodes. They look more like what was called "individualized". More conventional would be to call the first configuration a "ring-electrode", and the "individualized" configuration might be called "individualized HD".

      We thank the reviewer for this remark. We changed the label of the high-definition montage to ring-electrode. Regarding the individualized configuration, we prefer not to use individualized HD as it has the same number of electrodes as the standard montage.

      "So far, we have evaluated whether tACS at different phase lags interferes with stimulus-brain synchrony and modulates behavioral signatures of entrainment" The paper does not present any data on stimulus-brain synchrony. There is only an analysis of behavior and stimulus/tACS phase.

      We agree with the reviewer. To be more careful with this statement, we have modified the sentence to say:

      Page 10, lines 303-304: “So far, we have evaluated whether tACS at different phase lags modulates behavioral signatures of entrainment: FM-amplitude and FM-phase.”

      "However, the strength of the tACS effect was variable across participants." and across sessions, and the phase also was variable across subjects and sessions.

      "tACS-amplitude estimates were averaged across sessions since the session did not significantly affect FM-amplitude (Fig. 5a)." More importantly, the authors show that "tACS-amplitude" was not reproducible across sessions.

      Unfortunately, we did not understand what the reviewer is suggesting here, and would have to ask the reviewer in this case to provide us with more information.

      References

      Kasten FH, Duecker K, Maack MC, Meiser A, Herrmann CS (2019) Integrating electric field modeling and neuroimaging to explain inter-individual variability of tACS effects. Nat Commun 10:5427.

      Riecke L, Sack AT, Schroeder CE (2015) Endogenous Delta/Theta Sound-Brain Phase Entrainment Accelerates the Buildup of Auditory Streaming. Curr Biol 25:3196-3201.

      Riecke L, Formisano E, Sorger B, Baskent D, Gaudrain E (2018) Neural Entrainment to Speech Modulates Speech Intelligibility. Curr Biol 28:161-169 e165.

      Saturnino GB, Madsen KH, Thielscher A (2021) Optimizing the electric field strength in multiple targets for multichannel transcranial electric stimulation. J Neural Eng 18.

      Saturnino GB, Siebner HR, Thielscher A, Madsen KH (2019) Accessibility of cortical regions to focal TES: Dependence on spatial position, safety, and practical constraints. Neuroimage 203:116183.

      Zoefel B, Davis MH, Valente G, Riecke L (2019) How to test for phasic modulation of neural and behavioural responses. Neuroimage 202:116175.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:  

      Reviewer #1 (Public review):  

      Summary:  

      This work examines the binding of several phosphonate compounds to a membrane-bound pyrophosphatase using several different approaches, including crystallography, electron paramagnetic resonance spectroscopy, and functional measurements of ion pumping and pyrophosphatase activity. The work attempts to synthesize these different approaches into a model of inhibition by phosphonates in which the two subunits of the functional dimer interact differently with the phosphonate.  

      Strengths:  

      This study integrates a variety of approaches, including structural biology, spectroscopic measurements of protein dynamics, and functional measurements. Overall, data analysis was thoughtful, with careful analysis of the substrate binding sites (for example, calculation of POLDER omit maps).  

      Weaknesses:  

      Unfortunately, the protein did not crystallize with the more potent phosphonate inhibitors. Instead, structures were solved with two compounds with weak inhibitory constants >200 micromolar, which limits the molecular insight into compounds that could possibly be developed into small molecule inhibitors. Likewise, the authors choose to focus the spectroscopy experiments on these weaker binders, missing an opportunity to provide insight into the interaction between more potent binders and the protein. 

      We acknowledge the reviewer’s concern regarding the choice of weaker inhibitors. We attempted cocrystallization with all available inhibitors, including those with higher potency. However, despite numerous efforts, these potent inhibitors yielded low-resolution crystals, making them unsuitable for detailed structural analysis. Therefore, we chose to focus on the weaker binders, as we were able to obtain high-quality crystal structures for these compounds. This allowed us to perform DEER spectroscopy and monitor conformational TmPPase state ensembles in solution with the added advantage of accurately analysing the data against structural models derived from X-ray crystallography. Using these weaker inhibitors enabled a more precise interpretation of the DEER data, thus providing reliable insights into the conformational dynamics and inhibition mechanism. As suggested by the reviewer, in the revised version we have added new DEER experiments, conditions and analysis on two of the more potent inhibitors (alendronate and pamidronate) to provide additional insight into their interactions. Furthermore, we also included additional DEER data on the cytoplasmic side of TmPPase, at a new site that we identified (with the advantage of being an endogenous cysteine residue) and spin labelled (C599R1), given that the DEER data for the previous T211R1 cytoplasmic site were difficult to interpret owing to the highly dynamic nature of this region. The new pair C599R1 yielded high-quality DEER traces and indicated, more clearly than T211R1, distance distributions consistent with asymmetry across the sampled conditions. Again, as suggested by the reviewer, alendronate and pamidronate DEER measurements were also recorded for this site (cytoplasmic side; C599R1) as well as the periplasmic side (525R1).

      In general, the manuscript falls short of providing any major new insight into membrane-bound pyrophosphatases, which are a very well-studied system. Subtle changes in the structures and ensemble distance distributions suggest that the molecular conformations might change a little bit under different conditions, but this isn't a very surprising outcome. It's not clear whether these changes are functionally important, or just part of the normal experimental/protein ensemble variation. 

      We respectfully disagree with the reviewer. The scale of the motions seen in solution (now also on a new, reliable spin pair, C599R1, located on the cytoplasmic side) corresponds to that seen in the full panoply of crystal structures of mPPases. Some proteins, such as the rotary ATPase, undergo very large conformational changes during catalysis. This one does not, meaning that the precise motions we describe here are relevant and are observed in solution for the first time. Conformational changes in the ensemble, whether large or small, represent essential protein motions that underlie key mPPase catalytic function. These dynamic transitions are extremely challenging to monitor, especially across so many conditions, and our DEER spectroscopy data demonstrate the sensitivity and resolution necessary to monitor these subtle changes in equilibria, even if they are only a few Angstroms. For several of the conditions we investigated by DEER in solution, corresponding X-ray structures have been solved, and the derived distances agree well with the DEER distributions. This further validates the biological relevance of the structures and reveals the complete conformational ensemble, which is intractable using other current approaches. Indeed, some conformational states were previously seen using serial time-resolved X-ray crystallography and were consistent with asymmetry.

      The ZLD-bound crystal structure doesn't predict the DEER distances, and the conformation of Na+ binding site sidechains in the ZLD structure doesn't predict whether sodium currents occur. This might suggest that the ZLD structure captures a conformation that does not recapitulate what is happening in solution/ a membrane. 

      We agree with the reviewer that the ZLD-bound crystal structure does not predict the DEER distances. However, we believe this discrepancy arises from the steric bulk of the ZLD inhibitor, which prevents closure of the hydrolytic centre. Additionally, the absence of Na+ at the ion gate in the ZLD-bound structure suggests that Na+ transport does not occur, a conclusion further supported by our electrometric measurements. We also agree with the reviewer that the distances observed in the DEER experiments might represent a new conformation in solution that is not captured by the static X-ray structure, thereby offering new insights into the dynamic nature of the protein under physiological conditions. This emphasizes the complementarity of DEER and X-ray crystallography and the importance of using both techniques. Finally, the static X-ray structures have not captured the asymmetric conformations that must exist to explain half-of-the-sites reactivity, whereas DEER yields distance distributions that are consistent with asymmetry across all 16 cases tested here (two mutants with eight conditions each).

      Reviewer #2 (Public review):  

      Summary:  

      Crystallographic analysis revealed the asymmetric conformation of the dimer in the inhibitor-bound state. Based on this result, which is consistent with previous time-resolved analysis, authors verified the dynamics and distance between spin introduced label by DEER spectroscopy in solution and predicted possible patterns of asymmetric dimer.  

      Strengths:  

      Crystal structures with inhibitor bound provide detailed coordination in the binding pocket thus useful information for the mPPase field and maybe for drug development.  

      Weaknesses:  

      The distance information measured by DEER is advantageous for verifying the dynamics and structure of membrane protein in solution. However, regarding T211 data, which, as the authors themselves stated, lacks measurement precision, it is unclear for readers how confident one can judge the conclusion leading from these data for the cytoplasmic side. 

      We thank the reviewer for acknowledging the advantages of the DEER methodology for identifying dynamic states of membrane proteins in solution. In our original manuscript, we used two sites in our analysis, S525 (periplasm) and T211 (cytoplasm). S525R1 yielded high-quality DEER data, while T211R1 yielded weak (or no) visual oscillations, leading to broad distributions for the several conditions tested. In the revised manuscript, we have added a third site on the cytoplasmic side (C599R1, located on TMH14), which yielded high-quality DEER data comparable to those of S525R1. Both the C599R1 and S525R1 spin pairs generated distance distributions for all 16 conditions (two mutants with eight conditions each) that were described well by a solution-state ensemble adopting a predominantly asymmetric conformation.

      Furthermore, we have tailored our interpretation of the T211R1 DEER data and refrain from using them to draw conclusions about the TmPPase conformational ensemble in the presence of different inhibitors. However, we still opted to include the T211R1 data in the SI because they confirm an important structural feature of mPPase in solution: the intrinsically dynamic behaviour of loop5-6, where T211 is located. This observation in solution is also consistent with our previous (Kellosalo et al., Science, 2012; Li et al., Nat. Commun, 2016; Vidilaseris et al., Sci. Adv., 2019; Strauss et al., EMBO Rep., 2024) and current X-ray crystallography data. To reiterate, we excluded T211R1 from any analysis relating to mPPase asymmetry, and our conclusions are based entirely on the S525R1 and new C599R1 DEER data, which allowed us to monitor both sides of the membrane.

      The distance information for the luminal site, which the authors claim is more accurate, does not indicate either the possibility or the basis for why it is the ensemble of two components and not simply a structure with a shorter distance than the crystal structure.  

      We thank the reviewer for pointing out this alternative interpretation of our DEER data. We now provide further analysis to show that the DEER data from the reporters on both sides of the membrane are highly consistent with asymmetry (although they cannot completely exclude other explanations), and we have rephrased the text to be inclusive of other possibilities. Importantly, this additional possibility does not affect the current interpretation of the data in our manuscript. Furthermore, we have removed Fig. 6 from the manuscript, and we now include a direct comparison of the in silico predicted distribution from the asymmetric hybrid structure with the eight conditions tested for both mutants (i.e. S525R1 and C599R1).

      Reviewer #3 (Public review):  

      Summary:  

      Membrane-bound pyrophosphatases (mPPases) are homodimeric proteins that hydrolyze pyrophosphate and pump H+/Na+ across membranes. They are attractive drug targets against protist pathogens. Non-hydrolysable PPi analogue bisphosphonates such as risedronate (RSD) and pamidronate (PMD) serve as primary drugs currently used. Bisphosphonates have a P-C-P bond, with its central carbon can accommodate up to two substituents, allowing a large compound variability. Here the authors solved two TmPPase structures in complex with the bisphosphonates etidronate (ETD) and zoledronate (ZLD) and monitored their conformational ensemble using DEER spectroscopy in solution. These results reveal the inhibition mechanism of these compounds, which is crucial for developing future small molecule inhibitors.  

      Strengths:  

      The authors show that seven different bisphosphonates can inhibit TmPPase with IC50 values in the micromolar range. Branched aliphatic and aromatic modifications showed weaker inhibition.  

      High-resolution structures for TmPPase with ETD (3.2 Å) and ZLD (3.3 Å) are determined. These structures reveal the binding mode and shed light on the inhibition mechanism. The nature of modification on the bisphosphonate alters the conformation of the binding pocket.  

      The conformational heterogeneity is further investigated using DEER spectroscopy under several conditions.  

      Weaknesses:  

      The authors observed asymmetry in the TmPPase-ELD structure above the hydrolytic center. The structural asymmetry arises due to differences in the orientation of ETD within each monomer at the active site. As a result, loop5-6 of the two monomers is oriented differently, resulting in the observed asymmetry. The authors attempt to further establish this asymmetry using DEER spectroscopy experiments. However, the (over)interpretation of these data leads to more confusion than any further understanding. DEER data suggest that the asymmetry observed in the TmPPase-ELD structure in this region might be funneled from the broad conformational space under the crystallization conditions. 

      We respectfully disagree with the reviewer. The asymmetry was previously established using time-resolved serial crystallography (Strauss et al., EMBO Rep, 2024) and biochemical assays (e.g. Malinen et al., Prot. Sci., 2022; Artukka et al., Biochem J, 2018; Luoto et al., PNAS, 2013) and was partially seen in one static structure (Vidilaseris et al., Sci Adv 2019). The DEER data here show that the previously proposed asymmetry is also present within the TmPPase conformational ensemble in solution, and this is consistent across all DEER data. Although we cannot rule out the possibility that the TmPPase monomers adopt a metastable intermediate state, in such a case we would expect the distance changes reported by DEER to be symmetric across both membrane sides. Instead, we observe a symmetry breaking between the cytoplasmic and periplasmic TmPPase sites. Indeed, the DEER data yield distance distributions similar to that of the hybrid asymmetric structure under all conditions tested: apo, +Ca, +Ca/ETD, +ETD, +ZLD, +IDP, +PAM and +ALE.

      DEER data for position T211R1 at the enzyme entrance reveal a highly flexible conformation of loop5-6 (and do not provide any direct evidence for asymmetry, Figure EV8).

      Please see the relevant response above. We acknowledge that T211 is situated on a highly dynamic loop that is important for gating, and our DEER data confirm the high flexibility of this protein region. Because we did not observe dipolar oscillations, which leads to broad distributions, we stated in the original manuscript that we would not establish the presence of any asymmetry in solution on the basis of T211. We rely instead on the S525R1 and new C599R1 sites, for which we have acquired high-quality DEER data, as noted by all reviewers. The data at the C599R1 position (on the same cytoplasmic side as T211, for which we have now limited our analysis to a minimum) provide further evidence for asymmetry, including under two new conditions.

      Similarly, data for position S521R1 near the exit channel do not directly support the proposed asymmetry for ETD.  

      The reviewer appears to suggest that we hold the S525R1 DEER data as direct proof of asymmetry. We contest this, on the grounds that directly proving asymmetry would require time-resolved DEER measurements, which are far beyond the scope of this work. Rather, we have applied DEER measurements to explore whether asymmetry (observed previously via time-resolved X-ray crystallography) is also present, or indeed a possibility, in solution. All our S525R1 and C599R1 DEER data (recorded for eight conditions) are consistent with asymmetry (see also the detailed response above).

      Despite the high quality of the data, they reveal a very similar distance distribution. The reported changes in distances are very small (+/- 0.3 nm), which can be accommodated by a change of spin label rotamer distribution alone. Further, these spin labels are located on a flexible loop, thereby making it difficult to directly relate any distance changes to the global conformation

      We thank the reviewer for recognising the high quality of our DEER data for the S525R1 site, which we now complement with a new pair on the cytoplasm-facing side of the membrane (C599R1) that gives DEER data of comparable quality. Visual oscillations in the raw traces, as observed here for both spin pairs, reportedly lead to highly accurate and reliable distributions that are able to separate (in fortuitous cases) helical movements of only a few Angstroms (Peter et al., Nature Comms 13:4396, 2022; Klose et al., Biophys J 120:4842-4858, 2021). The ability of DEER/PELDOR to offer near-Angstrom resolution was also previously demonstrated by the acquisition and solution of high-resolution multi-subunit spin-labelled membrane protein structures (Pliotas et al., PNAS, 2012; Pliotas et al., Nat Struct Mol Biol, 2015; Pliotas, Methods Enzymol, 2017), as was its ability to detect small conformational changes (of similar magnitude to those in mPPase) in different integral membrane protein systems (Kapsalis et al., Nature Comms, 2019; Kubatova et al., PNAS, 2023; Schmidt et al., JACS, 2024; Lane et al., Structure, 2024; Hett et al., JACS, 2021; Zhao et al., Nature, 2024), occurring under different conditions and/or stimuli in solution and/or a lipid environment. The changes here are not below the detection sensitivity of DEER: the two modal distance extremes (+Ca vs +IDP for S525R1) differ by ~7 Angstroms, with all other conditions showing intermediate changes.

      We agree with the reviewer that these changes are relatively small, but they are expected for membrane ion pumps. Indeed, none of the mPPase structures show helical movements of greater than half a turn, and that only in helices 6 and 12. There appear to be larger-scale loop closing motions of the 5-6 loop that includes T211, due to the presence of E217 which binds to one of the Mg<sup>2+</sup> ions that coordinate the leaving group phosphate. This is, inter alia, the reason that this loop is so flexible: it cannot order before substrate is bound.  

      The reviewer suggests that the subtle distance shifts detected arise only from changes in the label rotamer distribution. However, the concerted nature of the modal distance shifts across multiple different conditions at a single labelling site strongly suggests that preferential rotamer orientations are not the cause. Indeed, for so many spin labels to undergo an arbitrary shift such that the modal distance of the entire distribution changes, in the absence of any conformational change, appears improbable. We have the resolution to detect such subtle differences by DEER, given that there are unambiguous shifts in our time-domain data (i.e. in the position of the minimum of the first dipolar oscillation) (Fig 4), and these are reflected in the modal distances of the distributions. We also refrain from performing any quantitative analysis and use only qualitative trends in modal distance shifts, all of which support our proposed model of a symmetry breaking across the membrane face. To further belabour this point, we do not quantify the DEER data (for instance through parametric fitting) to extract populations of different conformational states, and we appreciate that to do so would be highly prone to error; however, we do (and can, we feel, without over-interpretation) assert that the modal distances shift.

      The interpretations listed below are not supported by the data presented:  

      (1) 'In the presence of Ca2+, the distance distribution shifts towards shorter distances, suggesting that the two monomers come closer at the periplasmic side, and consistent with the predicted distances derived from the TmPPase:Ca structure.'

      Problem: This is a far-stretched interpretation of a tiny change, which is not reliable for the reasons described in the paragraph above. 

      While we agree overall with the reviewer's assessment that ±0.3 nm is a small (though not a minor) change, there are literature examples that quantify (or use for quantification) distribution peaks separated by a similar Δr (Kubatova et al., PNAS, 2023; Schmidt et al., JACS, 2024; Hett et al., JACS, 2021; Zhao et al., Nature, 2024). Moreover, the time-domain data clearly indicate that the position of the first minimum of the dipolar oscillation shifts to shorter dipolar evolution time. The time-domain data are considerably more sensitive to subtle changes in dipolar coupling frequency than the distance distributions are.
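
      The sensitivity argument above can be illustrated with a minimal sketch. It assumes the standard point-dipole relation for two nitroxide spins with g close to 2, in which the dipolar frequency in MHz is approximately 52.04 divided by the cube of the distance in nm, and it approximates the first minimum of the oscillation as half a period of the dominant (perpendicular) component. The two distances used (4.5 and 4.2 nm) and the function names are hypothetical and chosen only for illustration; they are not values taken from the manuscript.

```python
# Illustrative sketch only: point-dipole approximation for two nitroxide
# spins (g ~ 2). The distances below are hypothetical examples.
DIPOLAR_CONSTANT_MHZ_NM3 = 52.04  # nu_dd [MHz] ~= 52.04 / r^3, with r in nm


def dipolar_frequency_mhz(r_nm):
    """Perpendicular dipolar coupling frequency (MHz) for a distance in nm."""
    if r_nm <= 0:
        raise ValueError("distance must be positive")
    return DIPOLAR_CONSTANT_MHZ_NM3 / r_nm ** 3


def first_minimum_us(r_nm):
    """Approximate time (microseconds) of the first minimum of cos(2*pi*nu*t)."""
    return 1.0 / (2.0 * dipolar_frequency_mhz(r_nm))


if __name__ == "__main__":
    for r in (4.5, 4.2):
        print(f"r = {r:.1f} nm: nu = {dipolar_frequency_mhz(r):.3f} MHz, "
              f"first minimum near {first_minimum_us(r):.3f} us")
```

      Because the frequency scales with the inverse cube of the distance, a 0.3 nm shift near 4.5 nm changes the frequency by roughly a fifth in this approximation, which is why the position of the first minimum is a useful qualitative readout.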

      Importantly, we have fitted Gaussians to the experimental distance distributions of 525R1 output by the Comparative DEER Analyzer 2.0 and observed a change in the distribution width in the presence of Ca2+, which could imply that the rotameric freedom of the spin label is restricted. However, the CW-EPR spectra for 525R1 indicate that the rotational correlation time of the spin label is highly consistent between conditions (the spectra are almost identical). The change therefore cannot be explained simply by a rotameric preference of the spin label (as asserted by Reviewer 3), as no (further) immobilisation is observed in the CW-EPR spectra on going from the apo state (Figure EV9) to the state in the presence of Ca2+. Furthermore, in the absence of conformational changes, it is reasonable to assume (and demonstrable from the CW-EPR data) that the rotamer cloud should not change significantly between conditions. However, Gaussian fits of the two extreme cases yielding the longest (i.e., in the presence of IDP) and shortest (in the presence of ZLD) modal distances for the 525R1 DEER data indicated significant (i.e., above the noise floor after Tikhonov validation) probability density for the IDP condition at 50 Å (P(r) = 0.18). This lies four standard deviations above the mean of the Gaussian fit to the +ZLD condition, which should occur by random chance with <0.007% probability.
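
      The quoted probability bound can be checked with a short calculation. The sketch below assumes an ideal normal distribution and simply evaluates its tail area beyond four standard deviations; whether a one-sided or two-sided tail was intended is not stated in the response, so both are computed. The function names are illustrative.

```python
import math


def upper_tail(z):
    """One-sided probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def two_sided_tail(z):
    """Probability that a standard normal variable lies beyond +/- z."""
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    z = 4.0
    print(f"one-sided tail at {z} sd: {100 * upper_tail(z):.5f} %")
    print(f"two-sided tail at {z} sd: {100 * two_sided_tail(z):.5f} %")
```

      Under this assumption, both tail areas at four standard deviations fall below the 0.007% figure quoted above.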

      As in the previous response, the method can detect changes of this magnitude, which are not small but are physiologically relevant and expected for integral membrane proteins such as mPPases. Indeed, even in equally (or more) complex systems such as heptameric mechanosensitive channel proteins, DEER provided sub-Angstrom accuracy when a spin-labelled high-resolution XRC structure was solved (Pliotas et al., PNAS, 2012; Pliotas et al., Nat Struct Mol Biol, 2015). Although this is an ideal and not very common case, in which DEER accuracy was experimentally validated by another high-resolution structural method on a modified membrane protein, it demonstrates the power of the method: when strong oscillations are present in the raw DEER data (as here for mPPase S525R1 and C599R1), Angstrom resolution is achievable in such challenging protein classes, even when multiple distances are present.

      (2) 'Based on the DEER data on the IDP-bound TmPPase, we observed significant deviations between the experimental and the in silico distances derived from the TmPPase:IDP X-ray structure for both cytoplasmic- (T211R1) and periplasmic-end (S525R1) sites (Figure 4D and Figure EV8D). This deviation could be explained by the dimer adopting an asymmetric conformation under the physiological conditions used for DEER, with one monomer in a closed state and the other in an open state.'  

      Problem: The authors are trying to establish asymmetry using the DEER data. Unfortunately, no significant difference is observed (between simulation and experiment) for position 525 as the authors claim (Figure 4D bottom panel). The observed difference for position 112 must be accounted for by the flexibility and the data provide no direct evidence for any asymmetry.  

      Reviewer 3 is incorrect in suggesting that we are trying to prove asymmetry through the DEER data. That is a well-known fact in the literature (e.g. Vidilaseris et al, Sci Adv 2019) where we show (1) that the exit channel inhibitor ATC (i.e. close to S525R1) binds better in solution to the TmPPase:PPi complex than the TmPPase:PPi<sub>2</sub> complex, and (2) that ATC binds in an asymmetric fashion to the TmPPase:IDP<sub>2</sub> complex with just one ATC dimer on one of the exit channels. We merely use the DEER data to support this well-established fact.  

      However, because we agree that the DEER data in the presence of IDP do not provide direct proof of asymmetry, particularly for the cytoplasm-facing mutant T211R1, we have refrained from interpreting the T211R1 data beyond showing that this is a highly dynamic loop region (as evidenced by the broad distributions). As pointed out by the reviewer, the differences in distance distributions between conditions observed for T211R1 likely arise from conformational heterogeneity in solution. Furthermore, we now report DEER data for a new site (C599R1), which is also on the cytoplasmic side and yields high-quality DEER data comparable to the S525R1 data (commended for their quality by both reviewers). The C599R1 measurements show that highly similar distributions are observed in all conditions tested; these are inconsistent with the in silico predicted distance distributions from the symmetric X-ray structures, but consistent with an asymmetric hybrid structure (i.e. open-closed) in solution. Importantly, the difference between the fully open (6.8 nm modal distance) and fully closed (4.8 nm modal distance) states of the C599R1 dimer is larger than for the S525R1 dimer pair. Thus, delineating the asymmetric hybrid conformation from the symmetric conformations is more robust.

      (3) 'Our new structures, together with DEER distance measurements that monitor the conformational ensemble equilibrium of TmPPase in solution, provide further solid experimental evidence of asymmetry in gating and transitional changes upon substrate/inhibitor binding.'  

      Problem: See above. The DEER data do not support any asymmetry. 

      We feel that the reviewer's comments here are somewhat unfounded. All the DEER data (for the 525R1 periplasmic and C599R1 cytoplasmic sites) are described most parsimoniously by an asymmetric hybrid structure. In particular, the new C599R1 distance distributions are poorly described by the symmetric X-ray crystal structures, with a conserved modal distance of approx. 5.8 nm throughout the tested conditions that aligns well with the in silico predictions from the asymmetric hybrid structure. Additionally, all S525R1 and C599R1 data well exceed the relevant criteria of the recent white paper from the EPR community (Schiemann et al., 2021, JACS) to be considered reliably interpretable: strong visual oscillations in the raw traces; signal-to-noise ratio w.r.t. modulation depth of > 20 in all cases; replicates performed and added to the main text or supplementary material; near-quantitative labelling efficiency (evidenced by the lack of free spin label signal in the CW-EPR spectra); and analysis using the CDA (now Figure EV10) to avoid confirmation bias.

      While the DEER data do not prove asymmetry, we do not claim proof of asymmetry in the above sentence. We have agreed to rephrase the sentence above as: “Our new structures, together with DEER distance measurements that monitor the conformational ensemble of TmPPase in solution, do not exclude asymmetry in gating and transitional changes upon substrate/inhibitor binding and are consistent with our proposed model.” We feel that this reframed conjecture of asymmetry is well founded; indeed, comparing all 16 experimentally derived DEER distance distributions for the 525R1 and 599R1 sites with in silico modelling performed on the hybridised asymmetric structure (i.e., comprising one monomer bound to Ca2+ and another bound to IDP) yields overlap coefficients (Islam and Roux, JPC B, 2015) of >0.85. This implies that the envelope of the modelled distance distribution is quantitatively contained within the envelope of the experimental distance distributions. Thus, the DEER data support asymmetry (previously observed by time-resolved XRC) in solution. We appreciate that ideally one would measure time-resolved DEER to directly correlate the kinetics of conformational changes within the ensemble with the catalytic cycle of mPPase; this is something we aim to do in the future, but it is far beyond the scope of this study.
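
      As a rough illustration of how an overlap coefficient between a modelled and an experimental distance distribution can be computed, the sketch below uses one common definition: the sum of the pointwise minimum of two distributions that have each been normalised to unit area on a shared distance grid. This definition is an assumption and may differ in detail from the one used in the cited reference. The Gaussian inputs are synthetic placeholders rather than data from the manuscript, and all function names are hypothetical.

```python
import math


def gaussian_on_grid(grid, mean, sd):
    """Unnormalised Gaussian evaluated on a distance grid (synthetic input)."""
    return [math.exp(-0.5 * ((r - mean) / sd) ** 2) for r in grid]


def normalise(p):
    """Scale a non-negative distribution so that its values sum to one."""
    total = sum(p)
    if total <= 0:
        raise ValueError("distribution must have positive total weight")
    return [x / total for x in p]


def overlap_coefficient(p, q):
    """Sum of pointwise minima of two normalised distributions (0 to 1)."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same distance grid")
    p_norm, q_norm = normalise(p), normalise(q)
    return sum(min(a, b) for a, b in zip(p_norm, q_norm))


if __name__ == "__main__":
    grid = [2.0 + 0.01 * i for i in range(601)]       # 2.0 to 8.0 nm
    experimental = gaussian_on_grid(grid, 5.8, 0.40)  # synthetic placeholder
    modelled = gaussian_on_grid(grid, 5.7, 0.35)      # synthetic placeholder
    print(f"overlap = {overlap_coefficient(experimental, modelled):.2f}")
```

      With this definition, identical distributions give a value of 1 and non-overlapping distributions give 0, so a threshold such as 0.85 corresponds to most of the probability mass being shared.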

      Indeed, half-of-the-sites reactivity has been demonstrated in at least the following papers: Vidilaseris et al., Sci. Adv., 2019; Strauss et al., EMBO Rep., 2024; Malinen et al., Prot. Sci., 2022; Artukka et al., Biochem J, 2018; Luoto et al., PNAS, 2013. Half-of-the-sites reactivity requires asymmetry in the mechanism, and therefore asymmetric motions in the active site (viz 211) and exit channel (viz 525). As mentioned above, we have demonstrated this for other inhibitors (Vidilaseris et al 2019) and as part of a time-resolved experiment (Strauss et al 2024). In fact, given the wealth of evidence showing that the symmetrical crystal structures sample a non- or less-productive conformation of the protein, it would be quixotic to propose that the DEER experiments, performed in solution, do not generate asymmetric conformations. Such a proposal certainly would not obey Occam’s razor of choosing the simplest possible explanation that covers the data.

      (4) Based on these observations, and the DEER data for +IDP, which is consistent with an asymmetric conformation of TmPPase being present in solution, we propose five distinct models of TmPPase (Figure 7).  

      Problem: Again, the DEER data do not support any asymmetry and the authors may revisit the proposed models. 

      We have revised the proposed models and limited them to four asymmetric models, to clearly illustrate the apo/+Ca/+Ca:ETD state (model 1) and highlight the distinct binding patterns of the various inhibitors (ETD, ZLD and IDP; models 2-4), which result in a variety of closed/open-open states. In this version, we clarify that the proposed models are not based solely on the DEER data: they are grounded in both current and previously solved structures, and the DEER data, recorded for multiple conditions and inhibitors and for reporters facing opposite sides of the membrane, are highly consistent with one another and additionally consistent with these models.

      (5) 'In model 2 (Figure 7), one active site is semi-closed, while the other remains open. This is supported by the distance distributions for S525R1 and T211R1 for +Ca/ETD informed by DEER, which agrees with the in silico distance predictions generated by the asymmetric TmPPase:ETD X-ray structure'  

      Problem: Neither convincing nor supported by the data 

      We respectfully disagree with the reviewer. However, owing to the conformational heterogeneity of T211R1, we now exclude T211R1 data from quantitative interpretation of changes to the conformational ensemble. Instead, we include new DEER data from site C599R1, which provides high-quality and convincing data that is consistent with asymmetry at the cytoplasmic face, and inconsistent with in silico distance distributions derived from symmetric X-ray crystal structures. Furthermore, the S525R1 distance distributions for the +ETD (corresponding to +Ca/ETD) and +ZLD conditions were directly compared with both the apo-state distance distribution (corresponding to a fully open, symmetric conformation) and the in silico predicted distributions of the asymmetric hybrid structure (corresponding to an open-closed conformation). Overlap coefficients were calculated (given in the main text) that indicated the +ETD (corresponding to +Ca/ETD) and +ZLD S525R1 distributions were more consistent with the apo-state distance distribution. This suggests that while on the cytosolic face of the membrane, an open-closed conformation is favoured, on the periplasmic face, a symmetric open-open conformation is favoured.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):   

      (1) The DEER experiments were performed with the two crystallized inhibitors, ETD and ZLD, along with previously characterized IDP. It would increase the impact of a tighter-binding phosphonate was examined since the inhibitory mechanism of these molecules is of greater interest. 

      We acknowledge the reviewer's concern regarding the choice of weaker inhibitors. We chose to focus on the weaker binders because we were able to obtain high-quality crystal structures for these compounds. This allowed us to perform DEER spectroscopy with the added advantage of accurately analysing the data against structural models derived from X-ray crystallography. In the revised version, we also include results for alendronate and pamidronate, two of the tighter-binding inhibitors, which are similar to and consistent with those for the other inhibitors.

      (2) I'm not able to find the concentrations of ETD and ZLD used for the DEER experiments. This information should be added to the Methods section on sample prep for EPR. 

      The information is already given in the Methods section on sample preparation for EPR spectroscopy (page 24), where we indicated that the protein aliquots were incubated with a final concentration of 2 mM inhibitors or 10 mM CaCl2 (30 min, RT). However, we recognise that this may not have been sufficiently clear. To clarify, we now explicitly state that the concentration of ETD and ZLD (amongst other inhibitors) used for the DEER experiments was 2 mM.

      (3) There should be additional detail about the electrometry replicates. Does "triplicate" mean three measurements on the same sensor, three different sensors, and different protein preparations? At a minimum, data should be collected from three different sensors to ensure that the negative results (lack of current) for ETD and ZLD are not due to a failed sensor prep. In addition, Data from the other replicates should be shown in a supplementary figure, either the traces, or in a summary figure. Are the traces shown collected on the same sensor? They could be, in principle, since the inhibitor is washed away after each perfusion. 

      Yes, by 'triplicate', we mean three measurements taken on the same sensor. All traces shown were collected from a single sensor. Thank you for your advice; we now show here additional data from other sensors that display the same pattern. As for the possibility of a failed sensor preparation, this is unlikely since we always ensure the sensor quality with the substrate (PPi) as a positive control after each measurement.

      Author response image 1.

      (4) I'm confused by the NEM modification assay, and I don't think there is enough information in this manuscript for a reader to figure out what is happening. Why is the protein active if an inhibitor is present? I understand that there is a conformational change in the presence of the inhibitor that buries a cysteine, but the inhibitor itself should diminish function, correct? Is the inhibitor removed before testing the function? In addition, it would be clearer if the cysteines that are modified are indicated in the main text. I don't understand what is being shown in Figure Ev2. Shouldn't the accessible cysteines in the apo form be shown? Finally, the sentence "IDP has been reported to prevent the NEM modification..." does not make sense to me. Should the word "by" be removed from this sentence? 

      We apologize for the confusion. Yes, the inhibitors were removed before testing the protein function. In Figure EV2, the accessible cysteines are shown for both the apo and IDP-bound states. As seen, the accessible cysteines in the IDP-bound states are fewer than those in the apo state, meaning fewer cysteines are available for modification. Consequently, more activity is retained when IDP binds due to the reduction in accessible cysteines. We have addressed this in the manuscript (see the method section on the NEM modification assay).

      (5) Why does the model in Figure 7 show the small molecules bound to only one subunit, when they are crystallized in both subunits? 

      We propose that the binding of small molecules to both subunits in the crystal structure is likely a result of substrate inhibition, given the excess inhibitor used during crystallisation (e.g. Artukka, et al., Biochemical Journal, 2018; Vidilaseris, et al., Science Advances, 2022). Our PELDOR data indicate that, in solution, TmPPase with small molecules bound is in an intermediate state between both subunits being closed and both being open, most likely with at least one subunit in an open state. This is also consistent with previous kinetic studies (Anashkin, V. A., et al., International Journal of Molecular Sciences, 22, 2021), which showed that the binding constant of IDP to the second subunit is around 120 times higher than that of the first subunit.

      (6) The authors argue that the two ETDs bound in the two protomers adopt distinct conformations. Can this be further supported, for example, by swapping the position of the two ETDs between the two protomers and calculating a difference map (there should be corresponding negative/positive density if the modelling of the two different conformations is robust)? 

      As per the reviewer's suggestion, we swapped the positions of the two ETDs between the protomers and calculated the difference electron density map. This analysis, presented in Figure EV3, reveals corresponding negative and positive electron density peaks, indicating that the ETDs indeed adopt distinct conformations in each protomer, supporting the accuracy of our modeling.

      (7) Are the changes in loop conformation possibly due to crystal packing differences for the two protomers? 

      We examined the crystal packing of the two protomers and found no interactions at the loop regions (coloured red in Author response image 2 below) that could be attributed to crystal packing differences. Therefore, we rule out this possibility.

      Author response image 2.

      (8) Typos:  

      Legend for Figure EV2 cystine - cysteine  

      Page 14, last sentence of the first paragraph: further - further  

      Figure 6 legend: there is no reference to panel B.  

      Thanks for pointing out the typos; they are now fixed.

      Reviewer #2 (Recommendations for the authors):  

      (1) T211 is located on the same loop where ligand/inhibitor-coordinating side chains (E217, D218) are located. It has not been tested whether spin labeling here would affect inhibitor binding. 

      We tested the activity of all mutants before spin labelling, but not the activity of the spin-labelled mutants. MTSSL spin labels are typically not structurally perturbing. In particular, the T211R1 site that the reviewer is referring to is now not included in our interpretation of conformational changes occurring during mPPase’s functional cycle.

      (2) Why should the spin label be introduced to T211, which is recognized as a flexible region in the crystal structure? Authors should search for suitable residues except for T211 and other residues in this loop to evaluate the cytoplasmic distance. 

      We acknowledge the reviewer’s concern regarding the flexibility of the T211 region for spin labelling. Given the challenges associated with TmPPase, including reduced protein expression, loss of function, or inaccessibility upon spin labelling at certain sites, we have explored alternative residues. After extensive testing, we identified C599 as a suitable site for spin labelling, resulting in high-quality DEER data. The results from spin labelling at C599 have been incorporated into the revised manuscript.

      (3) On the other hand, DEER data for S525 is solid, as the authors stated. This residue is located on the luminal side of the enzyme. However, the description of the luminal side structure and the comparison of symmetric/asymmetric dimer in this part are missing in the paper.

      We thank the reviewer for their positive assessment of the S525R1 DEER data. The data for the 525 and now also the 599 spin pairs are indeed solid, given the strong visible oscillations we observed, particularly in such a challenging system.

      We presented the periplasmic sites in the crystal structure dimer (Figure 4A), highlighting both the symmetrical region and the asymmetric model in Figure 4. In the revised version, we include additional details about this region and our rationale for labeling at position S525.

      (4) The conclusion models (Figure 7) are misleading. In the crystal structure, the 5-6Loop distance between each monomer should be close given the location of the dimer interface, and the actual distance between T211 in the structure (for example, in 5lzq) is about 10A. Nevertheless, the model depicts this distance longer than S525 (40.7A in 5LZQ), which would give a false impression. 

      We would like to apologize for the misleading model. We have now corrected the models to ensure they are consistent with their respective regions in the crystal structures.

      (5) P8 last paragraph  

      It is hard to imagine that in a crystal lattice, the straight inhibitor always binds to monomer A, and the neighboring monomer is always attached to a slightly tilted inhibitor, which causes asymmetry. For example, wouldn't it mean that it would first bind to one of them, which would then affect the neighboring monomer via 5-6 Loop, which would then affect its binding pose? So in this case, the inhibitor did not ARAISE asymmetry, and this is where it is misleading for readers. 

      We apologize for the confusion. What we intended to convey is that the first inhibitor binds to one protomer, which then affects the conformation of the neighbouring monomer, ultimately influencing its binding pose. This is required for half-of-the-sites reactivity, which is well-established in this system. This is reflected in our crystal structure, where we observed asymmetry in the loop 5-6 region and the ETD orientation between the two protomers. We have addressed this in the manuscript accordingly.

      (6) P11 L4 EV10 instead of EV8? 

      Thanks for pointing this out. We have corrected it accordingly.

      (7) P11 L5 It is difficult to determine whether the peak is broad or sharp. Should be evaluated quantitatively by showing the half-value width of the peak. This may also be helpful to judge whether the peak is a mixture of two components or a single one. 

      We have taken this analysis out and rephrased the offending sentence. As the Reviewer suggested, we have also added the FWHM values, together with the corresponding standard deviations of the distance distributions (approximating each distribution as a Gaussian).
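
      For readers comparing the two reported width measures, the following is a minimal sketch of the standard relation between the FWHM and the standard deviation of a Gaussian. It assumes only the Gaussian approximation mentioned above; the function names are illustrative and are not taken from the manuscript.

```python
import math

# For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma (roughly 2.355 * sigma).
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))


def fwhm_from_sigma(sigma: float) -> float:
    """Full width at half maximum for a Gaussian with standard deviation sigma."""
    return FWHM_PER_SIGMA * sigma


def sigma_from_fwhm(fwhm: float) -> float:
    """Standard deviation of a Gaussian with the given FWHM (same length unit)."""
    return fwhm / FWHM_PER_SIGMA
```

      Both functions keep whatever length unit is passed in, so the conversion applies equally to distances given in nm or in Å.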

      (8) Throughout the paper, the topology of the enzyme may be difficult to follow for readers who are not experts in this field. Please indicate the membrane plane's location or a figure's viewpoint in the caption. 

      We acknowledge the importance of making our figures accessible to all readers. In the revised manuscript, we have enhanced the clarity of our figures by explicitly indicating the membrane plane’s location and specifying the viewpoint in each figure caption. For example, we have added annotations such as “Top view of the superposition of chain A (cyan) and chain B (wheat), showing the relative movements (black arrow) of helices. The membrane plane is indicated by dashed lines.”

      (9) Figure 2B Check the color of the helix.  

      IDP and ETD are almost the same color, so it is difficult to see the superposition. It would be easier to understand the reading by, for example, using a lighter or transparent color set only for IDPs.  

      We acknowledge the reviewer's concern regarding the colour similarity between the IDP and ETD in Figure 2B, which hinders clear differentiation. To enhance visual distinction, we have adjusted the colour scheme by changing the TmPPase:IDP structure colour to light blue. This modification improves the clarity of the superposition, making the structural differences more discernible.

      (10) Figure 2C Check the coordination state (dotted line), there appears to be coordination between E217Cg and Mg. Also, water that is located near N492 appears to be a bit distant from Mg, why does this act as a ligand? Stereo view or view from different angles, and distance information would help the reader understand the bonding state in more detail.  

      Yes, we confirm that Mg²⁺ is coordinated by the oxygen atoms from both the side chain and main chain of residue E217. The water molecule near N492 is not directly coordinated with Mg²⁺ but interacts with the O5 atom of one of the phosphate groups in ETD. To enhance clarity, we have updated Figure 2C (and other related figures) to include stereo views.

      (11) Figure 5A: in the Bottom view (lower left), the symmetric dimer does not look symmetric. Better to view from a 2-fold axis exactly.  

      We have taken this figure out entirely and instead added a direct comparison of the in silico predicted distribution from the asymmetric hybrid structure to all 16 experimental DEER distributions. We have added the symmetric and asymmetric structures to Fig. 4A and now view the symmetric structure along the 2-fold axis, as suggested.

      (12) Figure 5B: Indicate which data is plotted in the caption.  

      As mentioned above, we have taken this figure out, as we felt that quantifying two overlapping populations from a single Gaussian was an over-interpretation of the data, and at the suggestion of Reviewer 3, we have tailored our interpretation here.

      (13) Figure EV8:  

      Because the authors discuss a lot about their conclusive model based on this data, Figure EV8 should be treated as a main figure, not a supplement. However, this reviewer has serious concerns about the measurement in this figure. Because DEER for T211 is too noisy, I don't see the point in discussing this in detail. For example, in the Ca/ETD data, there is a peak near 50A, but it would be difficult for TM5 to move away from this distance unless the protein unfolds. I do not find it meaningful to discuss using measurement results in which such an impossible distance is detected as a peak.  

      A: Show top view as in Figure 5  

      D: 2nd row dotted line. Regarding the in silico model that is used as a reference to compare the distance information, the distance of 40-50 A for T211 in the Ca-bound form is hard to imagine. PDB 4av6 model shows that T211 is disordered and not visible, but given the position of the TM5 helix, it does not appear to be that different from the IDR binding structure (5LZQ, 10A between two T211). The structures of in silico models are not shown in the figure, as they are only mentioned as modeled in RoseTTAFold. Please indicate their structures, especially focused on the relative orientation of T211 and S525 in the dimer, which would allow readers to determine the distances.

      We acknowledge the reviewer’s concerns regarding Figure EV8 and the DEER data for T211R1. Upon re-evaluation, we recognize that the non-oscillating nature of the DEER data for T211R1 leads to broad distributions, indicating increased conformational dynamics, which is expected for a highly dynamic loop. Consequently, we have limited the discussion and interpretation of T211R1 in the revised manuscript and focused more on C599R1.

      Reviewer #3 (Recommendations for the authors):  

      A careful interpretation of the data in view of these limitations and without directly linking to asymmetry could solve the problem of the over-interpretation of the DEER data.  

      We respectfully disagree with the reviewer. Please see our detailed response above.  

      Additional comments:  

      (1) Did the authors use a Cys-less construct for spin labeling and DEER experiments?  

      We utilized a nearly Cys-less construct in which all native cysteines were mutated to serine, except for Cys183, which was retained due to its buried location and functional importance. We then introduced single cysteine mutations for spin labelling. For C599, Ser599 was reverted to cysteine.

      (2) The time data for position T211R1 is too short for most cases (Figure EV8D) for a reliable distance determination. No confidence interval is given for the '+Ca' sample distance distributions.  

      We recorded longer time traces for two of the conditions to better assign the background. We did not use the T211R1 data to reach any conclusions regarding asymmetry, which were based on the S525R1 and C599R1 data. We now include the T211R1 data simply to indicate the high mobility observed at loop 5-6. We have added the confidence interval for the +Ca condition.

      (3) It is recommended to mention the 2+1 artefact obvious at the end of the DEER data. 

      In the methods section, we mention that the “2+1” artefact present at the end of the S525R1 and T211R1 DEER data likely arises from using a 65 MHz offset rather than an 80 MHz offset (as for the C599R1 data), which avoids significant overlap of the pump and detection pulses. We also mention that, owing to the intense “2+1” artefact, we truncated it away to minimise its impact on data treatment. We used the lower offset of 65 MHz to maximise the achievable signal-to-noise ratio (SNR) because, particularly for the T211R1 data, the detected echo was quite weak. This was further exacerbated by the poor transverse relaxation time observed at that site.

      (4) Please check the number of significant digits for all the reported values. 

      We have addressed the number of significant digits as requested.

      (5) Please report the mean distances from DEER experiments with the standard deviation or FWHM.

      We have addressed this in the revised manuscript: we report modal distances rather than mean distances and provide the FWHM and standard deviation.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Weaknesses:

      (1) Only Experiment 1 of Rademaker et al (2019) is reanalyzed. The previous study included another experiment (Expt 2) using different types of distractors which did result in distractor-related costs to neural and behavioral measures of working memory. The Rademaker et al (2019) study uses these two results to conclude that neural WM representations are protected from distraction when distraction does not impact behavior, but conditions that do impact behavior also impact neural WM representations. Considering this previous result is critical for relating the present manuscript's results to the previous findings, it seems necessary to address Experiment 2's data in the present work

      We thank the reviewer for the proposal to analyze Experiment 2, where subjects completed the same type of visual working memory task but had either a flashing orientation distractor or a naturalistic (gazebo or face) distractor present during two-thirds of the trials. As the reviewer points out, unlike Experiment 1, these two conditions in Experiment 2 had a behavioral impact on recall accuracy when compared to the blank delay. We have now run the temporal cross-decoding analysis, temporally-stable neural subspace analysis, and condition cross-decoding analysis in Experiment 2. The results from the stable subspace analysis are presented in Figure 3, while the results from the temporal cross-decoding analysis and condition cross-decoding analysis are presented in the Supplementary Data.

      First, we are unable to draw strong conclusions from the temporal cross-decoding analysis, as the decoding accuracies across time in Experiment 2 are much lower compared to Experiment 1. In some ROIs of the naturalistic distractor condition we see that some diagonal elements are not part of the above-chance decoding cluster, making it difficult to draw any conclusions regarding dynamic clusters. We do see some dynamic coding in the naturalistic condition in V3 where the off-diagonals do not show above-chance decoding. Since the temporal cross-decoding provides low accuracies, we do not examine the dynamics of neural subspaces across time.

      We do, however, run the stable subspace analysis on the flashing orientation distractor condition. Just like in Experiment 1, we examine temporally stable target and distractor subspaces. When projecting the distractor onto the working memory target subspace, we see a higher overlap between the two as compared to Experiment 1. A similar pattern is seen also when projecting the target onto the distractor subspace. We still see an above-chance principal angle between the target and distractor; however, this angle is qualitatively smaller compared to Experiment 1. This shows that the degree of separation between the two neural subspaces is impacted by behavioral performance during recall.

      (2) Primary evidence for 'dynamic coding', especially in the early visual cortex, appears to be related to the transition between encoding/maintenance and maintenance/recall, but the delay period representations seem overall stable, consistent with previous findings

      We agree with the reviewer that we primarily see dynamic coding at the transition from encoding to maintenance and at the end of the maintenance period, implying that the WM representations are stable in most ROIs. The only place where we argue that we might see more dynamic coding during the delay itself is in V1 during the noise distractor trials in Experiment 1.

      (3) Dynamicism index used in Figure 1f quantifies the proportion of off-diagonal cells with significant differences in decoding performance from the diagonal cell. It's unclear why the proportion of time points is the best metric, rather than something like a change in decoding accuracy. This is addressed in the subsequent analysis considering coding subspaces, but the utility of the Figure 1f analysis remains weakly justified.

      We agree that other metrics can also provide a summary of dynamics; here, the dynamicism index just acts as a summary visualizing the dynamic elements. It offers an intuitive way to visualize peaks and troughs of the dynamic code across the extent of the trial.
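
      As a rough illustration of such a summary, the sketch below computes, for each time point, the proportion of off-diagonal cells in that time point's row and column of the temporal cross-decoding matrix that were flagged as dynamic. Pooling row and column, the input format (a square boolean matrix), and the function name are assumptions made for illustration; the manuscript's exact definition may differ.

```python
def dynamicism_index(dynamic):
    """Per-time-point proportion of off-diagonal cells flagged as dynamic.

    `dynamic` is a square list of lists of booleans; dynamic[i][j] is True
    when decoding for train time i / test time j is significantly lower
    than the corresponding on-diagonal decoding (assumed to be precomputed).
    """
    n = len(dynamic)
    if n < 2 or any(len(row) != n for row in dynamic):
        raise ValueError("expected a square matrix with at least 2 time points")
    index = []
    for t in range(n):
        # Pool the off-diagonal cells of row t and column t.
        cells = [dynamic[t][j] for j in range(n) if j != t]
        cells += [dynamic[i][t] for i in range(n) if i != t]
        index.append(sum(cells) / len(cells))
    return index


# Toy 3x3 example: only the first/last time-point pair is flagged as dynamic.
toy = [
    [False, False, True],
    [False, False, False],
    [True, False, False],
]
toy_index = dynamicism_index(toy)
```

      Plotting such an index against time gives one value per time point, which is what makes peaks and troughs of the dynamic code easy to read off.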

      (4) There is no report of how much total variance is explained by the two PCs defining the subspaces of interest in each condition, and timepoint. It could be the case that the first two principal components in one condition (e.g., sensory distractor) explain less variance than the first two principal components of another condition.

      We thank the reviewer for this comment. We have now included the percent variance explained by the two PCs in both the temporally-stable target and distractor subspace analysis and the dynamic subspace analysis. The percent variance explained is comparable across analyses: the first PC explains 43-50% and the second 28-37%. The PCs within each analysis (dynamic no-distractor, orientation and noise distractor; temporally-stable target and distractor) are even closer in range (Figure 2c and 3d).

      (5) Converting a continuous decoding metric (angular error) to "% decoding accuracy" serves to obfuscate the units of the actual results. Decoding precision (e.g., sd of decoding error histogram) would be more interpretable and better related to both the previous study and behavioral measures of WM performance.

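      We thank the reviewer for the comments. FCA is a linear function of the absolute angular error $|\Delta\theta|$ that uses the following equation:

      $$\mathrm{FCA} = \left(1 - \frac{|\Delta\theta|}{180^\circ}\right) \times 100\%$$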

      We think that the FCA does not obfuscate the results, but instead provides an intuitive scale where 0% accuracy corresponds to a 180° error, 50% to a 90° error and so on. This also makes it easy to reverse-calculate the absolute error if need be. Our lab has previously used this method in other neuroimaging papers with continuous variables (Barbieri et al. 2023, Weber et al. 2024).
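
      A minimal sketch of this conversion and its inverse, assuming only the linear mapping implied by the anchor points stated above (0% at a 180° error, 50% at a 90° error, and therefore 100% at a 0° error). The function names are illustrative and are not taken from the authors' code.

```python
def fca_percent(abs_error_deg: float) -> float:
    """Map an absolute angular error in [0, 180] degrees linearly onto 100..0 %."""
    if not 0.0 <= abs_error_deg <= 180.0:
        raise ValueError("absolute angular error must lie in [0, 180] degrees")
    return (1.0 - abs_error_deg / 180.0) * 100.0


def error_from_fca(fca: float) -> float:
    """Inverse mapping: recover the absolute angular error (degrees) from FCA (%)."""
    if not 0.0 <= fca <= 100.0:
        raise ValueError("FCA must lie in [0, 100] percent")
    return (1.0 - fca / 100.0) * 180.0
```

      Because the mapping is linear, differences in FCA translate directly into differences in absolute error: one percentage point of FCA corresponds to 1.8° of error.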

      We do, however, agree that “% decoding accuracy” does not provide an accurate reflection of the metric used. We have thus now changed “% decoding accuracy” to “Accuracy (% FCA)”.

      (6) This report does not make use of behavioral performance data in the Rademaker et al (2019) dataset.

      We have now analyzed Experiment 2 which, as previously mentioned by the reviewer and unlike Experiment 1, showed a decrease in recall accuracy during the two distractor conditions. We address the results from Experiment 2 in a previous response (please see Weaknesses 1).

      We do not, however, relate single-subject behavioral performance to neural measurements, as we do not think there is enough power to do so with the small number of subjects in both Experiments 1 and 2.

      (7) Given there were observed differences between individual retinotopic ROIs in the temporal cross-decoding analyses shown in Figure 1, the lack of data presented for the subspace analyses for the corresponding individual ROIs is a weakness

      We have now included an additional supplementary figure that shows individual plots of each ROI for the temporally stable subspace analysis for both Experiment 1 and Experiment 2 (Supplementary Figure 5). 

      Reviewer #1 (Recommendations For The Authors):

      (1) Is there any relationship between stable/dynamic coding properties and aspects of behavioral performance? This seems like a major missed opportunity to better understand the behavioral relevance or importance of the proposed dynamic and orthogonal coding schemes. For example, is it the case that participants who have more orthogonal coding subspaces between orientation distractor and remembered orientation show less of a behavioral consequence to distracting orientations? Less induced bias? I know these differences weren't significant at the group level in the original study, but maybe individual variability in the metrics of this study can explain differences in performance between participants in the reported dataset

      As mentioned in the previous response, we do not run individual correlations between dynamic or orthogonal coding metrics and behavioral performance, because of the small number of subjects in both experiments. We believe that for a brain-behavior correlation between average behavioral error of subjects and an average brain measure, we would need a larger sample size.  

      (2) The voxel selection procedure differs from the original study. The authors should add additional detail about the number of voxels included in their analyses, and how this number of voxels compares to that used in the original study.

      We have now added a figure summarizing the number of voxels selected across participants. We do select fewer voxels compared to Rademaker et al. 2019 (see their Supplementary Tables 9 and 10 and our Supplementary Figure 8). For example, we have ~500 voxels on average in V1 in Experiment 1, while the original study had ~1000. As mentioned in the methods, we aimed to select voxels that reliably responded to both the perception localizer conditions and the working memory trials.

      (3) Lines 428-436 specify details about how data is rescaled prior to decoding. The procedure seems to estimate rescaling factors according to some aspect of the training data, and then apply this rescaling to the training and testing data. Is there a possibility of leakage here? That is - do aspects of the training data impact aspects of the testing data, and could a decoder pick up on such leakage to change decoding? It seems this is performed for each training/testing timepoint pair, and so the temporal unfolding of results may depend on this analysis choice.

      Thank you for the suggestion. To prevent data leakage, the mean and standard deviation are computed exclusively from the training set. These scaling parameters are then applied to the test set, ensuring that no information from the test set influences the training process. This transformation simply adjusts the test set to the same scale as the training data, without exposing the model to unseen test data during training.
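
      The procedure can be sketched as follows. This is a minimal standard-library illustration of train-only standardization, not the authors' pipeline; the per-feature (per-voxel) scaling, the use of the population standard deviation, and the helper names `fit_scaler` and `apply_scaler` are assumptions.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Estimate per-feature mean and SD from the training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    sds = [pstdev(col) or 1.0 for col in columns]
    return means, sds


def apply_scaler(rows, means, sds):
    """Rescale rows with previously fitted parameters (never refit on test data)."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Toy data: two training trials and one test trial, two features each.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[5.0, 20.0]]

means, sds = fit_scaler(train)             # parameters come from train only
train_z = apply_scaler(train, means, sds)
test_z = apply_scaler(test, means, sds)    # test data reuse the same parameters
```

      The test trial is never passed to `fit_scaler`, so nothing about the test set can influence the scaling parameters or the decoder trained on `train_z`.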

      (4) Figure 1d, V1: it looks like the 'dynamics' are a bit non-symmetric - perhaps the authors could comment on this detail of the results? Why would we expect there would be a dynamic cluster on one side of the diagonal, but not the other? Given that this region, condition is the primary evidence for a dynamic code that's not related to the beginning/end of delay (see other comments), figuring this out is of particular importance.

      We thank the reviewer for this question. We think that this is just due to small numerical differences in the upper and lower triangles of the matrix, rather than a neuroscientifically interesting effect. However, this is only a speculative observation.

      (5) I think it's important to address the issue I raised in "weaknesses" about variance explained by the top N principal components in each condition. What are we supposed to learn from data projected into subspaces fit to different conditions if the subspaces themselves are differently useful?

      Thank you, this has now been addressed in a previous comment (please see Weakness 4). 

      Reviewer #2:

      Weaknesses:

      (1) An alternative interpretation of the temporal dynamic pattern is that working memory representations become less reliable over time. As shown by the authors in Figure 1c and Figure 4a, the on-diagonal decoding accuracy generally decreased over time. This implies that the signal-to-noise ratio was decreasing over time. Classifiers trained with data of relatively higher SNR and lower SNR may rely on different features, leading to poor generalization performance. This issue should be addressed in the paper.

      We thank the reviewer for raising this issue and we have now run three simulations that aim to address whether a changing SNR across time might create dynamic clusters. 

      In the first simulation, we created a dataset of 200 voxels that have a sine or cosine response function to orientations from 1° to 180°, the same orientations as the remembered target. A circular shift is applied to each voxel to vary the preferred (or maximal) response of each simulated voxel. We then assess the decoding performance under different SNR conditions during training and testing. For each of the seven iterations, we selected 108 responses (out of 180) to train on and 108 to test on. To increase variability, the selected trials differed in each iteration. Random white noise was applied to the data, and the SNR was thus independently scaled according to the specified levels for train and test data. We then use the same pSVR decoder as in the temporal cross-decoding analysis to train and test.

      The second and third simulations more directly address whether increased noise levels would induce the decoder to rely on different features of the no-distractor and noise distractor data. We use empirical data from the primary visual cortex (V1; where dynamic coding was seen in the noise distractor trials) under the no-distractor and noise distractor conditions for the second and third simulations, respectively. Data from time points 5.6–8.8 seconds after stimulus onset are averaged across five TRs. As in the first simulation, SNR is systematically manipulated by adding white noise. Additionally, to see whether the initial decrease in SNR and subsequent increase would result in dynamic coding clusters, we initially increased and subsequently decreased the amplitude of added noise. The same pSVR decoder was used to train and test on the data with different levels of added noise.

      We see an absence of dynamic elements in the SNR cross-decoding matrices, as the decoding accuracy primarily depends on the training data rather than test data. This results in some off-diagonal values in the decoding matrix that are higher, rather than smaller, than corresponding on-diagonal elements.

      We have now added a Methods section explaining the simulations in more detail and Supplementary Figure 9 showing the SNR cross-decoding matrices. 
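
      The data-generation step of the first simulation described above might be set up roughly as follows. This is a hedged sketch, not the authors' code: the alternation between sine and cosine tuning, the doubling of the angle (so that tuning has a 180° period), the definition of SNR as signal SD divided by noise SD, and all names are assumptions. The pSVR decoding step is not shown.

```python
import math
import random


def simulate_voxels(n_voxels=200, n_orientations=180, snr=1.0, seed=0):
    """Return data[v][i]: noisy response of voxel v to orientation i + 1 degrees."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    rng = random.Random(seed)
    # A unit-amplitude sinusoid sampled over a full period has SD 1/sqrt(2);
    # the noise SD is chosen so that signal SD / noise SD equals `snr`.
    noise_sd = (1.0 / math.sqrt(2.0)) / snr
    data = []
    for v in range(n_voxels):
        tuning = math.sin if v % 2 == 0 else math.cos  # assumed alternation
        clean = [
            tuning(2.0 * math.radians(theta))  # doubled angle: 180-degree period
            for theta in range(1, n_orientations + 1)
        ]
        # The circular shift varies the preferred orientation across voxels.
        shift = rng.randrange(n_orientations)
        clean = clean[shift:] + clean[:shift]
        data.append([x + rng.gauss(0.0, noise_sd) for x in clean])
    return data


# Independent SNR levels for the training and test sets, as in a cross-SNR design.
train_data = simulate_voxels(snr=2.0, seed=1)
test_data = simulate_voxels(snr=0.5, seed=2)
```

      Crossing several training and test SNR levels in this way yields a train-by-test decoding matrix analogous to the temporal cross-decoding matrix, but with a fixed signal pattern.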

      (2) The paper tests against a strong version of stable coding, where neural spaces representing WM contents must remain identical over time. In this version, any changes in the neural space will be evidence of dynamic coding. As the paper acknowledges, there is already ample evidence arguing against this possibility. However, the evidence provided here (dynamic coding cluster, angle between coding spaces) is not as strong as what prior studies have shown for meaningful transformations in neural coding. For instance, the principal angle between coding spaces over time was smaller than 8 degrees, and around 7 degrees between sensory distractors and WM contents. This suggests that the coding space for WM was largely overlapping across time and with that for sensory distractors. Therefore, the major conclusion that working memory contents are dynamically coded is not well-supported by the presented results.

      We thank the reviewer for this comment. The principal angles we calculate are above-baseline, meaning that we subtract the within-subspace principal angles from the between-subspace principal angles and take the average. Thus, a 7-degree difference does not imply that there are only 7 degrees separating e.g. the sensory distractor from the target; it just indicates that the separation is 7 degrees above chance.
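
      The baseline correction can be illustrated with a short sketch. The pairing of between- and within-subspace angles (for example, one pair per cross-validation split), the function name, and the numbers in the example are all hypothetical and chosen only to show the arithmetic; they are not values from the study.

```python
def above_baseline_angle(between_deg, within_deg):
    """Mean of (between-subspace angle - within-subspace angle), in degrees."""
    if not between_deg or len(between_deg) != len(within_deg):
        raise ValueError("expected two non-empty, equally long lists of angles")
    diffs = [b - w for b, w in zip(between_deg, within_deg)]
    return sum(diffs) / len(diffs)


# Hypothetical angles: the raw between-subspace angles are about 40 degrees,
# yet the above-baseline value is only 7 degrees.
example = above_baseline_angle([40.0, 44.0], [34.0, 36.0])
```

      In this made-up example the subspaces are separated by roughly 40° in absolute terms, while the reported quantity would be 7°, which is the sense in which the value is "above chance" rather than an absolute separation.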

      (3) Relatedly, the main conclusions, such as "VWM code in several visual regions did not generalize well between different time points" and "VWM and feature-matching sensory distractors are encoded in separable coding spaces" are somewhat subjective given that cross-condition generalization analyses consistently showed above chance-level performance. These results could be interpreted as evidence of stable coding. The authors should use more objective descriptions, such as 'temporal generalization decoding showed reduced decoding accuracy in off-diagonals compared to on-diagonals'.

      Thank you, we agree that our previous claims might have been too strong. We have now toned down our statements in the Abstract and use “did not fully generalize” and “VWM and feature-matching sensory distractors are encoded in coding spaces that do not fully overlap.”

      Reviewer #2 (Recommendations For The Authors):

      Weakness 1 can potentially be addressed with data simulations that fix the signal pattern, vary the noise pattern, and perform the same temporal generalization analysis to test whether changes in SNR can lead to seemingly dynamic coding formats.

      Thank you for the great suggestion. We have now run the suggested simulations. Please see above (response to Weakness 1).

      There are mismatches in the statistical symbols shown in Figure 4 and Supplementary Table 2. It seems that there was a swap between the symbols for the noise between-condition and noise within-condition.

      Thank you, this has now been fixed.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review): 

      (1) In Figure 1, the authors show that TF3C binds to the amino terminus of MYCN (Myc box I region), as shown previously. The data in Figure 1 B-D support, but do not rigorously confirm a 'direct' interaction because it has not been ruled out that accessory proteins mediating the association may be present in the mixture.

      In Figure 1B-D we have purified MYCN and the TFIIIC/TauA complex separately and then mixed the purified preparations, demonstrating that the purified proteins interact. We have additionally performed mass spectrometry, which shows that the TauA/MYCN complex is formed without further accessory proteins, as the molecular weight would be higher. Based on the Coomassie stained SDS-PAGE gels, there is no plausible contaminating band in the purified complex that could be mediating the interaction between MYCN and TauA, either in the purified complex (Figure 1C), or in the purified protein used to reconstitute the complex (Figure S1A & S1B).

      (2) The authors indicate in Figure 2 that TF3C has essentially no effect on MYCN-dependent gene expression and/or transcription elongation. Yet a previous study (PMID: 29262328) associated with several of the same authors concluded that TF3C positively affects transcription elongation. The authors make no attempt to reconcile these disparate results and need to clarify this point.

      We agree that the data in this manuscript do not support a role in transcription elongation. This point was also raised by Reviewer 3. Comparing our new results to the previously published data, we can summarize that the data sets in the two studies show three key results: first, the traveling ratio of RNAPII changes upon induction of MYCN; second, RNAPII decreases at the transcription start site; and third, it increases towards the transcription end site.

      We agree that in the previous study we linked the traveling ratio directly to elongation. However, performing ChIP-seq with different RNAPII antibodies showed us that, for example, RNAPII (N20), which is unfortunately discontinued, gives different results compared to RNAPII (A10). Our new results using the RNAPII (8WG16) antibody show that the traveling ratio does not only reflect transcription elongation but also reflects the removal of RNAPII from chromatin at the start site.
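
      To illustrate why a changed traveling ratio is ambiguous, consider the following minimal sketch. It assumes the common definition of the traveling ratio as promoter-proximal read density divided by gene-body read density; the window lengths and read counts are invented for illustration and are not taken from our data.

```python
def traveling_ratio(promoter_reads, promoter_len, body_reads, body_len):
    """Promoter-proximal read density divided by gene-body read density
    (assumed definition; window lengths are in base pairs)."""
    promoter_density = promoter_reads / promoter_len
    body_density = body_reads / body_len
    return promoter_density / body_density

# Hypothetical gene: 300 bp promoter window, 1000 bp gene-body window.
baseline = traveling_ratio(600, 300, 1000, 1000)

# Scenario A: more polymerase in the gene body (more elongation).
more_elongation = traveling_ratio(600, 300, 2000, 1000)

# Scenario B: polymerase lost from the promoter, gene body unchanged.
promoter_loss = traveling_ratio(300, 300, 1000, 1000)

print(baseline, more_elongation, promoter_loss)
```

      In this toy case both scenarios halve the ratio, so the ratio alone cannot distinguish increased elongation from loss of RNAPII at the start site.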

      (3) Figures 2B and C show that unphosphorylated pol2 is TSS-centered, and Ser2-P pol2 occupation is centered beyond the TES. From this data, however, the reader can't tell how much of the phospho-Ser2- pol2 is centered on the TSS. The authors should include overall plots over TSS and TES, and also perhaps the gene-body to allow a better comparison for TSS and TES plotted for both antibodies over the collected gene sets.

      We focused on the TSS for unphosphorylated RNAPII and the TES for pSer2-RNAPII, as these are the regions with specific enrichment of the respective antibodies. As requested for comparison, we now include metagenes showing TSS, gene-body, and TES for both antibodies as new Figure S2A and B. Additionally, we included density plots for unphosphorylated RNAPII at the TES as well as for pSer2-RNAPII at the TSS as a Figure for the Reviewers (Figure 1).

      (4) The authors see more TF3C at promoters in cells with MYCN (Figure 2F). What are the levels of TF3C in the absence and presence of MYCN?

      As shown in the immunoblot in Figure S1E, TFIIIC5 levels do not change upon induction of MYCN. We therefore think that MYCN helps to recruit TFIIIC5 to RNAPII promoter sites. This is also in accordance with what we previously reported 1.

      (5) The finding that TF3C is increased at TSS (Figure 2F) doesn't necessarily indicate that 1) MYCN is recruiting TF3C there, and 2) that this is due to the phosphorylation status of pol2. It could mean many other things. The logic of conflating these 3 points based on the data shown is questionable.

      We showed previously that knock-down of MYCN affects TFIIIC5 binding, showing that MYCN is required for binding of TFIIIC5 at promoter sites 1.

      Additionally, we included data with DRB-treated cells (Figure 2F); DRB prevents RNAPII loading by preventing downstream de novo elongation. Those data show that TFIIIC5 binding at the TSS is massively increased upon induction of MYCN and additionally upon treatment with DRB. Conversely, we observed that the major effect of TFIIIC knock-down was on non-phosphorylated RNAPII at the TSS upon MYCN induction (Figure 2B). Therefore, we would argue that our assumption fits well with the data presented in the manuscript.

      (6) Figure 3A doesn't add much to the paper, as it is overplotted and no relationship is clear, except that Pol2 and MYCN occupy many of the same sites. Perhaps a less complex or different type of plot would allow the interactions to be better visible.

      We agree with the comment and since in another comment we were asked to show the same window for all shown Hi-ChIP data plots, we changed Figure 3A.

      (7) That depletion of TF3C leads to increased promoter hubs may or may not have anything to do with its association with MYCN (Figure 4E). This could be a direct consequence of its known structural function in cohesin complexes, and the MYCN changes as a secondary consequence of this (also see point 4, above).

      As shown in Büchel et al. (2017) 1, MYCN is needed to recruit RAD21, and depletion of RAD21 has no impact on the recruitment of MYCN. Since RAD21 is part of the cohesin complex, we would exclude that the MYCN changes are a secondary consequence.

      (8) Depletion of TF3C5 results in a loss of EXOSC5 (exosome) at TSS in the presence and absence of MYCN (Figure 5B). As TF3C5 is a cohesin, could this simply be a consequence of genomic structure changes?

      We agree that the observed changes in EXOSC5 can be due to depletion of TFIIIC5. TFIIIC has been shown to recruit cohesin 1 and condensin complexes 2, as well as to induce chromatin architectural changes 3. However, MYCN is needed to recruit TFIIIC, and depletion of TFIIIC had no impact on MYCN recruitment 1. Furthermore, MYCN has been shown to recruit the exosome 4. Therefore, we would argue that MYCN can play a role either directly or through chromatin architectural changes.

      (9) The authors suggest that RNA dynamics are affected by changes in exosome function (RNA degradation, etc). What effect, if any does TF3C depletion have on the overall gene expression profile?

      We show in the manuscript that TFIIIC depletion in unperturbed cells has no effect on the global gene expression profile in the time frame analyzed (Figure 2E and S2B).

      Reviewer #2 (Public Review):

      (1) Dynamic inferences are made without kinetic experiments.

      While we agree that we did not collect kinetic data to study the dynamics of RNA polymerase, we would argue that the integration of our different data sets makes it possible to draw dynamic inferences. The transcription cycle and its sequential steps have been well described. In this sense, we use the non-phosphorylated RNAPII data, which reflect the stage between RNAPII recruitment and initiation, and the RNAPII-pSer2 data, which reflect the transition from pause release to elongation, to draw conclusions on the dynamics. Likewise, we also made use of our previously published datasets.

      Reviewer #2 (Recommendations For The Authors):  

      (1) A number of changes are reported in hub size, expression, etc. upon treatment with tamoxifen to activate MYCN-ER. But MYC is already present in the SHEP cells, so why doesn't MYC support these same phenomena? It would seem that either the ability to cooperate with TFIIIC to clear non-productive polymerase complexes from promoters is particular to MYCN, or else it reflects a quantitative increase in total MYC proteins due to the entry of MYCN-ER into the nucleus with tamoxifen. The authors should address or discuss this issue.

      It could be that protein levels are the limiting factor that distinguishes the observed effects of MYC and MYCN in this system. This interpretation would be in accordance with the results of Lorenzin et al. 5, who reported that different levels of MYC had different targets based on the affinity to E-boxes and protein level. A similar relationship between MYC levels and function was also reported regarding SPT5 6. Those high protein levels mimic what is found in certain tumors, in contrast to physiological levels. In this sense, the observed differences can also be between physiological and oncogenic levels of MYC proteins.

      On the other hand, both a core MYC signature and isoform-specific signatures of target genes have been described. MYCN is described to be involved in gene expression during the S-phase of the cell cycle 7. This suggests that there are differences between MYC and MYCN other than gene sets. The interaction with TFIIIC appears to be one of these differences. We have found multiple TFIIIC subunits as part of the MYCN interactome, but the interaction of TFIIIC with MYC is weaker and we are uncertain how relevant it is 7,8. We show here that depletion of different subunits of the TFIIIC complex causes a MYCN-dependent growth defect (Figure 1E). Similarly, the nuclear exosome is a MYCN-specific dependency 4, and we show here that MYCN-dependent recruitment of the exosome requires TFIIIC5. We take this as an indication that there is an intrinsic difference between MYC and MYCN and that MYCN engages TFIIIC for this pathway.

      (2) Reciprocal to TFIIIC recruitment to MYCN-targets would be MYCN association with tRNA genes, 5S rRNA, and other RNAPIII genes. Does this happen and if so, is this association TFIIIC dependent? What happens to the expression of these genes?

      We did observe MYCN in interactions involving tRNA genes and other RNAPIII sites, such as SINE elements (Figure 4B, 4D, S3F, and S4B). No relevant number of 5S rRNA regions was involved in interactions, either because of the difficulty of properly mapping these repetitive regions or for biological reasons. In any case, none of those regions appeared to be specifically dependent on TFIIIC, as the overall number of interactions increased upon TFIIIC depletion regardless of the genomic annotation (Figure S4B). Regarding the expression of RNAPIII genes, the technical limitations of poly(A)-enrichment RNA-seq prevent us from analyzing it globally in an unbiased way. However, we addressed this point for tRNA expression in an earlier work 1 and found that tRNA levels do not change upon TFIIIC depletion. We think this is because tRNAs are stable transcripts and RNAPIII recycling can occur in a TFIIIC-independent manner 9. Likewise, we report no significant expression changes in RNAPII genes upon TFIIIC depletion in this work.

      (3) The authors show that TFIIIC depletion does not alter the RNA-expression profile; how do they account for this? Can they comment on "background" transcription that it would seem should be suppressed by TFIIIC-dependent removal of various hypofunctional polymerases?

      Since TFIIIC is important for the removal of non-functional RNAPII we would not expect changes to the gene expression profile upon depletion of TFIIIC in the time frame analyzed. Monitoring the elongating form of RNAPII by measuring pSer2 indeed shows us that transcription elongation is not affected.

      (4) Global changes in expression are difficult to assess with DESeq2. This hypernormalizing algorithm is not really suited to distinguish differential, but universal upregulation from some targets being truly upregulated while others are downregulated. The authors should comment.

      We acknowledge that DESeq2 relies on the assumption that gene-wise estimates of dispersion are generally unchanged among samples. We addressed this comment in two different ways, both of which are included in the Figure for the Reviewers (Figure 2). The first was to sequence the samples more deeply to avoid any bias created by random effects of lower coverage; the range of total reads increased from 6.8-9.3 to 16.5-20.7 million reads. The second was to compare the fold average bin dot plot for RNA-seq of SH-EP-MYCN-ER cells, showing mRNA expression normalized by control per bin, using DESeq2 normalization (Figure 2A), TMM normalization in edgeR (Figure 2B), and quantile normalization (Figure 2C). No major differences were found relative to the original data or between the different methods, but we updated Figure 2E in the manuscript to include the deeper sequencing dataset; we also adjusted it to show -/+ MYCN and log2-transformed it to make it more intuitive. Overall, this reinforces our original conclusion that gene expression remains largely unaffected by TFIIIC5 knockdown.
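
      The reviewer's concern about universal upregulation can be made concrete with a small sketch. The function below is a simplified re-implementation of the median-of-ratios idea behind DESeq2 size factors, written for illustration only; it is not the DESeq2 package, and the counts are invented. When every gene doubles in the treated sample, the size factors absorb the change and the normalized fold changes become zero.

```python
import math
from statistics import median

def size_factors(samples):
    """Median-of-ratios size factors (simplified sketch).
    samples: list of samples, each a list of strictly positive per-gene counts."""
    n_genes = len(samples[0])
    # Per-gene log geometric mean across samples.
    log_geo_mean = [
        sum(math.log(s[g]) for s in samples) / len(samples)
        for g in range(n_genes)
    ]
    factors = []
    for s in samples:
        log_ratios = [math.log(s[g]) - log_geo_mean[g] for g in range(n_genes)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: every gene is exactly doubled in the treated sample.
control = [100, 200, 50, 400, 80]
treated = [2 * c for c in control]

sf_control, sf_treated = size_factors([control, treated])
log2_fc = [
    math.log2((t / sf_treated) / (c / sf_control))
    for c, t in zip(control, treated)
]
print(sf_control, sf_treated)
print(log2_fc)
```

      This is why comparing several normalization strategies, as in the Figure for the Reviewers, is a useful check, although methods that share the assumption that most genes are unchanged would all be blind to a strictly uniform shift.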

      (5) On page 7, the authors claim that MYCN-ER increased Ser-2 can reflect MYCN-stimulated transcription elongation. In fact, without kinetic studies, this is not fully supported. Accumulation of Ser-2 RNAPII along a gene can reflect increased initiation of full-speed RNAPs or a pile-up of RNAPs slowing down. This should be resolved or qualified.

      While we agree that we did not collect kinetic data to study the dynamics of RNA polymerase, we would argue that the integration of our different data sets makes it possible to draw dynamic inferences. We showed on the one hand that pSer2 accumulates at the TES and on the other hand that the induction of MYCN-ER up-regulates gene expression, which proves productive transcription elongation.

      (6) pLHiChIP needs to be better described, the Mumbach reference is not sufficient.

      We have reformulated the pLHiChIP description in the Methods section and hope that this now provides a better description of the method.

      (7) Can the authors recheck all the labels in Figure 2D? I believe there is an error involving + or - MYCN.

      We carefully rechecked all the labels in Figure 2 and they were correct as they were. We understand the confusion that comparing Figure 2D and Figure 2E may have created. To avoid confusion, we updated Figure 2E to show the same direction as Figure 2D. We also log2-transformed the y-axis of Figure 2E to foster a more intuitive reading.

      (8) Why are there different scales for the regions of chromosome 17 shown in Figures 3 and 4? It would be easier to compare if the examples were all shown at the same scale (about 2 MB is shown in another Figure).

      We now show the same region of chromosome 17 in Figure 3 and 4.

      Reviewer #3 (Public Review):

      (1) The connection between the three major findings presented in this study regarding the role of TFIIIC in the regulation of MYCN function remains unclear. Specifically, how the TFIIIC-dependent restriction of MYCN localization to promoter hubs enhances the association of factors involved in nascent RNA degradation to prevent the accumulation of inactive RNA polymerase II at promoters is not apparent. As they are currently presented, these findings appear as independent observations. Cross-comparison of the different datasets obtained may provide some insight into addressing this question.

      We previously observed that TFIIIC does not affect MYCN recruitment, while MYCN affects TFIIIC binding 1. Moreover, our group reported that MYCN recruits the exosome 4 and BRCA1 to promoter-proximal regions 10 to clear out non-functional RNAPII. We now report that MYCN-TFIIIC complexes exclude non-functional RNAPII. However, MYCN-active promoters within hubs have more RNAPII and more transcription than MYCN-active promoters outside hubs. Furthermore, TFIIIC binding occurs upstream of BRCA1 and exosome recruitment, as depletion of TFIIIC decreases the recruitment of both factors. Therefore, we argue that TFIIIC is required for the proper function of those MYCN-active promoter hubs.

      (2) Another concern involves the disparities in RNA polymerase II ChIP-seq results between this study and earlier ones conducted by the same group. In Figure 2, the authors demonstrate that activation of MYCN results in a reduction of non-phosphorylated RNA polymerase II across all expressed genes. This discovery contradicts prior findings obtained using the same methodology, where it was concluded that the expression of MYCN had no significant effect on the chromatin association of hypo-phosphorylated RNA polymerase II (Buchel et al, 2017). In this regard, the choice of the 8WG16 antibody raises concern, as fluctuations in the signal may be attributed to changes in the phosphorylation levels of the C-terminal domain. It remains unclear why the authors decided against using antibodies targeting the N-terminal domain of RNA polymerase II, which are unaffected by phosphorylation and consistently demonstrated a significant signal reduction upon MYCN activation in their previous studies (Buchel et al, 2017) (Herold et al, 2019). Similarly, the authors previously proposed that depletion of TFIIIC5 abrogates the MYCN-dependent increase of Ser2-phosphorylated RNA polymerase II (Buchel et al, 2017), whereas they now show that it has no obvious impact. These aspects need clarification.

      We politely disagree that our findings contradict each other. Comparing our new results to the previously published data, we can summarize that the data sets in the two studies show three key results: first, the traveling ratio of RNAPII changes upon induction of MYCN; second, RNAPII decreases at the transcription start site; and third, it increases towards the transcription end site.

      We agree that in the previous study we linked the traveling ratio directly to elongation. However, performing ChIP-seq with different RNAPII antibodies showed us that, for example, RNAPII (N20), which is unfortunately discontinued, gives different results compared to RNAPII (A10). Our new results using the RNAPII (8WG16) antibody show that the traveling ratio does not only reflect transcription elongation but also reflects the removal of RNAPII from chromatin at the start site.

      In the previous study we only performed manual ChIP experiments for RNAPII (8WG16) and pSer2. Now we did a global analysis which is more meaningful and is also reflected in the RNA sequencing data.

      (3) Finally, the varied techniques employed to explore the role of TFIIIC in MYCN-dependent recruitment of nascent RNA degradation factors make it challenging to draw definitive conclusions about which factor is affected and which one is not. While conducting ChIP-seq experiments for all factors may be beyond the scope of this manuscript, incorporating proximity ligation assays (PLA) or ChIP-qPCR assays with each factor would have enabled a more direct and comprehensive comparison.

      We understand the criticism that we are comparing different assays. We have performed PLAs with different antibodies. Since the controls of the PLAs were not sufficient for us, we refrain from using them. ChIP-qPCR experiments are much more challenging to do side by side compared to PLAs, which is why we decided against looking at all factors with this method.

      Recommendations For The Authors:

      Reviewer #3 (Recommendations For The Authors):

      (1) Figure 2: Why did the authors choose the 8WG16 antibody? Does TFIIIC5 depletion suppress the MYCN-dependent reduction of total RNA polymerase II binding to promoters that they consistently showed in previous studies? Given that phosphorylation of the CTD impacts 8WG16 recognition, including Ser5-phosphorylated RNA polymerase II ChIP-seq experiments might clarify this issue.

      We used the RNAPII (8WG16) antibody specifically to map non-phosphorylated RNAPII, which shows us the binding of non-functional RNAPII.

      (2) Figures 3 and 4: As it stands, the manuscript does not convincingly establish a functional connection between the results in Figures 2, 3, and 4 or elucidate potential mechanisms. Are changes in RNA polymerase II levels upon MYCN activation more pronounced at promoters located at MYCN hubs? Do changes in MYCN-enriched chromatin contacts upon TFIIIC5 depletion somehow correlate with alterations in RNA polymerase II levels? Performing similar cross-comparisons as in Figure 3C may help address this issue. Furthermore, it is not clear how the authors concluded that MYCN/TFIIIC5-bound genes are not part of these so-called promoter hubs.

      In Figure 3C we show that RNAPII levels are more pronounced upon MYCN activation at promoters located at MYCN hubs. Additionally, we show density plots of non-phosphorylated RNAPII ChIP-seq at the TSS and RNAPII-pSer2 ChIP-seq at the TES for promoters with MYCN interactions in the Figure for the Reviewers (Figure 3). Apart from the level of binding, we found no difference compared to the global analysis of all expressed genes shown in Figure 2B and Figure 2C. This is in line with the high expression of the genes in MYCN interactions observed in Figure 3C.

      The changes observed in Figures 2B and 2C are global and do include the promoters with MYCN interactions. At the same time, a higher number of replicates would be required to statistically distinguish the differences in MYCN interactions between TFIIIC5 presence and depletion. We acknowledge this limitation, and we therefore refrain from any attempt towards this end. We base our conclusions on the other parts of the manuscript and on our previous studies that show that MYCN recruits TFIIIC, BRCA1, and the exosome to promoter-proximal regions 1,4,10.

      (3) Figure 5: According to the PLA results, activation of MYCN could enhance RNA polymerase II-NELFE interaction in a TFIIIC5-dependent manner. Considering the raised issues regarding the use of the 8WG16 antibody, this result might be of relevance.

      Nevertheless, PLA does not seem to be the optimal technique to address these questions, and I would rather suggest performing ChIP-qPCR experiments for all the factors to be compared. Finally, do the authors conclude that the TFIIIC5 effect on MYCN-dependent changes in RNA polymerase II depends upon the recruitment of EXOSC5 and BRCA1? If so, it would be interesting to determine whether depletion of these factors phenocopies the effects observed with TFIIIC5.

      We understand the criticism that we are comparing different assays. We have performed PLAs with different antibodies. Since the controls of the PLAs were not sufficient for us, we refrain from using them.

      (4) In Figure S2 the labels should be EtOH, 4-OHT, and Input.

      We changed this accordingly.

      (5) On page 7, the sentence "We have shown previously that TFIIIC5 depletion does not cause significant changes in expression of multiple tRNA genes that are transcribed by RNAPIII (Buchel et al., 2017)" appears to lack a connection.

      We agree with the reviewer and we deleted this sentence from the manuscript.

      Author response image 1.

      (A) Density plot of ChIP-Rx signal for non-phosphorylated RNAPII. Data show mean (line) ± standard error of the mean (SEM indicated by the shade) of different gene sets based on an RNA-seq of SH-EP-MYCN-ER cells ± 4-OHT. The y-axis shows the number of spike-in normalized reads and it is centered to the TES ± 2 kb. N = number of genes in the gene set defined in the methods. (B) Density plot of ChIP-Rx signal for RNAPII pSer2 as described for panel A. The signal is centered to the TSS ± 2 kb.

      Author response image 2.

      Bin dot plot for RNA-seq of SH-EP-MYCN-ER cells showing mRNA expression normalized by control per bin, comparing the fold average using DESeq2 normalization (A), TMM normalization in edgeR (B), and quantile normalization (C).

      Author response image 3.

      Average density plot of ChIP-Rx signal for non-phosphorylated RNAPII (A) or RNAPII pSer2 (B) at promoters with MYCN interactions.

      References

      (1) Büchel, G., Carstensen, A., Mak, K.-Y., Roeschert, I., Leen, E., Sumara, O., Hofstetter, J., Herold, S., Kalb, J., and Baluapuri, A. (2017). Association with Aurora-A controls N-MYC-dependent promoter escape and pause release of RNA polymerase II during the cell cycle. Cell Rep 21, 3483-3497.

      (2) Yuen, K.C., Slaughter, B.D., and Gerton, J.L. (2017). Condensin II is anchored by TFIIIC and H3K4me3 in the mammalian genome and supports the expression of active dense gene clusters. Sci Adv 3, e1700191. 10.1126/sciadv.1700191.

      (3) Ferrari, R., de Llobet Cucalon, L.I., Di Vona, C., Le Dilly, F., Vidal, E., Lioutas, A., Oliete, J.Q., Jochem, L., Cutts, E., Dieci, G., et al. (2020). TFIIIC Binding to Alu Elements Controls Gene Expression via Chromatin Looping and Histone Acetylation. Mol Cell 77, 475-487 e411. 10.1016/j.molcel.2019.10.020.

      (4) Papadopoulos, D., Solvie, D., Baluapuri, A., Endres, T., Ha, S.A., Herold, S., Kalb, J., Giansanti, C., Schulein-Volk, C., Ade, C.P., et al. (2021). MYCN recruits the nuclear exosome complex to RNA polymerase II to prevent transcription-replication conflicts. Mol Cell. 10.1016/j.molcel.2021.11.002.

      (5) Lorenzin, F., Benary, U., Baluapuri, A., Walz, S., Jung, L.A., von Eyss, B., Kisker, C., Wolf, J., Eilers, M., and Wolf, E. (2016). Different promoter affinities account for specificity in MYC-dependent gene regulation. Elife 5. 10.7554/eLife.15161.

      (6) Baluapuri, A., Hofstetter, J., Dudvarski Stankovic, N., Endres, T., Bhandare, P., Vos, S.M., Adhikari, B., Schwarz, J.D., Narain, A., Vogt, M., et al. (2019). MYC Recruits SPT5 to RNA Polymerase II to Promote Processive Transcription Elongation. Mol Cell 74, 674-687 e611. 10.1016/j.molcel.2019.02.031.

      (7) Baluapuri, A., Wolf, E., and Eilers, M. (2020). Target gene-independent functions of MYC oncoproteins. Nat Rev Mol Cell Biol. 10.1038/s41580-020-0215-2.

      (8) Koch, H.B., Zhang, R., Verdoodt, B., Bailey, A., Zhang, C.D., Yates, J.R., 3rd, Menssen, A., and Hermeking, H. (2007). Large-scale identification of c-MYC-associated proteins using a combined TAP/MudPIT approach. Cell Cycle 6, 205-217. 10.4161/cc.6.2.3742.

      (9) Ferrari, R., Rivetti, C., Acker, J., and Dieci, G. (2004). Distinct roles of transcription factors TFIIIB and TFIIIC in RNA polymerase III transcription reinitiation. Proc Natl Acad Sci U S A 101, 13442-13447. 10.1073/pnas.0403851101.

      (10) Herold, S., Kalb, J., Büchel, G., Ade, C.P., Baluapuri, A., Xu, J., Koster, J., Solvie, D., Carstensen, A., and Klotz, C. (2019). Recruitment of BRCA1 limits MYCN-driven accumulation of stalled RNA polymerase. Nature 567, 545-549.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The manuscript by Dubicka and co-workers on calcification in miliolid foraminifera presents an interesting piece of work. The study uses confocal and electron microscopy to show that the traditional picture of calcification in porcelaneous foraminifera is incorrect.

      Strengths:

      The authors present high-quality images and an original approach to a relatively solid (so I thought) model of calcification.

      Weaknesses:

      There are several major shortcomings. Despite the interesting subject and the wonderful images, the conclusions of this manuscript are simply not supported at all by the results. The fluorescent images may not have any relation to the process of calcification and should therefore not be part of this manuscript. The SEM images, however, do point to an outdated idea of miliolid calcification. I think the manuscript would be much stronger with the focus on the SEM images and with the speculation of the physiological processes greatly reduced.

      We agree that the fluorescence studies presented in the paper are not, by themselves, unequivocal proof of the calcification model utilised by the studied Miliolida species. However, the fluorescence data combined with the SEM studies, especially the overlap of the elements that show autofluorescence upon excitation at 405 nm (emission 420–480 nm) with acidic vesicles marked by pH-sensitive LysoGlow84, may hint at ACC-bearing vesicles.

      We will tone down the physiological interpretation based on the fluorescence studies in the revised version of the manuscript.

      Nevertheless, we think that our fluorescence live-imaging experiments provide important observations on Miliolida, which are scarce in the existing literature, and are therefore worth presenting, as they might be very helpful for a better understanding of the full calcification model in the future.

      Reviewer #2 (Public Review):

      Summary:

      Dubicka et al. in their paper entitled "Biocalcification in porcelaneous foraminifera" suggest that in contrast to the traditionally claimed two different modes of test calcification by rotaliid and porcelaneous miliolid foraminifera, both groups produce calcareous tests via the intravesicular mineral precursors (Mg-rich amorphous calcium carbonate). These precursors are proposed to be supplied by endocytosed seawater and deposited in situ as mesocrystals formed at the site of new wall formation within the organic matrix. The authors did not observe the calcification of the needles within the transported vesicles, which challenges the previous model of miliolid mineralization. Although the authors argue that these two groups of foraminifera utilize the same calcification mechanism, they also suggest that these calcification pathways evolved independently in the Paleozoic.

      We do not argue that Miliolida and Rotaliida utilize exactly the same calcification mechanism, but rather that both groups use less divergent crystallization pathways, in which mesocrystalline chamber walls are created by accumulating and assembling particles of a pre-formed liquid amorphous mineral phase.

      Strengths:

      The authors document various unknown aspects of calcification of Pseudolachlanella eburnea and elucidate some poorly explained phenomena (e.g., translucent properties of the freshly formed test); however, there are several problematic observations/interpretations which in my opinion should be carefully addressed.

      Weaknesses:

      (1) The authors (line 122) suggest that "characteristic autofluorescence indicates the carbonate content of the vesicles (Fig. S2), which are considered to be Mg-ACCs (amorphous MgCaCO3) (Fig. 2, Movies S4 and S5)". Figure S2 which the authors refer to shows only broken sections of organic sheath at different stages of mineralization. Movie S4 shows that only in a few regions some vesicles exhibit red autofluorescence interpreted as Mg-ACC (S5 is missing but probably the authors were referring to S3). In their previous paper (Dubicka et al 2023: Heliyon), the authors used exactly the same methodology to suggest that these are intracellularly formed Mg-rich amorphous calcium carbonate particles that transform into a stable mineral phase in rotaliid Amphistegina lessonii. However, in Figure 1D (Dubicka et al 2023) the apparently carbonate-loaded vesicles show the same red autofluorescence as the test, whereas in their current paper, no evidence of autofluorescence of Mg-ACC grains accumulated within the "gel-like" organic matrix is given. The S3 and S4 movies show circulation of various fluorescing components, but no initial phase of test formation is observable (numerous mineral grains embedded within the organic matrix - Figures 3A and B - should be clearly observed also as autofluorescence of the whole layer). Thus the crucial argument supporting the calcification model (Figure 5) is missing.

      It is correct that we did not observe the initial phase of test formation in vivo. Therefore, it is not our crucial argument supporting novel components of the new calcification model. We suspect that vesicles preparing and transporting Mg-ACC are produced well before their docking and deposition into the new wall, because such seawater vesicles were observed between the chamber formation stages (Goleń and Tyszka, 2024, personal communication based on independent experiments on a closely related miliolid taxon). This means that our in vivo experiments most likely represent a long, dynamic stage of vesicle formation via seawater endocytosis and vesicle modification (incl. Mg-ACC formation) before the stage of exocytosis during the new chamber formation. Our crucial arguments supporting the calcification model come from the SEM imaging of the specimens fixed during chamber formation, as well as from the transparency of the new chamber wall during its progressive calcification.

      There is no support for the following interpretation (lines 199-203) "The existence of intracellular, vesicular intermediate amorphous phase (Mg-ACC pools), which supply successive doses of carbonate material to shell production, was supported by autofluorescence (excitation at 405 nm; Fig. 2; Movies S3 and S4; see Dubicka et al., 2023) and a high content of Ca and Mg quantified from the area of cytoplasm by SEM-EDS analysis (Fig. S6)."

      We used the 405 nm laser line and multiphoton excitation to detect ACCs. These wavelengths (partly) permeate the shell to excite ACC autofluorescence. The autofluorescence of the shells is present as well but not clearly visible in Movie S4, as the fluorescence of ACCs is stronger. This may be related to the plane/section of the cell which is shown. The laser permeates the shell above the ACCs (short distance), but to excite the shell CaCO3 around the foraminifera in the same three-dimensional section where ACCs are shown, the light must pass a thick CaCO3 area due to the three-dimensional structure of the foraminiferan shell. Therefore, the laser light intensity is reduced. In the revised version, a movie/image with reduced threshold is shown.

      Author response image 1.

      Autofluorescence image of studied Miliolida species (exc. 405 nm) showing algal chlorophyll (blue) and CaCO3 (red), both ACC and calcite shell.

      It would be very convenient if it was possible to visualize ACC by illumination with a blacklight, but there are very many organic molecules that have an autofluorescence excited by ~405 nm. One of the examples is NADH (Lee et al., 2015. Kor J Physiol Pharmac 19(4): 373-382), an omnipresent molecule in any cell (couldn't copy the appropriate picture here, but the reference has a figure with the em/exc spectra).

      The paper of Lee et al. (2015) shows that the excitation spectrum of NADH ends close to 400 nm. This means that NADH is not excitable, or only very weakly excitable, at 405 nm, which is the excitation laser line we used.

      (2) The authors suggest that "no organic matter was detected between the needles of the porcelain structures (Figures 3E; 3E; S4C, and S5A)". Such a suggestion, which is highly unusual considering that biogenic minerals almost by definition contain various organic components, was made based only on FE-SEM observation. The authors should either provide clearcut evidence of the lack of organic matter (unlikely) or may suggest that intense calcium carbonate precipitation within organic matrix gel ultimately results in a decrease of the amount of the organic phase (but not its complete elimination), just as pure calcium carbonate crystals are separated from the remaining liquid with impurities ("mother liquor"). On the other hand, if (249-250) "organic matrix involved in the biomineralization of foraminiferal shells may contain collagen-like networks", such "laminar" organization of the organic matrix may partly explain the arrangement of carbonate fibers parallel to the surface as observed in Fig. 3E1.

      We agree with the reviewer that biogenic minerals should by definition contain some organic components. We only wrote that "no organic matter was detected between the needles of the porcelain structures", which means that we did not detect any organic structures in our FE-SEM observations. We will rephrase this part of the text to avoid further confusion.

      (3) The authors' observations indeed do not show the formation of individual skeletal crystallites within intracellular vesicles; however, they also do not explain what the structure of individual skeletal crystallites is or how they are formed. Especially, what are the structures observed in polarized light (and interpreted as calcite crystallites) by De Nooijer et al. 2009? The authors' explanation of the process (lines 213-216) is not particularly convincing: "we suspect that the OM was removed from the test wall and recycled by the cell itself".

      Thank you for this comment. We will do our best to supplement our explanations. We are aware of the structures observed in polarized light by De Nooijer et al. (2009). However, Goleń et al. (2022, Protist; + 2 other citations) showed that organic polymers may also exhibit light polarization. Additional experimental studies are needed to separate these types of polarization. We will try to investigate this issue in our future research.

      (4) The following passage (lines 296-304), which deals with the concept of mesocrystals, is not supported by the authors' methodology or observations. The authors state that miliolid needles "assembled with calcite nanoparticles, are unique examples of biogenic mesocrystals (see Cölfen and Antonietti, 2005), forming distinct geometric shapes limited by planar crystalline faces"; later in the same passage the authors say that "mesocrystals are common biogenic components in the skeletons of marine organisms" (are they thus unique or are they common?). It is my suggestion to completely eliminate this concept here until various crystallographic details of the miliolid test formation are well documented.

      Our intention was to express that mesocrystals are common biogenic components in the skeletons of marine organisms, whereas miliolid needles that form distinct geometric shapes limited by planar crystalline faces are unique.

      Reviewer #1 (Recommendations For The Authors):

      Below, I have summarized my main criticisms.

      (1) The movies S1-S4 do not indicate what is described. The videos show indeed seawater (S1), cell membranes (S2), and autofluorescence and acidic vesicles (S3 and S4). The presence of all these intracellular structures is not surprising: any eukaryotic cell will have those. The authors, however, claim that they participate in the process of calcification, which is simply not shown. One of the main arguments seems the presence of 'carbonate pools', in the caption these are even claimed to be 'Mg-ACC pools', but this is by no means revealed by an excitation of 405nm/ emission between 420 and 490 nm. It would be very convenient if it was possible to visualize ACC by illumination with a blacklight, but there are very many organic molecules that have an autofluorescence excited by ~405 nm. One of the examples is NADH (Lee et al., 2015. Kor J Physiol Pharmac 19(4): 373-382), an omnipresent molecule in any cell (couldn't copy the appropriate picture here, but the reference has a figure with the em/exc spectra).

      The paper of Lee et al. (2015) shows that the excitation spectrum of NADH ends close to 400 nm. This means that NADH is not excitable, or only very weakly excitable, at 405 nm, which is the excitation laser line we used.

      The fluorescence by this excitation/ emission couple unlikely indicates the vesicles in which these foraminifera calcify. Therefore, most of the interpretation of the authors on what happens with the calcitic needles is not based on results but remains pure speculation.

      Autofluorescence upon excitation at 405 nm (emission 420–480 nm) is typical of CaCO3, both biocalcite and amorphous calcium carbonate, as was shown by laboratory synthesis of amorphous calcium carbonate (Dubicka et al., in preparation).

      (2) The results mention 'granules', which are the supposed Mg-ACC-containing vesicles, but the movies simply don't show any granules. Only fluorescence. Again, the results show a lot of vesicles with autofluorescence, but these are not necessarily related to calcification. Proof could be supplied by showing that the same fluorescent vesicles are 'used up' when the specimens under observation are making a new chamber, but until that is done, the fate of all these vesicles remains uncertain and once more, may not be involved in calcification at all.

      We suspect that vesicles preparing and transporting Mg-ACC are produced well before their docking and deposition into the new wall, because such seawater vesicles were observed between the chamber formation stages (Goleń and Tyszka, 2024, personal communication based on independent experiments on a closely related miliolid taxon). This means that our in vivo experiments most likely represent a long, dynamic stage of vesicle formation via seawater endocytosis and vesicle modification (incl. Mg-ACC formation) before the stage of exocytosis during the new chamber formation. Our crucial arguments supporting the calcification model come from the SEM imaging of the specimens fixed during chamber formation, as well as from the transparency of the new chamber wall during its progressive calcification.

      (3) The Methods are unclear. How long were the foraminifers kept before being placed under the microscope? Were they fed with anything? This is important since the chlorophyll should not be from any food source. I didn't know that this foraminiferal species has photosynthetic symbionts: genera like Quinqueloculina don't. Is there any reference for this? Normally, I wouldn't care that much, but the authors find the presence of (facultative) symbionts important (lines 305-336). I am a bit suspicious about this since the only evidence for the presence of photosynthetic symbionts is because of the autofluorescence. As the authors said, commonly these miliolid species are regarded as symbiont-barren, so additional proof for these symbionts is necessary.

      We agree that additional proof is needed for the presence of photosynthetic symbionts. We rephrased the manuscript accordingly.

      (4) It is also unclear (Methods) at what stage the miliolids were photographed (Figure 3). How did chamber formation proceed, what was the timing of the photographs, etc. These pictures are to me the most interesting finding of this study, but need to be described much better.

      All individuals of living foraminifera were fixed at the overall stage of chamber formation. However, every individual presents a complete set of successive steps (substages) of chamber wall calcification fixed at once. Fig. 3A and B present nearly the most proximal (youngest) part of the new chamber with a thick wall of calcite nanograins within a gel-like organic matrix. Fig. 3C and D present a slightly more distal (intermediate) part of the calcified chamber. Fig. 3E shows the most distal part of the new chamber. This part is anchored to the older, underlying solid calcified chamber (not shown in this figure). All these steps are observed synchronously but represent gradual, successive stages of calcification. The main text and Figs 4 and 5 explain this phenomenon in detail.

      There are many small issues with the text too. These include:

      Line 28/29: in many other groups, calcification is thought to be polyphyletic (e.g. sponges: Chombard et al., 1997. Biol Bull 193: 359-367).

      Corrected

      Line 29/30: there may be even more 'types of shells'. The first author has shown in earlier papers that nodosarids have a unique shell architecture. Spirillinids also seem to have their own way of calcification. It is unclear what is meant here by 'two contrasting models'.

      To date, only two models of foraminiferal calcification are known. Lagenida biocalcification has not been studied.

      Line 33: 'Both groups'? This paper only shows calcification in miliolids.

      However, here we refer to a previous study.

      Line 42: Perhaps, but there is no data on the pseudopodial network in this manuscript.

      We refer to the studies of Angell (1980).

      Line 43: Likely, but that is not what this manuscript is showing.

      Line 42-44: The authors should make a choice and be clear. The point of this paper is that miliolids and rotalids calcify in ways that are actually not as different as they seemed previously. Still, they are said to have different 'chamber formation modes'. If they are calcifying in a similar way (which I think is not necessarily supported by the results), isn't calcification in these groups like variations on the same theme? How does this relate to the independent origins of calcification within these two groups?

      Our intention is to show that Miliolida and Rotaliida utilize less divergent calcification pathways, following the recently discovered biomineralization principles.

      Line 49-51: is this a well-established distinction? If so, please add a reference. If not: what is fundamentally different between B and C? Does only the size of the intracellular vesicle matter?

      Rephrased

      Line 60: please include a reference for the intracellular calcification by coccolithophores.

      Added

      Line 67: this is wrong. It is the alignment of the needles at the surface that makes them all reflect light in the same way and gives the shells a porcelaneous appearance. A close-up of the miliolid's shell surface shows this arrangement. Underneath this layer, the orientation of the needles is more random.

      We referred to the papers of Johan Hohenegger.

      Line 114: how else?

      Line 114-116: I don't see the relevance here. If seawater is taken up, the vesicle containing this seawater has to have a membrane around it. By definition. The text here ('These vesicles') suggests that Calcein and FM1-43 were combined (which they easily could have), but the methods describe that they are used successively.

      Yes, we used two dyes separately.

      Lines 122-130: I think the interpretation of this autofluorescence signal is wrong. Even if it was true, these lines belong to the Discussion.

      This paragraph has been moved to the Discussion.

      Line 138: What are 'mobile clusters'? I don't see a relation between the location of the symbionts and the other vesicles (Figure 2).

      Line 147-148: How can an SEM image show the absence of organic matter?

      We meant the absence of the gel-like OM that is visible in the previous stages of chamber formation.

      Line 148: Should be 'Figs. 3E; 3E1; S4C'.

      Corrected

      Lines 143-150: this can be merged with the following paragraph.

      Done

      Lines 151-169: why is there no indication of the time? Figures 3 and 4 link the pictures in time to show the development of the growing chamber wall. However, neither here nor in the methods, is there any recording of the time after the beginning of chamber formation. Now, the images are linked (Figure 4) as if they were taken at regular intervals, but this is not documented.

      Lines 170-184: this should go to the Discussion.

      Done

      Line 193-195: this is likely, but not visible in Figure 1.

      It was visible by optical microscopy and described by Angell, 1980

      Line 199-201: I don't understand this: the fluorescent vesicles were not observed during chamber formation so any link between the SEM and CLSM scans remains pure speculation.

      Line 203-204: needed for what?

      For better documentation of Miliolid ACC-bearing granules

      Line 220: is this shown in any of the images? 

      Angell, 1980

      Line 230: It sounds nice, but I don't think a 'paradigm shift' is appropriate here. However interesting and important foraminiferal biomineralization is, the authors show that the crystals of miliolids are likely formed differently than previously thought. If this is a 'paradigm shift', then most scientific findings are.

      In our opinion, this is definitely a paradigm shift.

      Line 231: I don't think anyone suggested miliolids and coccolithophores share 'the same' pathway. They are shown (cocco's) and thought (miliolids) to secrete their calcite intracellularly.

      Changed to similar, intracellular

      Line 258: References should only be to peer-reviewed studies.

      Line 430: Burgers'

      Corrected

      Reviewer #2 (Recommendations For The Authors):

      Please separate clearly the results (observations) from the discussion (interpretations): various interpretational/commentary phrases should be removed from the Results section to Discussion e.g., lines 124-130, 131-135.

      Interpretations have been separated from the results, as suggested by the Reviewer.

      [line 49] " living cells have evolved three major skeleton crystallization pathways". I would rather say "organisms" not "cells" as the coordination of the calcification process in multicellular organisms clearly involves processes that are beyond the individual cell activity.

      Corrected

    1. Author response:

      The following is the authors’ response to the original reviews.

      Response to the Reviewer #1 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments. As expected, we revealed that the CCL17/CCL22–CCR4 axes play an important role in guiding Tregs to the atherosclerotic aorta. Interestingly, we also demonstrated that these axes are critical for Treg-dependent regulation of proinflammatory T cell responses in lymphoid tissues and atherosclerotic aortas, which is a previously unrecognized role for CCR4 in regulating inflammatory immune responses. However, the role of the CCL17/CCL22–CCR4 axes in regulating inflammatory immune responses and atherosclerosis has not been fully elucidated and further investigation is needed.

      Response to the Reviewer #2 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions. We isolated CD4<sup>+</sup>CD25<sup>+</sup> T cells and used them as Tregs in several experiments. As the reviewer pointed out, we realize that the CD4<sup>+</sup>CD25<sup>+</sup> T cell population contains some activated effector T cells. However, given the high expression of Foxp3, the most reliable Treg marker, in isolated CD4<sup>+</sup>CD25<sup>+</sup> T cells as determined by flow cytometry, we believe that our method for separating Tregs is acceptable.

      Regarding the role of Th17 cells in atherosclerosis, conflicting results have been reported. Therefore, it is unclear whether augmented Th17 cell immune responses contribute to accelerated atherosclerosis in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice.

      As the reviewer pointed out, it is important to consider the clinical relevance of our findings. We analyzed a public database to determine whether Ccr4 single nucleotide polymorphisms correlate with a higher incidence of atherosclerotic cardiovascular disease. However, no evidence supporting the clinical relevance of our findings was found.

      Response to the Reviewer #3 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions. In accordance with the reviewer’s suggestion, we described the detailed methods and carefully performed data analysis regarding flow cytometry, which would strengthen the conclusion of this study.

      We understand the importance of the reviewer’s point that CCR4 deficiency does not shift the Th1 cell/Treg balance toward Th1 cell responses in all lymphoid tissues. CCR4 deficiency promoted the accumulation of Th1 cells but did not affect the accumulation of Tregs in the atherosclerotic aorta, which shifted the Th1 cell/Treg balance toward Th1 cell responses. The frequencies of both Tregs and Th1 cells in peripheral lymphoid tissues were increased by CCR4 deficiency, while these CCR4-deficient Tregs exhibited impaired suppressive function. Given this, we speculate that CCR4 deficiency may shift the Th1 cell/Treg balance toward Th1 cell responses in peripheral lymphoid tissues. However, it is difficult to show this clearly. We revised the manuscript accordingly.

      Although the reviewer pointed out the possibility that modulation of the Th1 cell/Th17 cell balance might be responsible for the changes in aortic inflammatory cells in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, the role of Th17 cells in atherosclerosis remains controversial. However, we cannot completely exclude the possibility that modulation of the Th17 response is involved in accelerated atherosclerosis in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice.

      As a limitation of this study, the phenotypic heterogeneity and dynamics of aortic leukocytes could not be revealed by flow cytometric analysis. Single-cell proteomic and transcriptomic approaches would provide additional important information on various aortic cells including immune cells and vascular cells.

      Reviewer #1 (Recommendations for the authors):

      Issue (1) Ideally, CCR4 could be deleted on Foxp3+ cells and some staining on double positive Rorg+Foxp3+ done. On the other hand, whole-gene expression analysis of infiltrating Foxp3+ and effector cells could also be helpful. More challenging, it would be important to see whether those CCR4-specific Tregs can regulate infiltrating effector cells.

      As the reviewer suggested, single-cell proteomic and transcriptomic approaches would be helpful to reveal the phenotypic heterogeneity and dynamics of aortic leukocytes including Tregs. Also, the use of conditional knockout mice would reveal the precise role of CCR4-expressing Tregs in regulating aortic immune cell infiltration and atherosclerosis.

      Reviewer #2 (Recommendations for the authors):

      Minor Suggestions:

      Issue (1) In supplementary Figure 1, CCR4 expression would be better represented by dot plots rather than histograms.

      We revised Supplementary Figure 1A through 1C.

      Issue (2) The reduction in CD103 expression shown in Figure 2E at 8 weeks should be discussed.

      In Figure 2E, we found that the expression of CD103 in peripheral LN Tregs was slightly lower in 8-week-old Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice than in age-matched Apoe<sup>-/-</sup> mice, while there was no difference in its expression levels between 18-week-old Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice. In addition, there was no significant difference in the mRNA expression of this molecule in splenic Tregs between 8-week-old Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice. Based on the minor effect of CCR4 deficiency on CD103 expression in Tregs, reduced CD103 expression in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice does not seem to be an important change.

      Issue (3) The increased expression of CD86 by DCs should be discussed.

      The upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice might be explained by the data on a Treg-DC coculture experiment showing the impaired cell–cell contacts between CCR4-deficient Tregs and DCs. On the other hand, the expression of another important costimulatory molecule CD80 on DCs was not altered in these mice, which is not consistent with the data on the above coculture experiment. The reason why only CD86 expression on DCs was upregulated in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice remains unclear.

      Issue (4) In Figures 5F-H, using larger dots would enhance visibility.

      We revised the graphs in Figure 5F-H.

      Issue (5) In Figure 5I, since the data is normalized, a one-sample t-test is more appropriate.

      In accordance with the reviewer’s suggestion, we reconsidered the data analysis. Because there was a dramatic difference in the absolute number of Kaede-expressing Tregs accumulated in the aorta among experiments, we were worried that the statistical analysis of the combined data from multiple experiments might draw a wrong conclusion. We have decided to show the representative data from 3 independent experiments in Figure 5I.

      Issue (6) On page 11, line 256, the text mentions IL4 and IL10 being detected by cytokine array; however, the figures do not show these cytokines.

      We are afraid that the reviewer might have misunderstood the data. The cytokine levels of IL-4 and IL-10 could not be detected by cytokine array analysis. Accordingly, we carefully revised the text in the manuscript.

      Issue (7). On page 14, lines 326-330, the text should be revised for clarity.

      We revised the text in the manuscript.

      Issue (8) Several data are marked as "not shown"; some of this information is relevant and should be included in the supplementary figures.

      We showed the data on CCL17 and CCL22 expression in peripheral LNs in Supplementary Figure 2.

      Major Suggestions:

      Issue (1) FoxP3 expression should be evaluated post-isolation of CD4<sup>+</sup>CD25<sup>+</sup> T cells, and FoxP3- CD4<sup>+</sup>CD25<sup>+</sup> T cells should be characterized. Tregs could be more effectively isolated using FoxP3eGFP mice.

      After isolation of CD4<sup>+</sup>CD25<sup>+</sup> T cells (the purity was >95%), we examined Foxp3 expression by flow cytometry and found that most of these cells express Foxp3 (Supplementary Figure 10). Therefore, CD4<sup>+</sup>CD25<sup>+</sup> T cells without Foxp3 expression, which are considered contaminated effector T cells, are minor cells and would not substantially affect the results. Nonetheless, the use of Foxp3-eGFP mice would enable us to isolate Tregs more accurately.

      Issue (2) In Figure 3, it would be interesting to evaluate whether there are RORgt+Tbet+ (IL17+IFNg+) cells. These cells would be pathogenic, whereas RORgt+CD73+ cells would be non-pathogenic.

      We analyzed CD4<sup>+</sup> T cells producing both IL-17 and IFN-γ in the peripheral lymphoid tissues of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice. We found that this cell population was quite rare and that there was no significant difference in its proportion between the 2 groups, suggesting that this cell population makes at most a minor contribution to the atherosclerosis phenotype.

      Author response image 1.

      Issue (3) Different time points after adoptive cell transfer should be evaluated to confirm reduced migration to the atherosclerotic aorta.

      It would be interesting to evaluate Treg migration to the atherosclerotic aorta at different time points after Treg transfer. However, it seems difficult to accurately evaluate the migration of Tregs at later time points because they would proliferate in the aorta.

      Issue (4) The authors could evaluate whether Ccr4 SNPs correlate with an increased risk of atherosclerosis.

      As the reviewer pointed out, it is important to consider the clinical relevance of our findings. However, there is no evidence that Ccr4 single nucleotide polymorphisms correlate with a higher incidence of atherosclerotic cardiovascular disease.

      Issue (5) The authors could evaluate if the transfer of Apoe<sup>-/-</sup> Tregs rescues early atherosclerosis development in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice.

      To confirm whether transfer of CCR4-intact Tregs rescues the development of early atherosclerotic lesions in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, we injected Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice with saline or Tregs from Apoe<sup>-/-</sup> or Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice and analyzed the aortic root atherosclerotic lesions of recipient Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice. However, we found no significant difference in the aortic sinus plaque area among the 3 groups. We described this result in the results section and included the data in Supplementary Figure 8.

      Reviewer #3 (Recommendations for the authors):

      Analysis of TCD4<sup>+</sup> cell populations in different tissues:

      Issue (1) The description of flow cytometry analysis is incomplete and requires clarification. Please detail the use of controls to ensure correct analysis, including the following: i) cell viability; ii) staining controls to define positive and negative cells; iii) the gating strategy used to identify cell populations in each lymphoid tissue and aorta (please provide them as supplementary figures).

      As we thought that most of the prepared cells would be viable, we did not check their viability. Based on our previous work where various immune cells including Tregs, effector memory T cells, and helper T cell subsets were clearly detected, in this study we performed flow cytometric analysis of these immune cells without preparing negative controls stained with isotype control antibodies. The gating strategy of flow cytometric analysis of various immune cells in peripheral lymphoid tissues was reported in our previous report (J Am Heart Assoc 2024; 13: e031639). We provided the gating strategy of flow cytometric analysis of helper T cells and Tregs in the aorta in Supplementary Figure 9.

      Issue (2) The phenotype/differentiation markers used for analysing T CD4<sup>+</sup> cell subsets differ between lymphoid tissues and aortic lesions; might this influence results? If so, please comment on that.

      As the number of aortic T cells was far lower than that in peripheral lymphoid tissues, it seemed difficult to precisely detect aortic T cells, including various helper T cell subsets and Tregs, by intracellular cytokine staining. Therefore, we decided to analyze these cells by evaluating transcription factors specific for helper T cell subsets. The difference in the markers used for analyzing T cell subsets would not considerably influence the results.

      Issue (3) Considering my observations about the effect of CCR4 deficiency on the T CD4<sup>+</sup> differentiation profile in different tissues, I suggest comparing Th1/Treg and Th17/Treg ratios in all examined tissues. The modulation of the Th17/Th1 balance could shape inflammation.

      The Th1 cell/Treg balance is shifted toward Th1 cell responses in the atherosclerotic aorta of Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, while this balance would not be altered in the peripheral lymphoid tissues. It remains unclear whether CCR4 deficiency affects the Th17 cell/Treg ratio. We do not think that it is important to investigate the effect of CCR4 deficiency on the balance of Th17 cell/Treg or Th17 cell/Th1 cell because the role of Th17 cell responses in atherosclerosis remains controversial.

      Issue (4) Cell numbers of recovered Treg from para-aortic lymphoid nodes and aortic tissues might not allow Treg functional assays. Analysis by flow cytometry of biomarkers of Treg activation state would be more informative than by quantifying mRNA expression levels. In particular, TGFβ analysis at the mRNA level does not provide much more information about the suppressive activity of Treg, and even at the protein level, the recognition of the active form of this cytokine is required. Analysis of PD1 (for exhausted cell phenotype) and Treg apoptosis along the stages of atherosclerosis could also yield useful information.

      We performed flow cytometric analysis of the activation markers CTLA-4 and CD103, the cell exhaustion marker PD1, and apoptosis in Tregs in the para-aortic LNs of Apoe<sup>-/-</sup> or Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, and found no major differences in the expression levels of these molecules or the proportion of apoptotic cells between the 2 groups. These data are shown below.

      Author response image 2.

      Unfortunately, we failed to evaluate the activity of TGF-β in Tregs because an appropriate experimental method for precisely detecting its active form was unavailable.

      Issue (5) Regarding the interpretation of the results, I recommend being precise when drawing conclusions to avoid misunderstanding. A shift in the T CD4<sup>+</sup> response in lymphoid tissues might be interpreted as a modulation of the T cell differentiation process, which strongly depends on signals derived from DCs, which were not the focus of this study.

      There are two possible mechanisms for the altered CD4<sup>+</sup> T cell responses in peripheral lymphoid tissues, which include the modulation of their differentiation and proliferation processes. These processes are substantially regulated by DCs whose function could be favorably modulated by CCR4-expressing Tregs as described in the manuscript. Therefore, we think that the interactions between Tregs and DCs are crucial for shifting the CD4<sup>+</sup> T cell responses in peripheral lymphoid tissues, though it remains unclear which process plays a major role in regulating CD4<sup>+</sup> T cell polarization.

      Suppression studies:

      Issue (1) In vitro assays. According to the methodology suppression studies were performed using Treg collected from peripheral lymphoid nodes and spleen, but it is unclear whether these cells were analysed separately or as a pool (this was not clarified in the legend of Figure 5 either). Besides, be precise about which cells were used as antigen-presenting cells in the Treg suppression assay.

      In the in vitro Treg suppression assay, we used Tregs purified from peripheral lymph nodes and spleen as a pool, and splenocytes as antigen-presenting cells. We revised the manuscript accordingly.

      Issue (2) Obtaining CD4<sup>+</sup>CD25<sup>+</sup> and CD4<sup>+</sup>CD25<sup>-</sup>. The control of the purity and viability of cell preparations from CCR4 deficient and CCR4 sufficient Apoe<sup>-/-</sup> mice should be included as a supplementary material; these purified cells were used in in vitro suppressive assays and in vivo cell transfer experiments, being relevant information to guarantee results. Since this control was performed by flow cytometry, I wonder whether Foxp3 levels were also checked.

      We included the data on the purity and viability of CD4<sup>+</sup>CD25<sup>+</sup> Tregs and CD4<sup>+</sup>CD25<sup>-</sup> T cells from Apoe<sup>-/-</sup> or Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in Supplementary Figure 10. After the isolation of CD4<sup>+</sup>CD25<sup>+</sup> T cells, we examined Foxp3 expression by flow cytometry and found that most of these cells express Foxp3.

      Issue (3) For in vitro assays, IL-2, IL-10, and TGFβ measurement in culture supernatants could confirm and provide more information about Treg function.

      As both CD4<sup>+</sup>CD25<sup>+</sup> Tregs and CD4<sup>+</sup>CD25<sup>-</sup> T cells would produce various cytokines in the in vitro Treg suppression assay, it is difficult to determine which cells mainly produce the above cytokines. Therefore, measurement of these cytokines would not provide more information about Treg function.

      Issue (4) It would be interesting to assess whether CCR4-mediated DC-Treg interaction is equally important to regulate Th1 than Th17 and Th2 activation; this likely requires using different settings to favour each activation profile.

      Based on our findings, we speculate that CCR4 may play an important role in regulating not only Th1 cell responses but also Th2 and Th17 cell responses by maintaining the interactions between Tregs and DCs. However, it may not be meaningful to investigate the effect of CCR4 deficiency on these T cell responses because the roles of Th2 and Th17 cell responses in atherosclerosis remain controversial.

      Issue (5) The authors showed that the presence of Treg decreased CD80 and CD86 surface levels in DCs in vitro, remarking a lower capacity of Treg derived from CCR4-deficient mice (Figure 5B). However, the fact that CD86 on splenic CD11c+MHC-II+ DCs in 8-week-old Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice was significantly higher than in Apoe<sup>-/-</sup> was underestimated (Supplementary Figure 4). This data needs reconsideration as it might indicate an in vivo more permissive activation state of DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice than in Apoe<sup>-/-</sup> mice, explaining the augmented effector T cell response observed in these mice (Figure 2).

      Our finding of upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice could be explained by the data from a Treg-DC coculture experiment showing the impaired ability of CCR4-deficient Tregs to downregulate CD80 and CD86 expression on DCs. As the reviewer pointed out, our data may indicate a more permissive activation state of DCs and subsequent augmentation of effector T cell responses in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, which may result from impaired Treg suppressive function.

      Assays for chemokine levels and influence on T cell activation and traffic:

      Issue (1) Considering the findings described by Döring et al. (reference 24 in the paper), monitoring CCL22, CCL17, and CCL3 levels in the aorta and lymph nodes along atherosclerosis development would help in understanding when and how CCL17/CCL20-CCR4 might influence T cell activation and traffic. I wonder whether these chemokines were assayed by qPCR in lymphoid nodes and aorta from CCR4-deficient and sufficient Apoe<sup>-/-</sup> mice. The authors report that CCR8 (capable also of binding CCL17) was unaltered by CCR4 deficiency in splenic and para-aortic lymph nodes Treg from 8 and 18 weeks-old mice, respectively (Supplementary Figure 5 and 6), although a trend towards a high-level was observed for splenic Treg. It would be informative to evaluate CCR8 Treg levels along with atherosclerosis progress.

      As mRNA expression levels of chemokines do not necessarily reflect their protein expression levels, we did not analyze the mRNA expression of Ccl17 or Ccl22 by quantitative reverse transcription PCR. Instead, we evaluated the protein expression of CCL17 and CCL22 not only in the aorta but also in the peripheral lymph nodes of 18-week-old wild-type, Apoe<sup>-/-</sup>, and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice by immunohistochemistry. We found no marked differences in their expression levels in peripheral lymph nodes among these mice and included the data in Supplementary Figure 2.

      As we focused on the role of the CCL17/CCL22–CCR4 axes in atherosclerosis, we did not examine the expression of CCL3, which is not directly related to these axes. Evaluating the proportion of CCR8<sup>+</sup> Tregs is beyond the scope of this study, though we are interested in how CCR4 deficiency changes this population during atherosclerotic lesion development.

      Issue (2) According to IFNγ and IL-17 expressing TCD4<sup>+</sup> subclasses, Th1 and Th17 cell subset levels increase in the spleen (Figure 3B-D) and para-aortic lymphoid nodes (Figure 4E) in CCR4 absence. A comparison of the CCR4 dependence for the migration of Th17 and Th1 cell subsets to the aorta was not performed in this atherosclerosis model; this study could help to understand the mechanisms associated with the aortic inflammation development.

      To evaluate the migration of Th1 or Th17 cells to the aorta, we would need to specifically isolate them from the peripheral lymphoid tissues of Apoe<sup>-/-</sup> or Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice and adoptively transfer them into recipient Apoe<sup>-/-</sup> mice. However, it is impossible to isolate live Th1 or Th17 cells because no specific cell surface markers that would allow us to separate these cells are available.

      Issue (3) The numbers of Kaede Treg cells detected in the aorta were extremely low in both Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice (Figure 5I), opening results to question. Besides, the flow cytometry assay used for determining Kaede Treg cells in tissues was not well described. How were cell viability and formation of doublets examined to avoid artefacts? The gating strategy used to ensure a confident analysis of Kaede Tregs, particularly in the aorta, should be included as supplementary material.

      The extremely low number of Kaede-expressing Tregs that migrated to the aorta of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice may be due to the small number of transferred Tregs. As another explanation for this finding, Tregs may rarely migrate to the aorta under hypercholesterolemic conditions. We did not check the viability or doublets of Kaede-expressing Tregs because we thought that such experimental procedures would not considerably affect the results. We provided the gating strategy for the flow cytometric analysis of Kaede-expressing Tregs in peripheral lymphoid tissues and aortas in Supplementary Figure 11.

      Other comments:

      Issue (1) As an alternative for statistical data analysis from independent experiments, two-way ANOVA with Tukey's post hoc (for data normally distributed) or the Mack-Skillings exact test with Conover's post hoc multiple comparison test (for a two-way layout in non-parametric conditions) could improve analysis.

      We performed statistical analysis in Figure 5A according to the reviewer’s suggestion.

      Issue (2) For future work, employing recombinant pseudo-receptor proteins capable of neutralizing chemokines (doi: 10.1016/j.jhep.2021.08.029) might help as an alternative to complete knockout mice.

      We thank the reviewer for giving us the information on an interesting approach as an alternative to CCR4-deficient mice.

    1. Author Response

      The following is the authors’ response to the original reviews.

      REVIEWER 1

      The claim that olivooid-type feeding was most likely a prerequisite transitional form to jet-propelled swimming needs much more support or needs to be tailored to olivooids. This suggests that such behavior is absent (or must be convergent) before olivooids, which is at odds with the increasing quantities of pelagic life (whose modes of swimming are admittedly unconstrained) documented from Cambrian and Neoproterozoic deposits. Even among just medusozoans, ancestral state reconstruction suggests that they would have been swimming during the Neoproterozoic (Kayal et al., 2018; BMC Evolutionary Biology) with no knowledge of the mechanics due to absent preservation.

      Thanks for your suggestions. We agree that ancestral swimming medusae may have appeared before the early Cambrian, even in the Neoproterozoic. However, discussion of the affinities of Ediacaran cnidarians is severely limited by the lack of information on their soft anatomy, so their swimming mechanics are hard to infer in the absence of such preservation. Olivooids from the basal Cambrian Kuanchuanpu Formation can reasonably be considered cnidarians based on their radial symmetry, external features, and especially their internal anatomy (Bengtson and Yue 1997; Dong et al. 2013; 2016; Han et al. 2013; 2016; Liu et al. 2014; Wang et al. 2017; 2020; 2022). The simulation experiment here was based on the soft tissue preserved in olivooids.

      While the lack of ambient flow made these simulations computationally easier, these organisms likely did not live in stagnant waters even within the benthic boundary layer. The absence of ambient unidirectional laminar current or oscillating current (such as would be found naturally) biases the results.

      Many thanks for your suggestion concerning the lack of ambient flow in the simulations. We revised the section “Perspectives for future work and improvements” (lines 381-392 in our revised version of manuscript). Conducting the simulations without ambient flow reduces the computational cost and makes the simulations easier, whereas adding ambient flow can lead to poorer convergence and more technical issues. We also strongly agree that these (benthic) organisms did not live in stagnant waters, as discussed in Liu et al. 2022. However, reducing computational complexity is not the main reason that ambient flow was not incorporated in the simulations. As we discussed in the section “Perspectives for future work and improvements”, our work focuses on the theoretical effect of the polyp’s dynamics (based on fossil observation and hypothesis) on the ambient environment (i.e., how fast the organism inhales water from the ambient environment) rather than the effect of ambient flow on the organism (e.g., drag forces), which is what previous palaeontological CFD simulations, based on fossil morphology and hydrodynamics, mainly focused on. We are therefore mainly concerned with the flow velocity above or near the peridermal aperture (and the vorticity computed in this paper) generated only by the polyp’s own dynamics, without the interference of ambient flow (as in many CFD simulations of modern jellyfish, e.g., McHenry & Jed 2003; Gemmell et al. 2013; Sahin et al. 2009, all of which were conducted under hydrostatic conditions). Adding ambient flow to our simulations would “bias” the flow velocity profiles we expect to obtain in this case.

      Nevertheless, we do agree that ambient unidirectional laminar current or oscillating current plays an important role in the feeding and respiration behavior of Quadrapyrgites. Further investigation would require designing a new set of simulations and is beyond the scope of this work. In Zhang et al. 2022, we conducted CFD simulations incorporating a randomly generated surface that imitated an uneven seabed, over which unidirectional laminar current and oscillating current (or vortices) formed and acted on Quadrapyrgites placed at different locations on the surface. We consider that combining the method used in Zhang et al. 2022 with the velocity profiles collected in this work to conduct new simulations may be a promising way to further investigate the effect of ambient current on the organisms’ active feeding behavior.

      There is no explanation for how this work could be a breakthrough in simulation gregarious feeding as is stated in the manuscript.

      Thanks for your suggestion. We revised the section “Perspectives for future work and improvements” (lines 396-404 in our revised version of manuscript).

      Simulating gregarious active feeding behavior generally requires modelling multiple (or clustered) organisms, which is beyond present computational capability. However, the simulation results obtained here make it possible to build a simplified model for this purpose: we may apply an inlet or outlet boundary condition to the peridermal aperture of Quadrapyrgites, using the corresponding exhalation or inhalation flow velocity profiles collected in this work. In this way we can obtain a simplified active-feeding Quadrapyrgites model without using the computationally expensive moving mesh feature. Such a model can be used singly or in clusters to investigate gregarious feeding behavior in the presence of ambient current. This is how the present work could be a “breakthrough” in simulating gregarious feeding. However, we modified the corresponding description in the section “Perspectives for future work and improvements” to make it more appropriate.

      Throughout the manuscript there are portions that are difficult to digest due to grammar, which I suspect is due to being written in a second language. This is particularly problematic when the reader is attempting to understand if the authors are stating an idea is well documented versus throwing out hypotheses/interpretations.

      Thanks. The manuscript has again been checked and corrected by a native English speaker.

      Line-by-line:

      L023: "Although fossil evidence suggests..."

      L026: "demonstrated" instead of "proven"

      We corrected them accordingly.

      L030: "The hydrostatic simulations show that the..." Maybe I'm confused by the wording, but shouldn't this be the case since it's a set part of the model?

      As stated in our manuscript, all the simulations were conducted under “hydrostatic” conditions. We originally intended the word “hydrostatic” here to emphasize the simulation condition set in our work. However, read literally, it could suggest that some of our simulations were “hydrostatic” while others were not. Deleting the word “hydrostatic” here (line 30) may therefore be appropriate to eliminate confusion.

      L058: "lacking soft tissue" Haootia preservation suggests it is soft tissue (Liu et al., 2014), unless the preceding sentence is not including Haootia, in which case this section is confusingly worded

      Thank you. We deleted the sentence “However, their affinities are not without controversy as the lacking soft tissue.”

      L085: change "proxy"

      Yes, we changed it to “Considering their polypoid shape and cubomedusa-type anatomy, the hatched olivooids appear to be a type of periderm-bearing polyp-shaped medusa (Wang et al. 2020)” (lines 86-88).

      L092: "assist in feeding" has this been stated before? Citation needed, else this interpretation should primarily be in the discussion

      Yes, you are right. We cited the reference at the end of the mentioned sentence (lines 91-94).

      L095: Remove "It is suggested that"

      Thanks for your suggestions. We corrected it.

      L100: "Probably the..." here to the end belongs in the discussion and not introduction.

      Thanks for your suggestions. We corrected the sentences.

      L108: "an abapical"

      Thanks for your suggestions. We revised it in line 107.

      L112: "for some distance" be specific or remove

      Yes, we deleted “for some distance” in line 111.

      L133: I can't find a corresponding article to Zhang et al., 2022. Is this the correct reference?

      The article Zhang et al. 2022 (entitled “Effect of boundary layer on simulation of microbenthic fossils in coastal and shallow seas”) was in press when we first submitted this manuscript. We added the DOI (10.13745/j.esf.sf.2023.5.32) to the corresponding entry in the References, which may help readers locate this article more easily.

      L138: You can't be positive that your simulations "provide a good reproduction of the movement." You have attempted to reconstruct said movement, but the language here is overly firm - as is "pave a new way"

      Thanks for your suggestions. We corrected the corresponding description (lines 138-140) to make it more rigorous.

      L149: "No significant change" implies statistics were computed that are not presented here.

      The statistics were computed using built-in functions of Excel and are presented in Table supplement 2 (deposited in figshare, https://doi.org/10.6084/m9.figshare.23282627.v2) rather than in the manuscript. Specifically, the errors were computed using the formula for relative error, which is defined by:

      where u_z denotes the velocity profile collected at each cut point z with the current mesh parameters, u_z^* denotes the velocity profile collected at each cut point z with the next finer mesh parameters, and i denotes each time step (from 0.01 to 4.0). The total average error was computed by averaging error_i over all time steps. The results are marked in red in Table supplement 2. We revised the corresponding description in lines 140-146.
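
      The definitions above can be sketched in code. This is a minimal sketch, not the authors' Excel calculation: it assumes that the error at each time step is the absolute difference between the two meshes divided by the finer-mesh value, which is the usual meaning of relative error but is not confirmed by the text here. The function names and the choice to skip time steps where the finer-mesh velocity is zero are hypothetical.

```python
def relative_errors(u_current, u_finer):
    """Per-time-step relative error at one cut point z.

    u_current: velocities from the current mesh, one value per time step i.
    u_finer:   velocities from the next finer mesh at the same time steps.
    Assumed form: error_i = |u_i - u_i*| / |u_i*|.
    """
    if len(u_current) != len(u_finer):
        raise ValueError("both series must cover the same time steps")
    errors = []
    for u, u_star in zip(u_current, u_finer):
        if u_star == 0.0:
            # Assumption: a zero reference velocity gives an undefined
            # relative error, so that time step is skipped.
            continue
        errors.append(abs(u - u_star) / abs(u_star))
    return errors


def total_average_error(u_current, u_finer):
    """Mean of error_i over all time steps that have a nonzero reference."""
    errors = relative_errors(u_current, u_finer)
    if not errors:
        raise ValueError("no time step has a nonzero reference velocity")
    return sum(errors) / len(errors)
```

      Calling `total_average_error` once per cut point would give one summary value per cut point and mesh pair, which is the kind of figure a mesh comparison table reports.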

      L152: "line graphs" >> "profiles"

      Thanks for your suggestions. We corrected it in line 144.

      L159: remove "significant" unless statistics are being reported, in which case those need to be explained in detail.

      Thanks for your suggestions. We removed "significant" and corrected the corresponding sentences in lines 150-153 to make them more rigorous.

      L159: I would recommend including a supplemental somewhere that shows how tall the modeled Quadrapyrgites is and where the cut lines exist above it.

      Many thanks for your suggestions. We agree that it is appropriate to state the height of the modeled Quadrapyrgites and the position of each cut point. We therefore added a supplementary figure (entitled Figure supplement 1) to illustrate these, and made the corresponding additions in the last paragraph of the section “Computational fluid dynamics” (line 455 and line 535).

      L183: "The maximum vorticity magnitude was set..." I do not follow what this threshold is based on the current phrasing.

      The vorticity magnitude mentioned here is the visualisation range of the color scalebar, which can be set manually in the software. Positive values represent vortices rotating counterclockwise on the cut plane, while negative values represent vortices rotating clockwise. In this case, the visualisation range is [-0.001,0.001] (i.e., the absolute value of 0.001 is the threshold), as shown by the color scalebar in Figure 7. Decreasing the threshold, for example by setting the visualisation range to [-0.0001,0.0001], captures smaller vorticity on the cut plane, as in the figure below on the left, whereas setting the range to [-0.01,0.01] focuses on larger vorticity, as in the figure below on the right. Based on our trials, we found [-0.001,0.001] to be an appropriate range for visualizing the vortex near the periderm. To be more rigorous and to avoid confusion, we modified the description in the corresponding place of the manuscript (lines 172-174).

      Author response image 1.
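
      The effect of the visualisation range can be illustrated with a short sketch. It is not the plotting routine of the simulation software; it only assumes that vorticity values outside a symmetric range [-threshold, threshold] are drawn at the end colors of the scalebar, and the function names are hypothetical.

```python
def clip_to_range(values, threshold):
    """Saturate vorticity values to the symmetric range [-threshold, threshold]."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return [max(-threshold, min(threshold, v)) for v in values]


def rotation_sense(vorticity):
    """Sign convention from the response: positive means counterclockwise."""
    if vorticity > 0:
        return "counterclockwise"
    if vorticity < 0:
        return "clockwise"
    return "none"
```

      With a narrow range, strong vortices saturate and weak ones become visible; with a wide range, weak vortices fall near the middle of the scale and only strong ones stand out.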

      L201: "3.9-4 s"

      Thanks, we corrected it in line 191.

      L269: "Sahin et al.,..." add to the next paragraph

      Yes, we rearranged the corresponding two paragraphs (lines 258-289).

      L344: "Higher expansion-contraction..." this needs references and/or more justification.

      Thanks. We deleted the sentence.

      L446: two layers of hexahedral elements is a very low number for meshing boundary layer flow

      Many thanks for your question. We agree that an appropriate hexahedral element mesh for the boundary layer is essential to resolve boundary flow, especially where a turbulence model with a wall function, such as the standard k-epsilon model, is adopted. In this case, the boundary flow is not the main point, since the velocity profile was collected above the peridermal aperture rather than near the no-slip wall region. In addition, we do not need drag computations (related to shear stress and pressure difference), which require a more accurate reconstruction of flow velocity near no-slip walls, as in previous palaeontological CFD simulations. Thus, we think two layers of hexahedral elements are enough. Furthermore, the hexahedral elements added to the peridermal aperture domain, as illustrated in the figure below, let the velocity near the wall vary smoothly and thus benefit the convergence of the simulations.

      Author response image 2.

      L449: similar to comments regarding lines 146-148, key information is missing here. Figure 3C appears to be COMSOL's default meshing routine. While it is true that the domain is discretized in a non-uniform manner, no information is provided as to what mesh parameters were "tuned" to determine "optimal settings" or what those settings are (or how they are optimal).

      Many thanks for your question. The specific mesh parameters are listed in Table supplement 3, and corresponding descriptions and modifications were made in lines 475-479 and lines 542-549. In most CFD cases, the mesh parameters need to be tuned to balance computational cost and accuracy. If the difference between the result obtained from the present mesh and that obtained from the next finer mesh is in the range of 5%-10%, the present mesh is expected to be “optimal”. To achieve this, we prescribed several sets of different mesh parameters (mainly the maximum and minimum element size) for each subdomain (the inner cavity, the peridermal aperture, and the domain outside the fossil model) of the whole computational domain in the test model. We then refined the mesh step by step as far as possible and adjusted the element size of the subdomains to find suitable mesh parameters; that is how the mesh parameters were "tuned". We agree that we should state explicitly which mesh parameters were tuned and what those settings are.
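
      The selection rule described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: it treats the upper end of the 5%-10% range as a tolerance and returns the coarsest mesh whose result differs from the next finer mesh by no more than that tolerance. The function name, the mesh labels, and the tolerance default are hypothetical.

```python
def pick_mesh(errors_by_mesh, tolerance=0.10):
    """Return the coarsest mesh that is acceptably close to the next finer one.

    errors_by_mesh: list of (mesh_label, relative_difference) pairs ordered
    from coarsest to finest, where relative_difference compares that mesh
    with the next finer mesh (0.10 means 10%).
    """
    for label, difference in errors_by_mesh:
        if difference <= tolerance:
            return label
    return None  # no tested mesh met the tolerance; refine further
```

      For example, with made-up differences of 32%, 14%, 7% and 3% for four successively finer meshes (illustrative numbers, not values from Table supplement 3), the third mesh would be selected, since refining beyond it changes the result by less than the tolerance at higher computational cost.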

      Figure 7 should have the timesteps included and the scaling of the arrows should be explicit in the caption

      Many thanks for your suggestions. We intended the white arrows in Figure 7 to represent the velocity orientation rather than the true velocity scale (whereas the white arrows in Animation supplement 1 represent a normalized velocity profile). To avoid confusion, we revised Figure 7 to include timesteps and arrows that represent a normalized velocity profile, making it consistent with Animation supplement 1. The corresponding modification was also made in the caption of Figure 7.

      The COMSOL simulation files (raw data) are missing from the supplemental data. These should be posted to Dryad or here.

      We uploaded the files to Dryad (https://datadryad.org/stash/share/QGDSqLh8HOll7ofl6JWVrqM57Rp62ZPjvZU0AQQHwTY), and added the corresponding link to section “Data Availability Statement”.

      REVIEWER 2

      Lines 319-334: The omission in this paragraph of Paraconularia ediacara Leme, Van Iten and Simoes (2022) from the terminal Ediacaran of Brazil is a serious matter, as (1) the medusozoan affinities of this fossil are every bit as well established as those of anabaritids, Sphenothallus, Cambrorhytium and Byronia, and (2) P. ediacara was a large (centimetric) polyp, the presence of which in Precambrian times is thus a problem for the simple evolutionary scenario (very small polyps followed later in evolutionary history by large polyps) outlined in the paragraph. Thus, Paraconularia ediacara must be mentioned in this paper, both in connection with the early evolution of size in cnidarian polyps and in other places where the early evolution of cnidarians is discussed.

      Thanks for your important suggestions. We added some sentences in lines 323-326 as follows: “Significantly, the large-bodied, skeletonized conulariids-like Paraconularia found from the terminal Ediacaran Tamengo Formation of Brazil confirmed their ancient predators like the extant medusozoans and suggested the origin of cnidarians even farther into the deep evolutionary scenario (Leme et al. 2022).”

      Line 23. Delete the word, been.

      Line 25. Replace conjecture with conjectural.

      Line 26. Delete the word, the before calyx-like.

      Line 32. Replace consisting with consistent.

      Thanks for your suggestions. We corrected all of them.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Thanks for your comments and suggestions concerning our manuscript entitled “miR-252 targeting temperature receptor CcTRPM to mediate the transition from summer-form to winter-form of Cacopsylla chinensis”. These comments are all of great importance and were extremely helpful for revising and improving our manuscript. We have revised the manuscript carefully according to all your comments. Our point-by-point responses to the comments are listed below.

      Reviewer #1 (Recommendations For The Authors):

      1) If the authors wish to improve their phylogenetic analysis, I strongly suggest using their hemipteran sequences alongside the Drosophila homolog and at least all of the human paralogs. This should be generally sufficient to recapitulate the generally accepted TRPM phylogeny. If the authors contend that this is in fact a separate lineage from other insect TRPMs, a phylogeny that is as taxonomically inclusive as possible, and as methodologically rigorous as possible, would be ideal.

      Thanks for your great suggestion. We have redone the phylogenetic analysis in Figure S1B using the CcTRPM sequence together with 16 homologous sequences from other species, including 8 human paralogs, 1 Mus musculus homolog, 1 Drosophila homolog, and 6 insect homologs. The related description was added in Line 489-491 and Line 1044-1049 of our revised manuscript.

      2) If the authors wish to conclude that this is a cold-sensitive ion channel, I strongly suggest repeating at least the Ca2+ imaging with a cold stimulus. In the absence of this experiment, I think that the conclusions need to be significantly softened/hedged, making it clear that the only evidence of cold sensitivity is indirect (resulting from the knockdown experiments).

      Thanks for your excellent suggestion. We have performed Ca2+ imaging with a cold stimulus of 10°C. As expected, a clear increase in Ca2+ concentration was observed upon treatment with the 10°C cold stimulus, similar to that seen with menthol treatment. We can therefore conclude that CcTRPM is a direct cold-sensitive ion channel in C. chinensis. We have added the Ca2+ imaging result with the 10°C cold stimulus in Figure 2D and moved the results of Ca2+ imaging with menthol treatment to Figure S2I. The related results and methods were added in Line 193-200, Line 919-923, and Line 1065-1069 of our revised manuscript.

      3) Lines 173 and 181: The method used to identify the putative transmembrane domains was not described (although the 3D model does have the correct TRP structure, these methodological details would be appreciated).

      Thanks for your great suggestion. We used the online tool SMART (Simple Modular Architecture Research Tool) to identify the putative transmembrane domains of CcTRPM, and have added these methodological details in Line 485-487 of the Materials and Methods of our revised manuscript.

      4) Lines 176-178: The authors state that "phylogenetic analysis revealed that CcTRPM was most closely related to the DcTRPM homologue (Diaphorina citri, XP_017299512.2), which was consistent with the evolutionary relationships predicted from the multiple alignment of amino acid sequences." The meaning of this sentence is unclear to me. I'm not sure what it means to be "consistent with the evolutionary relationships predicted from the multiple alignment of amino acid sequences."

      Thanks for your excellent suggestion. We have revised this sentence in Line 176-179 of our revised manuscript.

      5) Lines 474-475: The authors state that the NCBI database was used to identify homologous sequences, but there isn't sufficient methodological detail to repeat the search. For example, was this a BLASTP search? Was it taxonomically restricted? What statistical thresholds for homology inference were used? These details would be much appreciated.

      Thanks for your great suggestion. We used BLASTP against the NCBI database to identify homologous sequences and preferred representative species for which TRPM sequences have been reported. We have added more methodological detail on the phylogenetic analysis in Line 489-491 of our revised manuscript.

      6) It would be very interesting, but not critical, to know if menthol and borneol alone have an effect on cuticle thickness.

      Thanks for your excellent suggestion. We did test the effect of menthol and borneol alone on cuticle thickness at the beginning of the study. At 25°C, treatment with menthol or borneol alone induced a 30-40% transition of 1st instar nymphs from summer-form to winter-form, but had only a slight effect on cuticle thickness, not as strong as that of the 10°C low-temperature treatment, because of the opposing effect of 25°C. At 10°C, however, we could not determine whether the effect on cuticle thickness came from the low temperature or directly from menthol and borneol.

      7) It would be interesting, but not critical, to confirm the authors' ab initio protein folding by comparing their model to the AlphaFold2-derived model, either by folding it themselves or extracting it from the AlphaFold Protein Structure Database, if it has already been folded by DeepMind.

      Thanks for your great suggestion. We have predicted the tertiary protein structure of CcTRPM with AlphaFold2, and the result is shown in Author response image 1. Compared with the result in Figure 2A, the conserved ankyrin repeats (ANK) and six transmembrane domains were very similar.

      Author response image 1.

      The tertiary structures of CcTRPM predicted with AlphaFold2 software.

      8) Figures 1F-G, 3F, 4A-B, 5G-J, S6C, and S7C-D do not plot replicates (although these are plotted in other figures).

      Thanks for your excellent suggestion. Except for Figure 1F-G, which are stacked grouped graphs to which replicates cannot be added, we have plotted the replicates in Figures 3F, 4A-B, 5G-J, S6C, and S7C-D of our revised manuscript.

      9) Figure 5A-C, and associated text: The significance of these findings is somewhat lost on me, coming from a position of general naivety concerning chitin biosynthesis. My interpretation of Figure 5A was that each of these steps was a necessary component of chitin biosynthesis. It was thus surprising that not all of the steps were required. I think it would be exceptionally helpful if the authors spent more time describing this pathway, alternative pathways to generating the intermediate steps, and ultimately, their hypothesis of why only two steps seem critical.

      Thanks for your great suggestion. The chitin biosynthesis pathway in Figure 5A was modified from Doucet and Retnakaran, 2012. De novo biosynthesis of chitin has eight enzymatic steps: one for trehalose, two enzymes in glycolysis, four enzymes in the hexosamine pathway, and one in chitin synthesis. Glycolysis and the hexosamine pathway are two complex cellular metabolic processes. We suppose that there are two reasons why not all of these steps were required: (1) the function of some enzymes may be replaced or supplemented by other enzymes; for example, the functions of hexokinase and glucokinase are similar. (2) The absence of obvious phenotypic defects might be caused by insufficient RNAi interference efficiency. It would therefore be worthwhile to further study the functions of these chitin biosynthesis enzymes using CRISPR-Cas9 in the future. We have added more description of this chitin biosynthesis pathway in Line 379-390 of our revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      1) Line 19, should be morphological transition.

      Thanks for your excellent suggestion. We have changed “behavioral transition” to “morphological transition” in Line 19 of our revised manuscript.

      2) Line 21, delete the novel.

      Thanks for your excellent suggestion. We have deleted the word “novel” in Line 21 of our revised manuscript.

      3) Fig. 2B, did authors examine the CcTRPM expression level before 3 d? Given that CcTRPM acts as a cold sensor, it is supposed to respond to temperature change quickly.

      Thanks for your excellent suggestion. We have examined the CcTRPM expression level at 1 d and 2 d after 10°C treatment compared with 25°C treatment. As expected, CcTRPM expression levels were also clearly increased at 1 d and 2 d after 10°C treatment. We have added the relevant results in Figure S2F and the related description in Line 184-185, Line 500, and Line 1059-1060 of our revised manuscript.

      4) Fig. 2I, from the figure legend and the text in the panel, it's hard for readers to understand what the authors intend to say. This data is important since knockdown of CcTRPM decreases the winter-form from 90% to 30% at 10℃. Provide more information in the figure legend.

      Thanks for your excellent suggestion. We have added more information in the figure legend of Figure 2I in Line 933-939 of our revised manuscript.

      5) Line 224, ...CcTRPM functions as a molecular switch to modulate the transition from .... The phrase 'molecular switch' is inappropriate because knockdown of CcTRPM partially decreases the form ratio as shown in Fig.2I instead of reversing the effect completely. So, use other words instead of 'molecular switch'.

      Thanks for your excellent suggestion. We have changed “a molecular switch” to “an essential molecular signal” in Line 225 of our revised manuscript.

      6) Fig. 4G, this data is important. It's nice to see that this data is provided.

      Thanks for your excellent suggestion. We have provided the data of Figure 4G in Table S2 of our revised manuscript.

      7) Authors showed that CcTRPM functions as a cold receptor to regulate the transition of C. chinensis from summer-form to winter-form. Does this mean that a heat receptor gene functions oppositely by transiting winter-form into summer-form? Did the authors test the function of a heat TRP in the form transition? At least, discuss this in the discussion part.

      Thanks for your excellent suggestion. The TRPV ion channel has been reported by David Julius to function as a heat receptor in mammals (Caterina et al., 1997; Cao et al., 2013). We therefore suppose that TRPV may function as a heat receptor to induce the transition from winter-form to summer-form in C. chinensis. The relevant tests are ongoing. We have added two references in Line 681-686 and some discussion about the heat receptor in Line 341-345 of our revised manuscript.

      8) Line 433, which tissue was used for transmission electron microscopy?

      Thanks for your excellent suggestion. The thorax was used for transmission electron microscopy, and we have added the information in Line 448 and Line 453 of our revised manuscript.

      9) How is the conservation of miR-252? Does the regulatory role of CcTRPM and miR-252 apply to the psylla family in addition to C. chinensis?

      Thanks for your excellent suggestion. Besides C. chinensis, the summer-form and winter-form phenomenon also exists in other psylla species, such as Cyamophila willieti. Because no genomic information has been reported for most psylla species, we could not evaluate the conservation of miR-252 among psylla species. However, it would be worthwhile and interesting to clarify in the future whether the functions of TRPM and miR-252 are conserved.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This is a valuable study in which the authors provide an expression profile of the human blood fluke, Schistosoma mansoni. A strength of this solid study is in its inclusion of in situ hybridisation to validate the predictions of the transcript analysis.

      We thank the reviewers and the editor for their effort and expertise in reviewing our manuscript. We have made changes based on the reviews and believe this has greatly strengthened our manuscript. We appreciate their insightful comments and suggestions.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this work, the authors provide a valuable transcriptomic resource for the intermediate free-living transmission stage (miracidium larva) of the blood fluke. The single-cell transcriptome inventory is beautifully supplemented with in situ hybridization, providing spatial information and absolute cell numbers for many of the recovered transcriptomic states. The identification of sex-specific transcriptomic states within the populations of stem cells was particularly unexpected. The work comprises a rich resource to complement the biology of this complex system, however falls short in some technical aspects of the bioinformatic analyses of the generated sequence data.

      (1) Four sequencing libraries were generated and then merged for analysis, however, the authors fail to document any parameters that would indicate that the clustering does not suffer from any batch effects.

      We thank the reviewer for this comment which has given us the opportunity to elaborate on this interesting point. Consequently, we have added evidence to show that the data do not suffer from batch effects between samples (e.g. between sorted samples 1 and 4, and unsorted samples 2 and 3). We now show that there are contributions to all clusters from sorted and unsorted samples and highlight the benefits to using both conditions in a cell atlas with unknown cell types.

      Accordingly, we have now added the following paragraph to line 153:

      There were contributions from sorted and unsorted samples in almost all clusters (except ciliary plates). We found that some cell/tissue types had similar recovery from both methods (e.g. Stem A, Muscle 2, and Tegument), others were preferentially recovered by sorting (e.g. Neuron 1, Neuron 4, and Stem E), and some were depleted by sorting (e.g. Parenchyma 1, Protonephridia, and Ciliary plates) (Supplementary Figure 1, Supplementary Table 4). This variation in recovery, therefore, enabled us to maximise the discovery and inclusion of different cell types in the atlas.

      We have now added a Supplementary Figure 1 showing the contribution of sorted and unsorted cells to the Seurat clusters. We have also included a Supplementary Table 4 detailing the cell number contribution for both conditions and the percentages in order to easily compare differential recovery between cell types.
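
      As an illustration of the kind of tabulation described above, the following is a minimal sketch, in plain Python, of how per-cluster contributions from sorted and unsorted samples could be counted and converted to percentages. The function name, the input layout and the toy numbers are assumptions made for illustration; they are not the authors' code and not the values in Supplementary Table 4.

```python
from collections import Counter


def condition_contributions(cells):
    """Tabulate how many cells each condition contributes to each cluster.

    cells: iterable of (cluster, condition) pairs, one pair per cell.
    Returns {cluster: {condition: (count, percent_of_cluster)}}.
    """
    cells = list(cells)
    pair_counts = Counter(cells)
    cluster_totals = Counter(cluster for cluster, _ in cells)
    conditions = sorted({condition for _, condition in cells})

    table = {}
    for cluster, total in cluster_totals.items():
        table[cluster] = {
            condition: (
                pair_counts[(cluster, condition)],
                100.0 * pair_counts[(cluster, condition)] / total,
            )
            for condition in conditions
        }
    return table


# Toy data (made up): one cluster recovered equally by both methods,
# one enriched by sorting, and one seen only in unsorted samples.
toy_cells = (
    [("Stem A", "sorted")] * 50
    + [("Stem A", "unsorted")] * 50
    + [("Neuron 1", "sorted")] * 80
    + [("Neuron 1", "unsorted")] * 20
    + [("Ciliary plates", "unsorted")] * 10
)

contributions = condition_contributions(toy_cells)
for cluster, row in contributions.items():
    print(cluster, row)
```

      A cluster populated by only one sample or condition would stand out in such a table as a candidate batch effect, whereas mixed contributions across clusters argue against one.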

      These are added to the manuscript.

      (2) Additionally, the authors switch between analysis platforms without a clear motivation or explanation of what the fundamental differences between these platforms are. While in theory, any biologically robust observation should be recoverable from any permutation of analysis parameters, it has been recently documented that the two popular analysis platforms (Seurat - R and scanPy python) indeed do things slightly differently and can give different results (https://www.biorxiv.org/content/10.1101/2024.04.04.588111v1). For this reason, I don't think that one can claim that Seurat fails to find clusters resolved by SAM without running a similar pipeline on the cluster alone as was done with SAM/scanPy here. The manuscript itself needs to be checked carefully for misleading statements in this regard.

      We thank the reviewer for this comment and agree that it’s important to increase the clarity on this matter. We have added additional detail to explain that results of subclustering Neuron 1 using Seurat and SAM/ScanPy were broadly similar, but that we presented the results from the SAM/ScanPy analysis due to the strengths of SAM in detecting small differences in gene expression (Tarashansky et al., 2019 PMID: 31524596). We have included here the UMAP showing subclustering of Neuron 1 in Seurat for comparison.

      Author response image 1.

      UMAP showing subclustering of Neuron 1 cluster in Seurat (SCT normalisation, PC = 19, resolution = 0.3).

      We’ve added this additional text to the ‘Neuron abundance and diversity’ section on line 220:

      We explored whether Neuron 1 could be further subdivided into transcriptionally distinct cells by subclustering (Supplementary Figure 2; Supplementary Table 6) using the self-assembling manifold (SAM) algorithm (Tarashansky et al., 2019) with ScanPy (Wolf et al., 2018), given its reported strength in discerning subtle variation in gene expression (Tarashansky et al., 2019), although a similar topology was subsequently found using Seurat.

      (3) Similarly, the manuscript contains many statements regarding clusters being 'connected to', or forming a 'bridge' on the UMAP projection. One must be very careful about these types of statements, as the relative position of cells on a reduced-dimension cell map can be misleading (see Chari and Pachter 2023). To support these types of interpretations, the authors should provide evidence of gene expression transitions that support connectivity as well as stability estimates of such connections under different parameter conditions. Otherwise, these descriptors hold little value and should be dropped and the transcriptomic states simply defined as clusters with no reference to their positions on the UMAP.

      We thank the reviewer for this thoughtful comment. We agree and have rephrased those statements accordingly e.g. line numbers 218, 439, 543, and 557.

      (4) The underlying support for the clusters as transcriptomically unique identities is not well supported by the dot plots provided. The authors used very permissive parameters to generate marker lists, which hampers the identification of highly specific marker genes. Although this permissive approach can allow for extensive lists of upregulated genes for input into STRING/GO analyses, it is less useful for evaluating the robustness of the cluster states. Running Seurat::FindAllMarkers with more stringent parameters would give a more selective set of genes to display and thereby increase the reader's confidence in the validity of the profiles selected as being transcriptomically unique.

      The Reviewer is correct in noting that we used a permissive approach to enable a better understanding of the biology of each cluster, based on analysing enriched functions. However, we disagree about the suitability of the approach for finding markers. First, the permissive approach produced longer candidate lists, but those with the best AUC scores for each cluster are at the top of the list for each cluster. Second, some of the markers with lower expression also revealed interesting biology (e.g. Notum in the muscles). Furthermore, we used filtering on the marker genes lists to increase the minimum marker gene scores for analyses such as the GO analyses (details in the GO section of the methods). It’s important to stress that our approach also utilised validation by FISH for top marker genes, as well as biologically informative genes that were lower down the marker gene list.
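
      The difference between a permissive marker list and a more stringent one can be sketched as a filter-then-rank step. The following plain-Python example is a minimal sketch under stated assumptions: the record layout, the gene names, the AUC values and the threshold are all hypothetical and are not the parameters used in the manuscript.

```python
def top_markers(markers, min_auc=0.7, n=3):
    """Keep markers at or above min_auc and return the top n genes per cluster.

    markers: list of dicts with keys 'cluster', 'gene' and 'auc'.
    Returns {cluster: [gene, ...]} ordered from highest to lowest AUC.
    """
    by_cluster = {}
    for row in markers:
        if row["auc"] >= min_auc:
            by_cluster.setdefault(row["cluster"], []).append(row)

    return {
        cluster: [
            r["gene"]
            for r in sorted(rows, key=lambda r: r["auc"], reverse=True)[:n]
        ]
        for cluster, rows in by_cluster.items()
    }


# Hypothetical marker table; gene names and scores are placeholders.
toy_markers = [
    {"cluster": "Muscle 1", "gene": "gene_a", "auc": 0.95},
    {"cluster": "Muscle 1", "gene": "gene_b", "auc": 0.62},
    {"cluster": "Muscle 1", "gene": "gene_c", "auc": 0.81},
    {"cluster": "Neuron 1", "gene": "gene_d", "auc": 0.74},
    {"cluster": "Neuron 1", "gene": "gene_e", "auc": 0.55},
]

permissive = top_markers(toy_markers, min_auc=0.5)
stringent = top_markers(toy_markers, min_auc=0.7)
print(permissive)
print(stringent)
```

      In both settings the highest-scoring genes stay at the top of each list; raising the threshold only trims the lower-scoring tail, which is where weakly expressed but biologically informative genes would be lost.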

      (5) Figure 5B shows a UMAP representation of cell positions with a statement that the clustering disappears. As a visual representation of this phenomenon, the UMAP is a very good tool, however, to make this statement you need to re-cluster your data after the removal of this gene set and demonstrate that the data no longer clusters into A/B and C/D.

      We’ve added Supplementary Figure 13 to show that after removing WSR and ZSR genes and reclustering, the data no longer cluster into A/B and C/D, even at a higher resolution where clusters appear oversplit.

      Also, as a reader, these data beg the question: which genes are removed here? Is there an over-representation of any specific 'types' of genes that could lead to any hypotheses of the function? Perhaps the STRING/GO analyses of this gene set could be informative.

      We have performed GO-enrichment analyses on W-specific genes, Z-specific genes and both together compared to the rest of the genome, but we did not find very informative results (see Supplementary Table 13 that we have now added, line 464). This may be due to the large difference in size. There are approximately 900 Z-specific genes (two copies in males, one copy in females) but only approximately 30 W-specific genes, many of which have homologs in the Z-specific region of the genome. Instead, we suggest that tissue-specific regulation of gene dosage compensation is the more likely explanation, as reported for other species (Valsecchi et al. 2018).

      (6) How do the proportions of cell types characterized via in situ here compare to the relative proportions of clusters obtained? It does not correspond to the percentages of the clusters captured (although this should be quantified in a similar manner in order to make this comparison direct: 10,686/20,478 = ~50% vs. 7%), how do you interpret this discrepancy? While this is mentioned in the discussion, there is no sufficient postulation as to why you have an overabundance of the stem cells compared to their presence in the tissue. While it is true that you could have a negative selection of some cell types, for example as stated the size of the penetration glands exceeds both that of the 10x capabilities (40uM), and the 30uM filters used in the protocol, this does not really address why over half of the captured cells represent 'stem cells'. A more realistic interpretation would be biological rather than merely technical. For example, while the composition of the muscle cells and the number of muscle transcriptomes captured are quite congruent at ~20%, the organism is composed of more than 50% of neurons, but only 15% of the transcriptomic states are assigned to neuronal. Could it be that a large fraction of the stem cells are actually neural progenitors? Are there other large inconsistencies between the cluster sizes and the fraction of expected cells? Could you look specifically at early transcription factors that are found in the neurons (or other cell types) within the various stem cell populations to help further refine the precursor/cell type relationships?

      Yes, it is really interesting that more than 50% of cells in the animal are neurons whereas more than 50% of cells in scRNAseq data are stem cells. This dataset provides a unique opportunity to compare tissue composition in the whole animal to the corresponding single cell RNAseq dataset.

      To understand this phenomenon, Supplementary Table 17 shows the percentage of cells from each tissue type in the miracidium (identified via in situ hybridisation of tissue-type marker genes) and in the scRNAseq data.

      This table shows that the single cell protocol used in this study negatively selected for nerves and tegument, and positively selected for stem and parenchyma. The composition of the muscle and protonephridia cells and the number of muscle and protonephridia transcriptomes captured are quite congruent.
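
      The direction and size of this selection can be expressed as a log2 ratio of the captured fraction to the in situ fraction. The sketch below uses only the approximate figures quoted in this exchange (10,686 of 20,478 captured cells assigned to stem clusters versus about 7% in situ; about 15% versus 57% for neurons; about 20% for muscle in both). These are rounded values used for illustration, and the exact percentages in Supplementary Table 17 may differ.

```python
import math

# Approximate percentages quoted in the review and response (rounded).
in_situ_percent = {"Stem": 7.0, "Neuron": 57.0, "Muscle": 20.0}
scrnaseq_percent = {
    "Stem": 100.0 * 10686 / 20478,  # stem-cluster cells / all captured cells
    "Neuron": 15.0,
    "Muscle": 20.0,
}


def log2_enrichment(captured, expected):
    """log2(captured / expected) per tissue: > 0 over-represented, < 0 under-represented."""
    return {tissue: math.log2(captured[tissue] / expected[tissue]) for tissue in expected}


enrichment = log2_enrichment(scrnaseq_percent, in_situ_percent)
for tissue, value in enrichment.items():
    print(f"{tissue}: {value:+.2f}")
```

      A value near zero (muscle) indicates congruent recovery, a positive value (stem) indicates positive selection by the protocol, and a negative value (neurons) indicates negative selection.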

      This technical finding is also biologically consistent. For instance, the tegument cells span the body wall muscles, with the cell bodies below and a syncytial layer above. It is not known how the tegument fragments during the dissociation process, and which parts of the cells get packaged by the 10X GEMs. Because of tegumental structure, the cells are likely prone to damage, and we speculate that this is why the tegument cells are under-represented in our 10X data. Unusually shaped fragments may not have been captured in 10X GEMs and, of those that were, damaged or distressed tegument cells/fragments may have been excluded post-sequencing by QC filters including cell calling, mitochondrial percentage and low transcript count (e.g. a tegumental fragment with 100 transcripts would not have passed QC). Stem cells are spherical with a large nucleus:cytoplasm ratio, likely making them more robust during dissociation and more likely to be captured in 10X GEMs.

      We don’t think that a large fraction of the stem cells are actually neural progenitors because:

      (1) we used previously reported marker genes of different tissue types to identify the single cell RNAseq clusters, e.g. Ago2-1 for stem cells, which has been used in multiple life stages.

      (2) The stem cell transcriptomes express many previously reported stem cell marker genes.

      (3) We found that the stem cells from the single cell data generally had higher numbers of transcripts than the other cell types which is consistent with the Wang et al. 2013 observation that RNA marker POPO-1 could distinguish germinal (stem) cells from other cell types as they are RNA rich.

      (4) We also found higher numbers of ribosomal related transcripts in our stem cell transcriptomes, which is consistent with Pan’s observation that part of the distinct morphology of stem cells is densely packed ribosomes in the cytoplasm.

      In order to elaborate on this discussion we have generated new visualisations:

      (1) A UMAP of the stem cell marker ago2-1 (Supplementary figure 10), to further illustrate our evidence in classifying the stem cell clusters

      (2) A co-expression plot of the stem cell marker ago2-1 with neural marker complexin to confirm that there is little coexpression (the most coexpression being in Neuron 1 and Stem F). We identified that 15.56% of cells in the Stem F cluster show some expression of complexin (neural marker), suggesting that a small fraction of Stem F may be early/precursor neurons, but the gene expression indicates that the majority of cells in Stem F are more likely to be stem cells than any other tissue type. There is little to no complexin expression in the other stem clusters.

      (3) Expression plots of the 5 neurogenins (TFs involved in neuronal differentiation) we could identify using WormBase ParaSite in these data. Four of the five showed very little expression, and not in specific clusters. The fifth (Smp_072470) showed slightly more expression, though still sparse, mostly across the stem and neural clusters, but not enough to indicate that any of the stem clusters are neural progenitors.
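
      A figure such as the 15.56% of Stem F cells with some complexin expression is a per-cluster detection rate: the share of cells in a cluster with at least one count for the gene. The following is a minimal sketch in plain Python; the function name and the toy counts are illustrative assumptions, and it takes "some expression" to mean a count above zero, which may not match the threshold the authors used.

```python
def percent_expressing(counts, clusters, cluster_name):
    """Percentage of cells in cluster_name with a non-zero count for one gene.

    counts: per-cell counts for a single gene.
    clusters: per-cell cluster labels, in the same order as counts.
    """
    in_cluster = [c for c, label in zip(counts, clusters) if label == cluster_name]
    if not in_cluster:
        raise ValueError(f"no cells found for cluster {cluster_name!r}")
    detected = sum(1 for c in in_cluster if c > 0)
    return 100.0 * detected / len(in_cluster)


# Toy example (made-up counts for a neural marker across eight cells).
toy_counts = [0, 0, 3, 0, 5, 2, 0, 1]
toy_clusters = ["Stem F", "Stem F", "Stem F", "Stem F",
                "Neuron 1", "Neuron 1", "Neuron 1", "Neuron 1"]

stem_f_rate = percent_expressing(toy_counts, toy_clusters, "Stem F")
neuron_1_rate = percent_expressing(toy_counts, toy_clusters, "Neuron 1")
print(stem_f_rate, neuron_1_rate)
```

      Applying the same calculation to a stem marker and a neural marker in each cluster gives a direct, numerical counterpart to the co-expression UMAP.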

      Author response image 2.

      Coexpression UMAP showing the expression of stem cell marker Ago2-1 and neural marker complexin.

      Author response image 3.

      UMAPs showing the expression of five putative neurogenins of S. mansoni.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript the authors have generated a single-cell atlas of the miracidium, the first free-living stage of an important human parasite, Schistosoma mansoni. Miracidia develop from eggs produced in the mammalian (human) host and are released into freshwater, where they can infect the parasite's intermediate snail host to continue the life cycle. This study adds to the growing single-cell resources that have already been generated for other life-cycle stages and, thus, provides a useful resource for the field.

      Strengths:

      Beyond generating lists of genes that are differentially expressed in different cell types, the authors validated many of the cluster-defining genes using in situ hybridization chain reaction. In addition to providing the field with markers for many of the cell types in the parasite at this stage, the authors use these markers to count the total number of various cell types in the organism. Because the authors realized that their cell isolation protocols were biasing the cell types they were sequencing, they applied a second method to help them recover additional cell types.

      Schistosomes have ZW sex chromosomes and the authors make the interesting observation that the stem cells at this stage are already expressing sex (i.e. W)-specific genes.

      Weaknesses:

      The sample sizes upon which the in situ hybridization results and cell counts are based are either not stated (in most cases) or are very small (n=3). This lack of clarity about biological replicates and sample sizes makes it difficult for the reader to assess the robustness of the results and the extremely small sample sizes (when provided) are a missed opportunity to explore the variability of the system, or lack thereof.

      We have now added more details about the methods we used for validating cell type marker genes by in situ hybridisation. We have added to the methods that ‘We carried out at least three in situ hybridisation experiments for each marker gene we validated (each experiment was a biological replicate). From each experiment we imaged (by confocal microscopy) at least 10 miracidia (technical replicates) per marker gene experiment.’ on line 1036.

      In the figure legends we have added the number of miracidia that were screened, and documented the percentage of the screened larvae that showed the in situ gene expression pattern that is seen in the images in the figures, and that we described in the text.

      We manually segmented the nuclei of pan tissue marker genes, and we did this for one miracidium in the case of all tissues, except stem cells where we segmented stem cells in five larvae. Manual segmentation of gene expression in a confocal z-stack is very time consuming. We consider that the variability of different cell and tissue types (stereotypy) between miracidia is beyond the scope of this paper and can be investigated in future work.

      Although assigning transcripts to a given cell type is usually straightforward via in situ experiments, the authors fail to consider the potential difficulty of assigning the appropriate nuclei to cells with long cytoplasmic extensions, like neurons. In the absence of multiple markers and a better understanding of the nervous system, it seems likely that the authors have overestimated the number of neurons and misassigned other cell types based on their proximity to neural projections.

      This is a valid point, and we acknowledge the difficulties of assigning a nucleus to a cell using mRNA expression only and in the absence of a cell membrane marker. We tried to address this issue by labelling the cell membranes using an antibody against beta catenin after the HCR in situ protocol. This method has been used successfully on sections on slides (Schulte et al., 2024), but we failed to get usable results in our miracidia whole-mounts. The beta catenin localisation marked the membranes of the gland cells but didn’t do the same for the neurons or other cell types (see image below).

      Author response image 4.

      Image showing a maximum intensity projection of a subvolume of a confocal z-stack of a miracidium whole-mount in situ hybridisation (by HCR) for paramyosin, counterstained with a beta catenin antibody (1:600 concentration of Sigma C2206). The cell membrane of a lateral gland is clearly labelled, but those of the neurons of the brain and the paramyosin+ muscle cells are not.

      Our observation that 57% of the cells in a miracidium are neurons is high compared to the C. elegans hermaphrodite adult, in which 302 out of 959 cells are neurons (Hobert et al., 2016), but few studies have equivalent data with which to make comparisons. Despite this, and the limitation described above, we believe that we have not overestimated the number of neural cells. During the process of validating the marker genes and closely examining gene expression in hundreds of miracidia, we noted that the nuclei of different tissue types are distinct and recognisable (see figure below). The nuclei of stem, tegument and parenchymal cells are comparatively large and spherical with obvious nucleoli (i). The four nuclei of the apical gland cell are angular, pentagonal in shape and sit adjoining each other (inside red dashed circle, i-iii); those of the two lateral glands are bilaterally symmetrical and surrounded by flask-shaped cytoplasm (arrows, iv). The nuclei of the body wall muscle cells are peripheral and flattened on the outer edge (iii). The notum+ muscle cell nuclei are anterior of the apical gland (manuscript Figure 2E). The only other two tissue types are the nerves and protonephridia, and their nuclei are smaller and more compact/condensed. In situ expression of the protonephridia marker suggests that 6 cells make up the protonephridial system (manuscript Figure 4 B&E). Therefore, by process of elimination, the remaining nuclei should belong to neurons. The complexin expression pattern supports this, and we counted 209 nuclei that were surrounded by cpx transcript expression. To help readers interpret this for themselves, we have added confocal z-stacks of miracidia where tissue-level markers have been multiplexed (supplementary videos 18-20). We counted all tissue type cells individually, and the tissue type cell numbers added up to the overall cell count.

      Author response image 5.

      Image showing the diversity of nucleus morphology between tissue types in the miracidium.

      Biologically, it is not surprising that this larva is dominated by neural cells. It must navigate a complex aquatic environment and identify a suitable mollusc host in less than 12 hours. It is a non-feeding vehicle that must deliver the stem cells to a suitable environment where they can develop into the subsequent life cycle stage. Accordingly, the cell type composition reflects this challenge.

      The conclusion that germline genes are expressed in the miracidia stem cells seems greatly overstated in the absence of any follow-up validation. The expression scales for genes like eled and boule are more than 3 orders of magnitude smaller than those used for any of the robustly expressed genes presented throughout the paper. These scales are undefined, so it isn't entirely clear what they represent, but neither of these genes is detected at levels remotely high (or statistically significant) enough to survive filters for cluster-defining genes.

      Given that germ cells often develop early in embryogenesis and arrest the cell cycle until later in development, and that these transcripts reveal no unspliced forms, it seems plausible that the authors are detecting some maternally supplied transcripts that have yet to be completely degraded.

      We agree that the expression of genes such as eled and boule are low. We made this clear in the figure legends and text, and have now added scale information to the figure legends. We did not explore these genes as cluster-defining genes, partly due to their comparatively low levels of expression, but as genes already reported to be important in germ line specification. We found the expression of these genes to be consistent with our hypothesis that the Kappa stem cells may include germ line segregated cells, but our hypothesis does not rest on these lower-expressed genes.

      It is certainly possible that we have detected some maternally supplied transcripts in the miracidia stem cells. However, experiments to distinguish between zygotic and maternal transcripts using metabolic labelling of zygotic transcripts (e.g. Fishman et al. 2023) would be hard in this species due to the hard egg capsule and its ectolecithal embryogenesis. Therefore, this is out of scope for this work, but it would be a very interesting topic to follow up on and develop tools for.

      We have added these sentences to the Discussion ln 746 ‘Intriguingly, the presence of spliced-only copies of the germline defining genes eled and boule could suggest that they are maternal transcripts that have been restricted to the primordial germ cells during embryogenesis, as is the case in Zebrafish embryos (Fishman et al., 2023). An alternative explanation is that unspliced transcripts exist for these lowly expressed genes but their abundance was below our threshold for detection.’

      Reviewer #1 (Recommendations For The Authors):

      Ln 138: specify the version of Seurat used, and reference the primary papers for this software. Also, from the dot plot shown here, these do not all appear to be supported by unique gene sets. How was the final clustering determined? This information is in the methods section, but a summary here could make it more robust for the readership.

      In addition to the details in the methods section, we have added the version and referenced the version-specific primary paper for Seurat when it is first mentioned. We have also summarised the methods used to select the final clustering when we first present the results to aid in clarity.

      We added to line 140 ‘Using Seurat (version 4.3.0) (Hao et al., 2021), 19 distinct clusters of cells were identified, along with putative marker genes best able to discriminate between the populations (Figure 1C & D and Supplementary Table 2 and 3). We used Seurat’s JackStraw and ElbowPlot, along with molecular cross-validation to select the number of principal components, and Seurat’s clustree to select a resolution where clusters were stable (Hao et al., 2021).’

      Ln 147: isn't seven stem cell clusters a lot? See comment in public review.

      We did not have preconceived expectations of the number of stem cell clusters, and were guided by the data and gene expression. In doing so we also discovered that four of those clusters were likely only two ‘biologically or functionally distinct’ clusters, but these split into four clusters based on the expression of genes on the sex-specific regions of the chromosomes, which was both unexpected and interesting.

      Figure 1D: gene model names are un-informative for the general reader. Can you provide any putative gene identities here to render this plot interpretable? For example in the main text you state that Smp-085540 is paramyosin; please use this annotation in all your visual material (as is used in Figure 2A).

      We have added gene names to the dotplots in all figures with the locus identifier (minus the ‘Smp’ prefix) in brackets after the gene name.

      Ln 191:196 Identification of the two muscle clusters as circular and longitudinal muscles is very well supported. However, it would be interesting to look specifically at the genes that are different here. Did the authors attempt to specifically pull out genes differentially expressed between these two groups, or only examine the output of FindAllMarkers at this point?

      We did indeed look specifically for genes differentially expressed between the muscle clusters, the results of which can be found in Supplementary Table 5 (Line 206). This analysis revealed “Wnt-11-1 (circular) and MyoD (longitudinal) were among the most differentially expressed genes”, which were important findings in our understanding of the muscle cells in the miracidium.

      Ln 207: "connected to stem F" - does this refer specifically to their relative positions on the UMAP in Figure 1C? One must be very careful about these types of statements, as the relative position of cells on a reduced-dimension cell map can be misleading (public review).

      We agree, and have rephrased accordingly.

      Ln 209:211: Here the authors switch from Seurat (R) as an analysis package, to SAM (python) for subset analysis of one large neural cluster. The results indicate that there may be small populations of transcriptomically distinct neural subtypes also within the neural1 cluster, but that the vast majority of these cells do not express unique transcriptomic profiles. Also in the supplementary material for this (SF1) there is a question of whether or not there is any clustering according to batch effects.

      In general, I find the neuronal section a little difficult to follow and it is unclear how many unique profiles are present and which are documented with in situ. I would recommend re-running the analysis on the entire neural subset (n1:5: complexin positive) and generating an inventory of putatively unique neural states with the associated in situ validation altogether in a main figure.

      In response to comments above we have both clarified our reasoning for using SAM analysis, and presented more details on possible batch effects. We have gone through the neural system results in order to make it clearer for the reader to follow.

      Ln 236: here the authors introduce a STRING analysis for the first time. Also, this method requires some introduction for the general audience in terms of its goals and general functionality and output.

      We used STRING analysis on some well defined clusters to provide additional clues about function. At the first mention of STRING (neuron 3 results) we have added the following statement to give more introduction to the reader: “STRING analysis of the top 100 markers of Neuron 3 predicted two protein interaction networks with functional enrichment: ….”

      Ln. 280:281. It is unclear why Steger et al is referenced here. In what way does a description of neural and glandular cell transcriptomic similarity in a Cnidarian inform your data on a member of the Platyhelminthes? (which should also be referenced in the introduction: to which phylogenetic lineage does Schistosoma belong).

      We have now added that the Schistosoma belong to the Platyhelminths on the first line of the introduction.

      Ln 295 we have added ‘We expected to find a discrete cluster(s) for the penetration glands, and that it would show similarities to the neural clusters (as glandular cells arise from neuroglandular precursor cells in other animals, such as the sea anemone, Nematostella vectensis, Steger et al., 2022).’

      Ln 339: explain the motivation for generating a further plate-based scRNA of the ciliary plates.

      We wished to include the ciliary plates alongside the gland cells for plate based RNAseq as they are unique to the miracidium stage and wanted to make sure we had captured them in this study.

      Ln 345: Define the tegumental cells for the general reader.

      We have added further description of tegument cells in the introduction and tegument results section (e.g. on lines 61 and 366).

      Ln 365: "this cluster" is imprecise. Which cluster are we looking at here? Also: were flame cells already described morphologically at this stage, or is this the first description of the protonephridial system for this stage of the life cycle?

      We have now clarified which cluster we are talking about in the text. The flame cells have been described using TEM before (Pan, 1980).

      Stem Cells: also here you refer to cells as 'bridge' which refers to the configuration of the UMAP. While this is likely a biological representation of a different differentiation state, the nomination of this based solely on the UMAP representation should be avoided.

      We have rephrased this.

      Figure 5B: What is neuron 6? This was Neuron 3 in Figure 1.

      Thank you for spotting these mistakes in the labelling, we have corrected them now.

      Ln 421:438 - Here you represent a UMAP representation of the cell positions, but state that the clustering disappears. See comment in Public Review.

      Modified accordingly, see response in public review.

      Ln 472 "Cells in stem E, F, and G in silico clusters might be stressed/damaged/dying cells or cells in transcriptionally transitional states." Is there any evidence supporting either of these conclusions?

      We found that 15.56% of the cells in Stem F expressed the neural marker complexin, leading us to consider the possibility that a fraction of these cells may be neural precursors. Stem F also had some cells with a mitochondrial % near the maximum threshold we set, suggesting they could be experiencing some stress. Since we could not identify clear markers for these clusters, their function and a more specific identity, beyond ‘stem’, is not yet known.

      That the two stem cell populations contribute to different parts of the next life cycle stage is interesting. The combined analysis suffers from the same issues as the previous analysis in terms of sample distribution; are the 'grey' sporocyst cells also contributing to the stem A/B (kappa) C/D (delta/phi) clusters? This is not possible to tell from the plot as the miracidia may simply be plotted on the top. A different representation of sample contribution to clusters is warranted.

      We have made an alternative visualisation here to demonstrate that the miracidia cells are not plotted on top of the sporocyst stem cells. Unfortunately this visual is hampered as there is not a straightforward way to split the panels. In the figure below, the left pane shows the miracidia cells, and the right pane shows the sporocyst cells. Below that, we have included the original figure for comparison. It can be clearly seen that there are three miracidia tegument cells in the sporocyst tegument cluster, and one sporocyst cell in the miracidia stem cells (Stem E), but the miracidia A/B and C/D stem cells are not plotted on top of any sporocyst cells.

      Author response image 6.

      Methods: Why is the multiplet rate estimate at >50% for the unsorted sample?

      We have added more detail on this: “The estimated doublet rate was calculated based on 10X loading guidelines and adjusted for our sample concentrations”.

      Reviewer #2 (Recommendations For The Authors):

      (1) The manuscript would benefit from a more careful consideration of what was already known based on previous literature, which would help the authors to better put their results in context. For example, previous work suggested that one of the sporocyst stem cell populations (phi) gives rise to tegument and other temporary larval structures; this appears not to be mentioned here. The model in Figure 7 suggests that two of the stem cell populations are gone at day 15 post-infection; the literature shows that those cells can still be detected at this stage (there are just far fewer of them).

      We have added the definition of Kappa, Delta and Phi as per Wang et al (2018) in the stem cell results p13 ln 428.

      We have amended Figure 7 to include further elements from the Wang et al (2018) paper that show that mother sporocyst stem cells classified as delta and phi are still detectable on day 15 post-infection in mother sporocysts.

      We intentionally didn’t put too much emphasis on fitting our data to the model of Wang et al (2018), because (a) it’s a different life cycle stage, (b) the single cell data the model was based on was from 35 stem cells and gathered using a different method, and (c) more recent data (Diaz, Attenborough et al. 2024) with 119 stem cells from sporocysts did not recover the same populations of stem cells. We therefore linked our data to previous literature where it was relevant but focused on being led by the data we gathered (>10,000 stem cells).

      (2) To add some detail to the public comment about the lack of clarity about sample sizes and biological replicates, and how this leads to questions about the robustness of the results, Figures 4 B and F show the expression pattern for the same parenchyma marker (Smp_318890) in two different samples. The patterns appear quite distinctive. In B, the cell bodies are so clearly labeled that the signal appears oversaturated. In F the cell bodies are barely apparent. Based on the single-cell clustering, it should be possible to distinguish between Parenchyma clusters 1 and 2 based on the levels of this transcript. Careful quantification of signal intensity from multiple samples across multiple experiments might enable the authors to detect such differences.

      The reason the expression patterns look different between panels 4Bii and 4F is that in 4Bii we have manually segmented the nuclei of the parenchymal cells in order to count them, whereas in the images in 4F there is no segmentation. We have now made this clearer in this legend, and also in the legends of Figures 2, 3, and 5. If there was any signal intensity difference between parenchyma 1 and 2 cells based on expression of the marker gene, Smp_318890, it was not obvious. We carried out 6 experiments for parenchyma markers, multiplexing the pan-parenchyma marker, Smp_318890, with markers for parenchyma 2, but we were unable to distinguish between the two populations.

      (3) The authors find that the "somatic" stem cells in miracidia seem to combine attributes of the previously defined delta and phi stem cells from sporocysts. Because the 3 classes of sporocyst stem cells were defined by expression of nanos-2 and fgfrA, using those probes in in-situ experiments could have helped them resolve whether or not the miracidial cells represent precursors that can adopt either fate or if the heterogeneity is already present in miracidia.

      In silico expression of the marker genes for the 3 classes of sporocyst stem cells didn’t support those three classes in the miracidia stem cells (See supplementary table 10). We further subclustered the delta/phi cells to see if we could recover separate delta and phi populations but we were unable to do so. We therefore did not pursue in situ experiments of these genes. We instead prioritised cluster-defining genes in the miracidia stem cell populations rather than cluster defining genes in the sporocyst (defined by Wang et al., 2018), but we still explored these in silico. For example, instead of using klf to define Kappa (Wang et al 2018), we used UPPA to validate the Kappa population as it showed similar expression to klf but higher expression levels and was specific to that population. However, like Wang et al 2018, we did use p53, which is a cluster marker of delta and phi in sporocysts, as it showed clear and high expression in our miracidia delta/phi population. We were guided by our data and our knowledge of the literature. More in depth single cell RNAseq is needed from the mother and daughter sporocyst stages to understand the heterogeneity and fates of these stem populations.

      (4) Scale bars should be included throughout the figures and the scale should be defined either on the figure or in the legend. Similarly, all the scales used for velocity and expression analysis should be defined.

      We have added scale bars to all figures and legends.

      The statements “Gene expression has been log-normalised and scaled using Seurat (v. 4.3.0)”, “Gene expression has been normalised (CPM) and log-transformed using scvelo (v. 0.2.4)”, or “Library size was normalised and gene expression values were log-normalised using SAM (v1.0.1) and Scanpy (v1.8.2)” have been added to all figures as appropriate.

      (5) The table entitled In situ hybridization probes (Supplementary Table 15) contains no probe sequences, so any interested reader wishing to use these probes would have to design their own. To ensure the reproducibility of the results presented here, the authors should provide the probe sequences they used.

      In Supplementary Table 15 we have added the Molecular Instruments Lot number of all the probes used. Anyone wanting to repeat the experiment can order the same probes from the company.

      (6) It is unclear how useful the supplemental figures showing the STRING enrichment analyses will be for readers. Unannotated Smp gene identifiers provide no way to help readers digest the information in these hairballs. It would probably be best to replace the Smp names with useful annotations based on their orthologs; if not, these figures could probably be dropped entirely. (Also, the bottom panel of Supplementary Figure 7 has the word "Lorem" embedded on one of the connecting nodes.)

      “Lorem” has been removed.

      Many of the genes in these analyses do not have short descriptions, therefore we have used Smp gene identifiers in the STRING analysis supplementary figures. These ‘Smp_’ numbers can be used to search WormBase Parasite, where a description can be found and the history of the gene ID traced. This latter function facilitates searching for these genes in the literature and consistency between versions as gene models are updated.

      Minor edits

      (1) Figures 4A-D aren't cited in the text until after 4E-F are. It seems like moving the section on protonephridial cells (line 364) before the section on tegumental cells (line 345) better reflects the order of the figures.

      Thank you for flagging this, we have updated the in-text citations of Figure 4.

      (2) In-text references to Sarfati et al, 2021 should be to Nanes Sarfati, as listed in the references. Poteaux et al 2023 is cited in the text, but not in the reference list.

      Both of these have been fixed.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The authors track the motion of multiple consortia of Multicellular Magnetotactic Bacteria moving through an artificial network of pores and report a discovery of a simple strategy for such consortia to move fast through the network: an optimum drift speed is attained for consortia that swim a distance comparable to the pore size in the time it takes to align with an external magnetic field. The authors rationalize their observations using dimensional analysis and numerical simulations. Finally, they argue that the proposed strategy could generalize to other species by demonstrating the positive correlation between the swimming speed and alignment time based on parameters derived from literature.

      Strengths:

      The underlying dimensional analysis and model convincingly rationalize the experimental observation of an optimal drift velocity: the optimum balances the competition between the trapping in pores at large magnetic fields and random pore exploration for weak magnetic fields.

      Weaknesses:

      The convex pore geometry studied here creates convex traps for cells, which I expect enhances their trapping. The more natural concave geometries, resulting from random packing of spheres, would create no such traps. In this case, whether a non-monotonic dependence of the drift velocity on the Scattering number would persist is unclear.

      We agree that convex walls increase the time that consortia remain trapped in pores at high magnetic fields. Since the non-monotonic behavior of the drift velocity with the Scattering number arises largely due to these long trapping times, we agree that experiments using concave pores are likely to show a peak drift velocity that is diminished or erased.

      However, we disagree that a random packing of spheres or similar particles provides an appropriate model for natural sediment, which is not composed exclusively of hard particles in a pure fluid. Pore geometry is also influenced by clogging. Biofilms growing within a network of convex pillars in two-dimensional microfluidic devices have been observed to connect neighboring pillars, thereby forming convex pores. Similar pore structures appear in simulations of biofilm growth between spherical particles in three dimensions. Moreover, the salt marsh sediment in which MMB live is more complex than simple sand grains, as cohesive organic particles are abundant. Experiments in microfluidic channels show that cohesive particles clog narrow passageways and form pores similar to those analyzed here. Thus, we expect convex pores to be present and even common in natural sediment where clogging plays a role.

      The concentration of convex pores in the experiments presented here is almost certainly much higher than in nature. Nonetheless, since magnetotactic bacteria continuously swim through the pore space, they are likely to regularly encounter such convexities. Efficient navigation of the pore space thus requires that magnetotactic bacteria be able to escape these traps. In the original version of this manuscript, this reasoning was reduced to only one or two sentences. That was a mistake, and we thank the reviewer for prompting us to expand on this point. As the reviewer notes, this reasoning is central to the analysis and should have been featured more prominently. In the final version, we will devote considerable space to this hypothesis and provide references to support the claims made above.

      The reviewer suggests that the generality of this work depends on our finding a “positive correlation between the swimming speed and alignment [rate] based on parameters derived from literature.” We wish to emphasize that, in addition to predicting this correlation, our theory also predicts the function that describes it. The black line in Figure 3 is not fitted to the parameters found in the literature review; it is a pure prediction.

      Reviewer #2 (Public review):

      The authors have made microfluidic arrays of pores and obstacles with a complex shape and studied the swimming of multicellular magnetotactic bacteria through this system. They provide a comprehensive discussion of the relevant parameters of this system and identify one dimensionless parameter, which they call the scattering number and which depends on the swimming speed and magnetic moment of the bacteria as well as the magnetic field and the size of the pores, as the most relevant. They measure the effective speed through the array of pores and obstacles as a function of that parameter, both in their microfluidic experiments and in simulations, and find an optimal scattering number, which they estimate to reflect the parameters of the studied multicellular bacteria in their natural environment. They finally use this knowledge to compare different species to test the generality of this idea.

      Strengths:

      This is a beautiful experimental approach and the observation of an optimal scattering number (likely reflecting an optimal magnetic moment) is very convincing. The results here improve on similar previous work in two respects: On the one hand, the tracking of bacteria does not have the limitations of previous work, and on the other hand, the effective motility is quantified. Both features are enabled by choices of the experimental system: the use of the multicellular bacteria, which are larger than the usual single-celled magnetotactic bacteria, and the design of the obstacle array, which allows the quantification of transition rates due to the regular organization as well as the controlled release of bacteria into this array through a clever mechanism.

      Weaknesses:

      Some of the reported results are not as new as the authors suggest, specifically trapping by obstacles and the detrimental effect of a strong magnetic field have been reported before as has the hypothesis that the magnetic moment may be optimized for swimming in a sediment environment where there is a competition of directed swimming and trapping. Other than that, some of the key experimental choices on which the strength of the approach is based also come at a price and impose some limitations, namely the use of a non-culturable organism and the regular, somewhat unrealistic artificial obstacle array.

      In the “Recommendations for the Authors,” this reviewer drew our attention to a manuscript that absolutely should have been prominently cited. As the reviewer notes, our manuscript meaningfully expands upon this work. We are pleased to learn that the phenomena discussed here are more general than we initially understood. It was an oversight not to have found this paper earlier. The final version will better contextualize our work and give due credit to the authors. We sincerely appreciate the reviewer for bringing this work to our attention.

      We disagree that the use of non-culturable organisms and our unrealistic array should be considered serious weaknesses. While any methodological choice comes with trade-offs, we believe these choices best advance our aims. First, the goal of our research, both within and beyond this manuscript, is to understand the phenotypes of magnetotactic bacteria in nature. While using pure cultures enables many useful techniques, phenotypic traits may drift as strains undergo domestication. We therefore prioritize studying environmental enrichments.

      Clearly, an array of obstacles does not fully represent natural heterogeneity. However, using regular pore shapes allows us to average over enough consortium-wall collisions to enable a parameter-free comparison between theory and experiment. Conducting an analysis like this with randomly arranged obstacles would require averaging over an ensemble of random environments, which is practically challenging given the experimental constraints. Since we find good agreement between theory and experiment in simple geometries, we are now in a position to justify extending our theory to more realistic geometries. Additionally, we note that a microfluidic device composed of a random arrangement of obstacles would also be a poor representation of environmental heterogeneity, as pore shape and network topology differ between two and three dimensions.

      Recommendations for the Authors: 

      Reviewer #1 (Recommendations for the authors):

      My main suggestion is for the authors to describe the limitations of their approach in the case of concave pores.

      As we noted in our public comments, this was a very useful comment to hear from you and one that has been repeated as we have spoken about these results to colleagues. Convexities here represent an experimentally simple way to force bacteria to backtrack through the maze, as they must through natural sediment. We have greatly expanded this discussion to clarify this reasoning (lines 84–105). We provide references to three types of physical processes that may lead to such traps. First, as in figure 1 of Kurz et al, biofilm (white) can fill the spaces between convex pillars to create convexities. Additionally, clogging by cohesive particles can make narrow passageways between convex particles impassable. An example of clogging is shown in figure 6 of Dressaire & Sauret 2017. Finally, air bubbles trapped in the sediment can create pore-scale dead ends that require bacteria to backtrack. The full references are provided in the main text.

      Small points:

      (1) How many trajectories were used to produce Figures 2 b and c?

      We have modified the caption to note that these data represent the measured transition rates of a total of 938 consortia at various Scattering numbers. Each consortium may pass between pores many times.

      (2) Can the authors describe in more detail how Equation (3) is derived? Why doesn’t it depend on the gap size between the pores?

      We have provided a derivation of this equation in Appendix 2 of the new version. This derivation shows that the drift velocity U<sub>drift</sub> is proportional to the pore diameter and difference between the transition rates.

      The proportionality constant α depends on how the pores are connected together in space. In the original version, we wanted to highlight the role of the asymmetry of the transition rates, so we imagined a one-dimensional network of pores without gaps. In this case, α = 1. This reasoning was poorly explained in the previous version and we thank the reviewer for pointing this deficiency out. In the new version, we include the gap size and use the layout of pores in a square lattice with gaps, which is shown in figure 1. The proportionality constant for a square lattice in the absence of gaps would be 1/√2. The limitations of photolithography require some gap, which increases the proportionality constant to α = 0.8344.

      We have updated the text, equation (3), and the figures to account for the finite gap sizes.
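
      The proportionality described in this response can be illustrated with a short numerical sketch. It assumes that equation (3) takes the form U<sub>drift</sub> = α d (k<sub>+</sub> − k<sub>−</sub>), with d the pore diameter and k<sub>+</sub>, k<sub>−</sub> the two transition rates. The function name, the argument names, the symbol k<sub>+</sub>, and the example values are illustrative assumptions and are not taken from the manuscript.

```python
def drift_velocity(pore_diameter, k_plus, k_minus, alpha=0.8344):
    """Drift velocity computed as alpha * d * (k_plus - k_minus).

    alpha = 0.8344 is the value quoted for the square lattice with gaps;
    alpha = 1 corresponds to the one-dimensional network without gaps.
    Argument names and units are assumptions made for this sketch.
    """
    if pore_diameter <= 0:
        raise ValueError("pore_diameter must be positive")
    return alpha * pore_diameter * (k_plus - k_minus)


# Hypothetical inputs: pore diameter in micrometres, rates in 1/s.
u_example = drift_velocity(100.0, k_plus=0.05, k_minus=0.01)
```

      In this form the drift vanishes when the two transition rates are equal, which is the asymmetry argument made above.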

      (3) I found the second part of the abstract, related to the comparison between diverse bacteria, to be slightly misleading. Upon first reading, my expectation was that the authors carried out experiments with different species.

      We have modified the abstract to make clear that we rely on values taken from a literature review.

      (4) More information is needed on how many trajectories were used to produce the probability densities in Figures 1b-d. How were the densities computed?

      The probability distributions give the probability that a pixel in a pore is covered by a consortium. They reflect between 1.2 and 7 million measurements (depending on the panel) of the instantaneous positions of consortia. We have added a section (Lines 453–469) to Materials and Methods that describes exactly how these distributions were calculated.
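
      One way such a coverage probability could be estimated is sketched below. This is not the procedure from the Materials and Methods section, which is not reproduced here; it assumes that each measurement is a consortium position with a radius, that a consortium is modelled as a disc, and that the map is normalised by the number of measurements. All names are hypothetical.

```python
def coverage_probability(measurements, width, height):
    """Estimate, for each pixel, the fraction of measurements in which a
    consortium (modelled here as a disc) covers that pixel.

    measurements: list of (x, y, radius) tuples in pixel units.
    Returns a nested list with `height` rows and `width` columns.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    counts = [[0] * width for _ in range(height)]
    for x, y, r in measurements:
        # Only visit pixels inside the bounding box of the disc.
        row_lo = max(0, int(y - r))
        row_hi = min(height - 1, int(y + r) + 1)
        col_lo = max(0, int(x - r))
        col_hi = min(width - 1, int(x + r) + 1)
        for row in range(row_lo, row_hi + 1):
            for col in range(col_lo, col_hi + 1):
                if (col - x) ** 2 + (row - y) ** 2 <= r * r:
                    counts[row][col] += 1
    n = len(measurements)
    return [[c / n for c in row] for row in counts]
```

      With millions of instantaneous positions, as reported above, an array-based implementation would be preferable; the nested loops are kept here only for readability.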

      Reviewer #2 (Recommendations for the authors):

      (1) As mentioned under Weaknesses in the Public review, some results are less new than claimed here. The existence of an optimal magnetic moment has been shown by Codutti et al. (eLife 13:RP98001) in very similar experiments, where it was also proposed that this may be an evolutionary adaptation to the sediment habitat. The paper here provides additional evidence for this, and with better tracking and quantification, but previous work should be discussed. Likewise, the work by Dekharghani et al. that is mentioned rather suddenly in the Results section appears to be a crucial previous state of the art and could already be mentioned in the introduction.

      We thank the reviewer for bringing this paper, which came out as we were writing this manuscript, to our attention. The hypothesis that there is an optimal phenotype that balances magnetotaxis with obstacle avoidance—and that natural selection could guide organisms to this optimum—goes back to at least 2022. It seems that Codutti et al independently came up with this same hypothesis and provided the first test.

      We have substantively rewritten the introduction (Lines 46–58) to better contextualize our work and give due attention to Dekharghani et al.

      (2) The first paragraph of Results also contains background information and could be moved into the introduction.

      As part of the rewrite to better contextualize our work, we moved the first two paragraphs of results to the introduction.

      (3) I found Figure 1 a bit confusing and it took me some time to understand the geometry. I think the black obstacles are very dominant to the viewer’s eye and draw attention away from the essentially circular shape of the pores. Likewise, I am not sure that cutting the neighboring pores off in a circular fashion in Figures 1b-d was the best choice. The authors should think about whether the presentation can be improved. Likewise, when describing the direction of the field in the text, I would suggest adding that it is along the horizontal direction in Figure 1.

      We have modified the figure and the text as the reviewer suggests.

      (4) That collisions with a pore wall are an important mechanism of changing direction is clear and it is nice to see the paper demonstrate that this mechanism is dominant over rotational diffusion. However, this may not be universal, as (i) rotational diffusion is more important for smaller cells and (ii) interaction with walls can result in all kinds of different behaviors than complete randomization (e.g. swimming along the walls as shown in microfluidic chambers, Ostapenko et al. Phys Rev Lett 2018, Codutti et al. eLife 2022, or reversals, Kuhn et al PNAS 2017). Here, it appears that complete randomization of the direction is an assumption, but this could be tested/quantified by analyzing the trajectories.

      This is an excellent point. We have modified the text to describe qualitatively how these tendencies would shift the Critical Scattering number. We also note in the text that there is evidence of these differences in Fig 3. The Desulfobacterota are shifted upwards in Fig 3 relative to the α-proteobacteria. This shift indicates that Desulfobacterota tend to live at slightly greater scattering numbers of 0.9 ± 0.3 than the α-proteobacteria, which live at scattering number 0.37 ± 0.03. It is likely that this difference reflects taxonomic differences in rotational diffusion and cell-wall interactions.

      Total randomization of the direction is indeed an assumption, and it is stated as such in line 189. We performed all of the numerics to find the solid curves in Fig 2 before we got any experimental data and so, at the time, total randomization seemed like a fair choice. Looking at Fig 2b, it is clear that these numerics systematically overestimate k<sub>−</sub>. We believe that this error is due to the assumption of total randomization.

      As this effect is small and does not change any of the conclusions of the paper and Codutti et al were able to publish their paper in the time that we were writing ours, we feel some urgency to move forward.

      (5) From the manuscript it is not fully clear to what extent experiments and simulations are or can be quantitatively compared. For example: is the curve (“fit”) in Figure 2c based on the simulations? Is there an explicit expression or is this just a spline or something like that? Why does Figure 5 (simulation) show the velocity as a function of Sc<sup>−1</sup>and Figure 2 (experiment) as a function of Sc? It looks to me as if a quantitative comparison could be achieved.

      The original version of Figure 2 shows a quantitative comparison between theory and experiment with no fit parameters. The data points are the result of experiments in which consortia are tracked as they move between connected pores. The solid line is found by interpolating a smooth curve through the data from simulations. As we make clear in the new version (Lines 537–551), this blue curve is the most probable smooth curve that explains the simulations.

      We have added the simulations to figure 2 so that a single panel includes the data, the simulations, and the smooth curve. To further make clear that this comparison is quantitative and parameter free, we have added a panel to Figure 2. This panel directly compares the prediction to observation and is independent of the blue curve.

      As was noted (deep within the methods section) in the original version, our numerics can exactly simulate Sc = ∞. Consequently, it was reasonable to simulate parameters that are uniformly spaced in Sc<sup>−1</sup>.
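
      The choice of sampling can be illustrated with a small sketch: a grid that is uniform in Sc<sup>−1</sup> can start at Sc<sup>−1</sup> = 0, which corresponds to Sc = ∞, whereas a grid uniform in Sc cannot include that limit. The function name, the upper bound, and the number of points below are hypothetical and are not the values used in the simulations.

```python
import math


def scattering_grid(inv_sc_max, n_points):
    """Return Sc values whose inverses are uniformly spaced on
    [0, inv_sc_max]; the first entry is Sc = infinity (1/Sc = 0)."""
    if n_points < 2 or inv_sc_max <= 0:
        raise ValueError("need n_points >= 2 and inv_sc_max > 0")
    step = inv_sc_max / (n_points - 1)
    inverses = [i * step for i in range(n_points)]
    return [math.inf if inv == 0 else 1.0 / inv for inv in inverses]


# Hypothetical grid: five points with 1/Sc between 0 and 2.
grid = scattering_grid(2.0, 5)
```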

      (6) While I like the idea behind Figure 3, the data shown here is not as convincing as suggested. If one looks at the data without the black line, I think one gets a weaker dependence. The correlation between U<sub>0</sub> and γ<sub>geo</sub> is likely not as strong as it seems. Calculating a correlation coefficient might be helpful here. In any case, the assumptions going into this figure should be discussed more explicitly and the results should in my opinion be phrased more cautiously (I tend to believe what the authors claim, but I don’t think the evidence for this point is very strong).

      We appreciate the reviewer’s skepticism. However, we believe that the data are stronger than one might understand from the previous text. We have rewritten the text (Lines 219–291) and included new analysis, figures, and explanation to make three points clear.

      (a) It is surprising that speed, magnetic moment, and mobility all vary tremendously (between one and three orders of magnitude) across taxa and environment, yet their dimensionless combination Sc is narrowly distributed. We have added a panel to Fig. 3 to show the measured Scattering numbers.

      It is notable that there are no adjusted parameters in the calculation of the Scattering numbers: it is a simple dimensionless combination of phenotypic and environmental parameters. All but one of these parameters (the pore size) is measured either by us or by other authors. The pore radius is likely narrowly distributed. We measure it at our field site and, when it is not reported, we use a value typical of the geological and fluvial environment. Just as the size of sand grains does not vary greatly between the beaches of Australia, Africa, and California, it is a good assumption that the pore spaces that host these magnetotactic bacteria do not vary tremendously in size.

      (b) In the new version we compare the Scattering number statistics to a parameter-free null model of phenotypic diversity. We argue in the text that it is appropriate to bootstrap over the phenotypic diversity of species. This null model provides the correct method to calculate p-values as the variability in the Scattering numbers is neither identically distributed nor normally distributed.

      We use this null model to show that—given the measured phenotypic diversity across species—the probability that fifteen random species would fall within the measured range of Scattering numbers that is consistent with optimal navigation is ∼ 10<sup>−6</sup>. This result is strong evidence that the phenotypic variables exhibit the correlations that are predicted by our analysis.
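
      One way such a bootstrap null model could be sketched is shown below. This is an illustration under stated assumptions, not the authors' analysis code: it assumes each phenotypic parameter is resampled independently from the pooled measurements, and the function `combine`, which maps one set of drawn parameters to a dimensionless number, is a hypothetical placeholder because the definition of Sc is not reproduced in this response.

```python
import random

def null_probability(traits, combine, lo, hi,
                     n_species=15, n_trials=100_000, seed=0):
    """Estimate the chance that n_species random species all land in [lo, hi].

    traits  : dict mapping a parameter name to the list of measured values
              (assumption: one pooled list per phenotypic parameter).
    combine : hypothetical user-supplied function turning one draw
              (dict name -> value) into a dimensionless number.
    """
    rng = random.Random(seed)
    names = list(traits)
    hits = 0
    for _ in range(n_trials):
        all_inside = True
        for _ in range(n_species):
            # Resample every parameter independently, which removes any
            # correlation between parameters (the null hypothesis).
            draw = {name: rng.choice(traits[name]) for name in names}
            if not (lo <= combine(draw) <= hi):
                all_inside = False
                break
        if all_inside:
            hits += 1
    return hits / n_trials
```

      A probability as small as the one quoted above would need far more trials than this default to resolve by direct sampling, so the sketch only conveys the logic of the comparison.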

      (c) The correlation between U<sub>0</sub>/r and γ<sub>geo</sub> is reasonably strong. I think that our choice of axes in Fig 3, which were chosen to fit the legend, makes the data look flatter than they actually are. Here are the same data plotted without the line and with tighter axes:

      Author response image 1.

      With the exception of the very first point and the very last point, the data appear to our eyes to be pretty correlated. This impression is borne out by a calculation of the correlation coefficient, which gives 0.77. The p-value is 4 × 10<sup>−4</sup>. We have included these values in the main text to clarify that this correlation is both statistically significant and of primary importance.
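
      For readers who want to repeat this kind of check on their own data, a correlation coefficient and an accompanying p-value can be computed as in the minimal sketch below. The response does not state how the p-value was obtained, so the permutation test used here is a substitute chosen for illustration, and the function names are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=10_000, seed=0):
    """Two-sided p-value: how often shuffled data correlate as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

      A permutation test makes no assumption of normality, which matters when the data span several orders of magnitude.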

      (7) There is a comment at the end of the discussion that the evolutionary hypothesis could be tested by transferring the magnetotaxis genes to nonmagnetotactic organisms. This would indeed be highly desirable, but this is very difficult as indicated by the successful efforts in that direction (which often are only moderately magnetic/magnetotactic), see Kolinko et al Nature Nanotech 2014, Dziuba et al Nature Nanotech 2024.

      Thank you for highlighting these references, which we have included. We agree that these experiments will be challenging. Our results make a prediction about the evolution of these strains, so it seems worth mentioning this fact. We feel that this manuscript is not the correct space for a detailed description of challenges that we will encounter should we pursue this direction of study.

      (8) A section on how the bacterial samples were obtained could be added in Methods.

      We have done so.

      Additional Changes

      (1) In the original version, we feared that the consortia in the microfluidic device are poorly representative of the natural population. Consequently, we used the values from previous experiments, which we performed using consortia taken from the same pond. Since submitting this manuscript we have undertaken new experiments that allowed us to measure the Scattering number of individual consortia. It turns out the effect is smaller than we worried. We have included these measurements in the new version. We find that even as the most common phenotypes vary over the course of time, the Scattering number remains constant. This result is additional evidence that there is strong selective pressure to optimally navigate.

      As a result of these additions, we have added an author, Julia Hernandez, who contributed to these experiments and analysis.

      (2) We have expanded the table of phenotypic variables in Appendix 1 to make it easier for other researchers to reproduce our analysis.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Hearing and balance rely on specialized ribbon synapses that transmit sensory stimuli between hair cells and afferent neurons. Synaptic adhesion molecules that form and regulate transsynaptic interactions between inner hair cells (IHCs) and spiral ganglion neurons (SGNs) are crucial for maintaining auditory synaptic integrity and, consequently, for auditory signaling. Synaptic adhesion molecules such as neurexin-3 and neuroligin-1 and -3 have recently been shown to play vital roles in establishing and maintaining these synaptic connections ( doi: 10.1242/dev.202723 and DOI: 10.1016/j.isci.2022.104803). However, the full set of molecules required for synapse assembly remains unclear.

      Karagulan et al. highlight the critical role of the synaptic adhesion molecule RTN4RL2 in the development and function of auditory afferent synapses between IHCs and SGNs, particularly regarding how RTN4RL2 may influence synaptic integrity and receptor localization. Their study shows that deletion of RTN4RL2 in mice leads to enlarged presynaptic ribbons and smaller postsynaptic densities (PSDs) in SGNs, indicating that RTN4RL2 is vital for synaptic structure. Additionally, the presence of "orphan" PSDs-those not directly associated with IHCs-in RTN4RL2 knockout mice suggests a developmental defect in which some SGN neurites fail to form appropriate synaptic contacts, highlighting potential issues in synaptic pruning or guidance. The study also observed a depolarized shift in the activation of CaV1.3 calcium channels in IHCs, indicating altered presynaptic functionality that may lead to impaired neurotransmitter release. Furthermore, postsynaptic SGNs exhibited a deficiency in GluA2/3 AMPA receptor subunits, despite normal Gria2 mRNA levels, pointing to a disruption in receptor localization that could compromise synaptic transmission. Auditory brainstem responses showed increased sound thresholds in RTN4RL2 knockout mice, indicating impaired hearing related to these synaptic dysfunctions.

      The findings reported here significantly enhance our understanding of synaptic organization in the auditory system, particularly concerning the molecular mechanisms underlying IHC-SGN connectivity. The implications are far-reaching, as they not only inform auditory neuroscience but also provide insights into potential therapeutic targets for hearing loss related to synaptic dysfunction.

      We would like to thank the reviewer for appreciating the work and the advice that helped us to further improve the manuscript. We have carefully addressed all concerns, please see our point-per-point response below and the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      Kargulyan et al. investigate the function of the transsynaptic adhesion molecule RTN4RL2 in the formation and function of ribbon synapses between type I spiral ganglion neurons (SGNs) and inner hair cells. For this purpose, they study constitutive RTN4RL2 knock-out mice. Using immunohistochemistry, they reveal defects in the recruitment of protein to ribbon synapses in the knockouts. Serial block phase EM reveals defects in SGN projections in mutants. Electrophysiological recordings suggest a small but statistically significant depolarized shift in the activation of Cav1.3 Ca<sup>2+</sup> channels. Auditory thresholds are also elevated in the mutant mice. The authors conclude that RTN4RL2 contributes to the formation and function of auditory afferent synapses to regulate auditory function.

      We would like to thank the reviewer for appreciating the work and the advice that helped us to further improve the manuscript. We have carefully addressed all concerns, please see our point-per-point response below and the revised manuscript.

      Strengths:

      The authors have excellent tools to analyze ribbon synapses.

      Weaknesses:

      However, there are several concerns that substantially reduce my enthusiasm for the study.

      (1) The analysis of the expression pattern of RTN4RL2 in Figure 1 is incomplete. The authors should show a developmental time course of expression up into maturity to correlate gene expression with major developmental milestones such as axon outgrowth, innervation, and refinement. This would allow the development of models supporting roles in axon outgrowth versus innervation or both.

      We agree that it would be valuable to show the developmental time course of RTN4RL2 expression. In response to the reviewer’s comment, we are providing RNAscope data from developmental ages E11.5, E12.5 and E16 in Figure 1. RTN4RL2 shows expression at E11.5/E12.5 both in the spiral ganglion and hair cell region, with first onset in the hair cells. We conclude that RTN4RL2 is expressed highest during fiber growth at embryonic stages and is downregulated during postnatal development maintaining low levels of expression during adulthood.

      (2) It would be important to improve the RNAscope data. Controls should be provided for Figure 1B to show that no signal is observed in hair cells from knockouts. The authors apparently already have the sections because they analyzed gene expression in SGNs of the knock-outs (Figure 1C).

      In Figure 1C gene expression in SGNs was assessed at p40, while the expression in hair cells is provided for p1 animals. Unfortunately, we do not have KO controls for p1 animals. However, as indicated in our manuscript, previously published RNA expression datasets do find RTN4RL2 expression in hair cells. Therefore, we think it is unlikely that our results are unspecific.

      (3) It is unclear from the immunolocalization data in Figure 1D if all type I SGNs express RTN4RL2. Quantification would be important to properly document the presence of RTN4RL2 in all or a subset of type I SGNs. If only a subset of SGNs express RTN4RL2, it could significantly affect the interpretation of the data. For example, SGNs selectively projecting to the pillar or modiolar side of hair cells could be affected. These synapses significantly differ in their properties.

      According to the already published single-cell RNAseq dataset of Shrestha et al., 2018, RTN4RL2 expression does not seem to show a clear type I SGN subtype specificity (Author response image 1). In response to the reviewer’s comment, we have further performed anti-Parvalbumin (PV) and anti-calretinin (CR) immunostainings in mid-modiolar cryosections of RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> cochleae. Parvalbumin was chosen to label all SGNs and calretinin (CALB2) was chosen primarily as a type Ia SGN marker (Sun et al., 2018). We present the data from all analyzed samples below (Author response image 2). Cell segmentation masks of PV-positive cells were obtained using Cellpose 2.0 and the average CR intensity was calculated in those masks. While the distributions of CR intensity and the ratio of CR and PV intensities are slightly shifted in RTN4RL2<sup>-/-</sup> cochleae, we take the data to suggest that the composition of the spiral ganglion by molecular type I SGN subtypes is largely unchanged in RTN4RL2<sup>-/-</sup> mice.

      Author response image 1.

      Author response image 1 cites single cell RNAseq data of Brikha R Shrestha, Chester Chia, Lorna Wu, Sharon G Kujawa, M Charles Liberman, Lisa V Goodrich. Sensory neuron diversity in the inner ear is shaped by activity. Cell. 2018 Aug 23; 174(5):1229-1246.e17. doi: 10.1016/j.cell.2018.07.007

      Author response image 2.

      Calretinin intensity distribution in spiral ganglion of RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> mice. (A) Mid-modiolar cochlear cryosections from RTN4RL2<sup>+/+</sup> (top) and RTN4RL2<sup>-/-</sup> (bottom) mice immunolabeled against Parvalbumin (PV) and Calretinin (CR). Scale bar = 20 µm. (B) Distribution of CR intensity in PV positive cells (N = 3 for each genotype). (C) Distribution of the ratio of CR and PV intensities (N = 3 for each genotype).

      (4) It is important to show proper controls for the RTN4RL2 immunolocalization data to show that no staining is observed in knockouts.

      Unfortunately, our recent attempts to perform RTN4RL2 immunostainings on cryosections failed and therefore, we decided to remove the RTN4RL2 immunostainings from Figure 1. We have adjusted the results section accordingly.

      (5) The authors state in the discussion that no staining for RTN4RL2 was observed at synaptic sites. This is surprising. Did the authors stain multiple ages? Was there perhaps transient expression during development? Or in axons indicative of a role in outgrowth, not synapse formation?

      We thank the reviewer for the comment. We have now tried RTN4RL2 immunostainings on cryosections at several developmental stages, but unfortunately this time did not succeed to obtain reproducible and reliable results. Therefore, we decided to also remove the previous immunostainings from Figure 1. We have adjusted the results section as well as removed our statement of not detecting RTN4RL2 near the synaptic regions from the discussion.

      (6) In Figure 2 it seems that images in mutants are brighter compared to wildtypes. Are exposure times equivalent? Is this a consistent result?

      Yes, the samples were prepared in parallel, imaged and analyzed in the same manner.

      No, we did not observe consistent differences in brightness and also did not find it in the exemplary images of figure 2.

      (7) The number of synaptic ribbons for wildtype in Figure 2 is at 10/IHCs, and in Figure 2 Supplementary Figure 2 at 20/IHCs (20 is more like what is normally reported in the literature). The value for mutant similarly drastically varies between the two figures. This is a significant concern, especially because most differences that are reported in synaptic parameters between wild-type and mutants are far below a 2-fold difference.

      The key message is that there is no difference in the numbers of ribbons and synapses between the genotypes for the cochlear apex (~10 ribbons/IHC, Figure 2 and Figure 2-figure supplement 2) and the mid- and base of the cochlea (more ribbons/IHC, Figure 2-figure supplement 2). Figure 2-figure supplement 3 (now Figure 3) shows that there is a massive reduction of postsynaptic GluA2, while both Figure 2 and Figure 2-figure supplement 2 indicate that the number of synapses is normal. These are two different data sets and, while we closely collaborated and also shared the Moser lab protocols and analysis routines, we agree that there is a difference in the absolute synapse count, which most likely reflects an observer difference and a different choice of tonotopic positions for analysis. In Figure 2 only the apical hair cells have been analyzed. Since establishing the immunofluorescence-based quantification of synapse number (Khimich et al., 2005), the Moser lab has reported tonotopic differences in synapse counts (the focus of Meyer et al., 2009, and also reported by others, e.g. Kujawa and Liberman, 2009): apical and basal IHCs have lower synapse numbers than mid-cochlear IHCs.

      (8) The authors report differences in ribbon volume between wild-type and mutant. Was there a difference between the modiolar/pillar region of hair cells? It is known that synaptic size varies across the modiolar-pillar axis. Maybe smaller synapses are preferentially lost?

      We thank the reviewer for the comment. Unfortunately, our already acquired datasets from 3-week-old mice did not allow us to check whether the previously described modiolar-pillar gradient of the ribbon size was collapsed in RTN4RL2<sup>-/-</sup> mice, because the morphology of the inner hair cells was not well preserved in our preparations. However, since the number of ribbons is not changed in the RTN4RL2 KO mice, we do not think that the increase in the ribbon size is due to the loss of small ribbons. In response to the reviewer’s comment, we have analyzed the modiolar-pillar gradient of the ribbon size in IHCs of the middle turn of the cochlea from a newly acquired dataset of 14-week-old mice. We took the fluorescence intensity of Ctbp2-positive puncta as a proxy for the ribbon size. In these older mice we found a preserved modiolar-pillar gradient of the ribbon size (larger ribbons at the modiolar side). We summarize the results in Author response image 3 below.

      Author response image 3.

      The modiolar-pillar gradient of ribbon size is preserved in RTN4RL2<sup>-/-</sup> IHCs. (A) Maximum intensity projections of approximately 2 IHCs stained against Vglut3 and Ctbp2 from 14-week-old RTN4RL2<sup>+/+</sup> (left) and RTN4RL2<sup>-/-</sup> (right) mice. Scale bar = 5 µm. (B) Synaptic ribbons on the modiolar side show higher fluorescence intensity than the ones on the pillar side of mid-cochlear IHCs in both RTN4RL2<sup>+/+</sup> (left, N=2) and RTN4RL2<sup>-/-</sup> (right, N=2) mice. (C) Average fluorescence intensity of modiolar ribbons per IHC is higher than the average fluorescence intensity of pillar ribbons (paired t-test, p < 0.001).

      (9) The authors show in Figure 2 - Supplement 3 that GluA2/3 staining is absent in the mutants. Are GluA4 receptors upregulated? Otherwise, synaptic transmission should be abolished, which would be a dramatic phenotype. Antibodies are available to analyze GluA4 expression, the experiment is thus feasible. Did the authors carry out recordings from SGNs?

      In response to the reviewer’s comment, we have performed GluA4 stainings in RTN4RL2<sup>-/-</sup> mice and did not detect any GluA4-positive signal in the mutants (new Figure 3-figure supplement 1). Unfortunately, our animal breeding license had expired at the time we received the reviews, which is why our results are from 14-week-old animals. To verify that the absence of GluA4 signal is not due to potential PSD loss in 14-week-old RTN4RL2<sup>-/-</sup> mice, we have additionally performed anti-Ctbp2, anti-Homer1 and anti-Vglut3 stainings in 14-week-old animals. Despite the reduced number, we still observed juxtaposing pre- and postsynaptic puncta. We assume that the reviewer asks for patch-clamp recordings from SGNs, which are, as we are confident the reviewer is aware, technically very challenging and beyond the scope of the present study but an important objective for future studies. In response to the reviewer’s comment, we have added a statement to the discussion pointing to these patch-clamp recordings from SGNs as an important objective for future studies.

      (10) The authors use SBEM to analyze SGN projections and synapses. The data suggest that a significant number of SGNs are not connected to IHCs. A reconstruction in Figure 3 shows hair cells and axons. It is not clear how the outline of hair cells was derived, but this should be indicated. Also, is this a defect in the formation of synapses and subsequent retraction of SGN projections? Or could RTN4RL2 mutants have a defect in axonal outgrowth and guidance that secondarily affects synapses? To address this question, it would be useful to sparsely label SGNs in mutants, for example with AAV vectors expression GFP, and to trace the axons during development. This would allow us to distinguish between models of RTN4RL2 function. As it stands, it is not clear that RTN4RL2 acts directly at synapses.

      We agree with the reviewer on the value of a developmental study of afferent connectivity but consider this beyond the scope of the present study. In response to the reviewer's comment, we have replaced the IHC outlines with volume-reconstructed IHCs in Figure 3B (now Figure 4B). Moreover, as shown in Figure 3F (now Figure 4F), most if not all type-I SGNs (both with and without ribbon) were unbranched in the mutants just like in wildtype (also shown for a larger sample in Hua et al., 2021), arguing against morphological abnormality during development.

      (11) The authors observe a tiny shift in the operation range of Ca<sup>2+</sup> channels that has no effect on synaptic vesicle exocytosis. It seems very unlikely that this difference can explain the auditory phenotype of the mutant mice.

      We assume that the statement refers to the normal exocytosis of mutant IHCs at the potential of maximal Ca<sup>2+</sup> influx (Figure 3G and H, now Figure 4G and H). We would like to note that this experiment was performed to probe for a deficit of synapse function beyond that of the Ca<sup>2+</sup> channel activation, but did not address the impact of the altered voltage-dependence of Ca<sup>2+</sup> channel activation. In response to the reviewer’s comment, we have now added further discussion to more clearly communicate that for the range of receptor potentials achieved near sound threshold we expect impaired IHC exocytosis as the Ca<sup>2+</sup> channels require slightly more depolarization for activation in the mutant IHCs.

      (12) ABR recordings were conducted in whole-body knockouts. Effects on auditory thresholds could be a secondary consequence of perturbation along the auditory pathway. Conditional knockouts or precisely designed rescue experiments would go a long way to support the authors' hypothesis. I realize that this is a big ask and floxed mice might not be available to conduct the study.

      Thanks for this helpful comment and, indeed, unfortunately, we do not have conditional KO mice at our disposal. We totally agree that this will be important also for clarifying the role of IHC vs. SGN expression of RTN4RL2. In response to the reviewer’s comment, we now discussed the shortcoming of using constitutive RTN4RL2<sup>-/-</sup> mice and added this important experiment on IHC and SGN specific deletion of RTN4RL2 as an objective of future studies.

      Reviewer #3 (Public review):

      In this study, the authors used RNAscope and immunostaining to confirm the expression of RTN4RL2 RNA and protein in hair cells and spiral ganglia. Through RTN4RL2 gene knockout mice, they demonstrated that the absence of RTN4RL2 leads to an increase in the size of presynaptic ribbons and a depolarized shift in the activation of calcium channels in inner hair cells. Additionally, they observed a reduction in GluA2/3 AMPA receptors in postsynaptic neurons and identified additional "orphan PSDs" not paired with presynaptic ribbons. These synaptic alterations ultimately resulted in an increased hearing threshold in mice, confirming that the RTN4RL2 gene is essential for normal hearing. These data are intriguing as they suggest that RTN4RL2 contributes to the proper formation and function of auditory afferent synapses and is critical for normal hearing. However, a thorough understanding of the known or postulated roles of RTN4Rl2 is lacking.

      We would like to thank the reviewer for appreciating the work and the advice that helped us to further improve the manuscript. We have carefully addressed all concerns, please see our point-per-point response below and the revised manuscript.

      While the conclusions of this paper are generally well supported by the data, several aspects of the data analysis warrant further clarification and expansion.

      (1) A quantitative assessment is necessary in Figure 1 when discussing RNA and protein expression. It would be beneficial to show that expression levels are quantitatively reduced in KO mice compared to wild-type mice. This suggestion also applies to Figure 2-supplement 3.D, which examines expression levels.

      The processing of our control and KO samples for RNAscope was not strictly done in parallel and therefore we would like to refrain from quantitative comparison.

      (2) In Figure 2, the authors present a morphological analysis of synapses and discuss the presence of "orphan PSDs." I agree that Homer1 not juxtaposed with Ctbp2 is increased in KO mice compared to the control group. However, in quantifying this, they opted to measure the number of Homer1 juxtaposed with Ctbp2 rather than directly quantifying the number of Homer1 not juxtaposed with Ctbp2. Quantifying the number of Homer1 not juxtaposed with Ctbp2 would more clearly represent "orphan PSDs" and provide stronger support for the discussion surrounding their presence.

      We appreciate the reviewer’s comment. We did not perform this analysis primarily because “orphan” Homer1 puncta, as seen in our immunostainings, are distributed away from hair cells in diverse morphologies and sizes. This makes distinguishing them from unspecific immunofluorescent spots—also present in wild-type samples—challenging. In response to the reviewer’s request, we analyzed the number of “orphan” Homer1 puncta in our previously acquired RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> samples. Using the surface algorithm in Imaris software, we applied identical parameters across all samples to create surfaces for Homer1-positive puncta (total Homer1 puncta). We quantified “orphan” Homer1 puncta as the difference between total and ribbon-juxtaposing Homer1 puncta and normalized this number to the IHC count. Our results showed 4.3 vs. 26.8 “orphan” Homer1 puncta per IHC in RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> samples, respectively. We note that variations in acquired volumes between samples may introduce confounding effects.
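
      The normalization described above reduces to simple arithmetic, sketched here for clarity. The function and argument names are hypothetical, and the counts would come from the Imaris surface detection, which is not reproduced here.

```python
def orphan_puncta_per_ihc(total_homer1, ribbon_juxtaposed_homer1, n_ihc):
    """Orphan Homer1 puncta per IHC: (total - ribbon-juxtaposed) / IHC count."""
    if n_ihc <= 0:
        raise ValueError("n_ihc must be positive")
    if ribbon_juxtaposed_homer1 > total_homer1:
        raise ValueError("juxtaposed puncta cannot exceed total puncta")
    return (total_homer1 - ribbon_juxtaposed_homer1) / n_ihc
```

      Because the count is normalized only to the number of IHCs, a larger imaged volume can raise the orphan count on its own, which is the confounding effect noted above.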

      (3) In Figure 2, Supplementary 3, the authors discuss GluA2/3 puncta reduction and note that Gria2 RNA expression remains unchanged. However, there is an issue with the lack of quantification for Gria2 RNA expression. Additionally, it is noted that RNA expression was measured at P4. While the timing for GluA2/3 puncta assessment is not specified, if it was assessed at 3 weeks old as in Figure 2's synaptic puncta analysis, it would be inappropriate to link Gria2 RNA expression with GluA2/3 protein expression at P4. If RNA and protein expression were assessed at P4, please indicate this timing for clarity.

      GluA2/3 immunostainings were performed in 1 to 1.5-month-old animals. We apologize for not indicating this before and have now included it in Figure 3 legend. The processing of our control and KO samples for RNAscope was not strictly done in parallel and therefore we would like to refrain from quantitative comparison.

      (4) In Figure 3, the authors indicate that RTN4RL2 deficiency reduces the number of type 1 SGNs connected to ribbons. Given that the number of ribbons remains unchanged (Figure 2), it is important to clearly explain the implications of this finding. It is already known that each type I SGN forms a single synaptic contact with a single IHC. The fact that the number of ribbons remains constant while additional "orphan PSDs" are present suggests that the overall number of SGNs might need to increase to account for these findings. An explanation addressing this would be helpful.

      In Figure 3 (now Figure 4), we found additional type-1 SGNs that are unconnected to IHC, in good agreement with “orphan PSDs” observed under the light microscope. Indeed, we also confirmed monosynaptic, unbranched fiber morphology (Figure 3F, now Figure 4F). Together, these results imply about a 20% increase in the overall number of SGNs, which however we did not observe in SGN soma counting.

      (5) In Figure 4F and 5Cii, could you clarify how voltage sensitivity (k) was calculated? Additionally, please provide an explanation for the values presented in millivolts (mV).

      Voltage sensitivity (k) was calculated as the slope of the Boltzmann fit to the fractional activation curves: G/G<sub>max</sub> = 1 / (1 + exp((V<sub>half</sub> − V<sub>m</sub>)/k)), where G is conductance, G<sub>max</sub> is the maximum conductance, V<sub>m</sub> is the membrane potential, V<sub>half</sub> is the voltage corresponding to the half maximal activation of Ca<sup>2+</sup> channels and k (slope of the curve) is the voltage sensitivity of Ca<sup>2+</sup> channel activation. We have now added this to our Materials and Methods section.
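
      To illustrate why k is reported in millivolts, the sketch below assumes the standard Boltzmann form G/G<sub>max</sub> = 1 / (1 + exp((V<sub>half</sub> − V<sub>m</sub>)/k)). The exponent must be dimensionless, so k carries the same unit as the voltages. The fit shown is a simple linearized least-squares estimate chosen to keep the example self-contained; it is not the fitting routine used in the study, and the function names are hypothetical.

```python
import math

def boltzmann(v_m, v_half, k):
    """Fractional activation G/Gmax at membrane potential v_m (all in mV)."""
    return 1.0 / (1.0 + math.exp((v_half - v_m) / k))

def fit_boltzmann_linearized(v_m, fraction):
    """Estimate (v_half, k) from fractional activation data.

    Uses ln(1/f - 1) = (v_half - v_m) / k, which is linear in v_m with
    slope -1/k and intercept v_half/k. Points with f <= 0 or f >= 1 are
    skipped because the logarithm is undefined there.
    """
    xs, ys = [], []
    for v, f in zip(v_m, fraction):
        if 0.0 < f < 1.0:
            xs.append(v)
            ys.append(math.log(1.0 / f - 1.0))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two usable data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    k = -1.0 / slope          # mV per e-fold change
    v_half = intercept * k    # mV
    return v_half, k
```

      A smaller k means a steeper activation curve, that is, a channel population that opens over a narrower voltage range. For noisy recordings a nonlinear least-squares fit is usually preferable to this linearization, which overweights points near 0 and 1.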

      (6) In Figure 6, the author measured the threshold of ABR at 2-4 months old. Since previous figures confirming synaptic morphology and function were all conducted on 3-week-old mice, it would be better to measure ABR at 3 weeks of age if possible.

      ABR comparisons in a cohort of age-matched mice require fully developed individuals. Three weeks is the minimum age at which the ear is regarded as mature. However, developmental variation within a litter is frequent and affects normal hearing thresholds. From our own experience, we do not regard the ear as fully functional before 6 weeks of age, when hearing thresholds are lowest, indicating full functionality. Because the C57BL/6 background strain carries a genetic defect in the Cadherin 23-coding gene (Cdh23) at the ahl locus of mouse chromosome 10, these mice exhibit early onset and progression of age-related hearing loss starting at 5–8 months (Hunter & Willott, 1987). Therefore, we chose a “safe” time window of 2–4 months for stable and unaffected ABR recordings to provide the most representative data.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Please include information on the validation of all the antibodies used in this study, or reference the relevant work where the antibodies were previously validated.

      In response to the reviewer’s comment, we have now included a table listing all primary antibodies used in this study. Where possible, we provide references for knockout (KO) validation. Otherwise, we refer to the manufacturer’s information, as provided in the respective datasheets.

      (2) Figure 2 illustrates the pre- and postsynaptic changes observed in RTN4RL2 knockout (KO) mice. Please specify the age of the mice and the cochlear region depicted and analyzed in Figure 2.

      We thank the reviewer for the comment. The IHCs of apical cochlear region were analyzed in mice at 3 weeks of age. We have now added this to the figure legend.

      (3) The discovery of orphan SGN neurites in RTN4RL2 KO mice is particularly intriguing. I wonder whether the additional Homer1-positive puncta illustrated in Figure 2 are present in these orphan SGN neurites, which would suggest that they may be functional. Conducting immunohistochemistry (IHC) labeling for type I SGN neurites using an anti-Tuj1 antibody, along with Homer1, would help localize the additional Homer1 puncta shown in Figure 2. Additionally, the "extra" Homer1 puncta appears less striking in the data presented in Figure 2-Supplement 2. Quantifying the number of Homer1 puncta in wild-type versus KO mice across different cochlear regions will help visualize the Figure 2-Supplement 2 data and relate the presence of extra neurites to the increased auditory brainstem response (ABR) thresholds observed at all frequencies.

      We thank the reviewer for the comment and we agree that localizing orphan PSDs on the SGN neurites would be very useful. Unfortunately, the animal breeding license in the Göttingen lab had expired. At the time we received the reviews we only had access to 14-week-old animals and could not perform the stainings in animals of an age comparable to the rest of the study (3-4 weeks). The phenotype of extra Homer1 puncta was not as drastic in 14-week-old animals as it was in previously stained 3-week-old animals. Nevertheless, we still tried NF200, Homer1 and Vglut3 immunostainings in 14-week-old animals. We present representative single imaging planes of NF200, Homer1 and Vglut3 stainings in Author response image 4. Additionally, we provide exemplary images from 7-week-old RTN4RL2<sup>-/-</sup> mice, in which the orphan Homer1 puncta appear to be located on calretinin-positive neurites.

      Author response image 4.

      Attempts to localize “orphan” Homer1 patches on type I SGN neurites. (A) Single exemplary imaging planes of apical IHC region from RTN4RL2<sup>+/+</sup> (left) and RTN4RL2<sup>-/-</sup> (right) mice immunolabeled against NF200, Vglut3 and Homer1. White arrows show putative “orphan” Homer1 puncta on NF200 positive neurites. Scale bar = 5 µm. (B) Maximum intensity projections of representative confocal stacks of IHCs from RTN4RL2<sup>-/-</sup> mice immunolabeled against Calretinin and Homer1. Scale bars = 5 µm. White arrows show possible “orphan” Homer1 puncta on Calretinin positive boutons.

      (4) The authors noted a reduction in the number of GluA2/3-positive puncta in RTN4RL2 KOs, as shown in Figure 2-Supplement 3. However, in the Results section (page 5, line 124), it is unclear whether the authors refer to a reduction in fluorescence intensity or the number of puncta. Please clarify this.

      We thank the reviewer for the comment. We refer to the number and have now added this to the manuscript.

      (5) I find it particularly interesting that, despite the presence of smaller but synaptically engaged Homer1-positive SGN neurites, these appear to lack or present a reduction in the number of GluA2/3 puncta, and that GluA2/3 puncta are observed in non-ribbon juxtaposed neurites. Therefore, I suggest including the GluA2/3 data (Figure 2-Supplement 3) in the main figure. It would be valuable to determine whether the orphan neurites express both Homer1 and GluA2/3, which could indicate that the defect is not solely due to reduced GluA2/3 expression at the formed synapses, but also to the presence of additional orphan synapses. I would also mention in the discussion how the phenotype of the RTN4RL2 KO compares to the GluA2/3 KO and whether the lack of GluA2/3 at the AZ could explain the increase in ABR threshold. Quantification of GluA2/3 puncta at the apical, middle, and basal regions would also help in understanding the auditory phenotype of the KO mice.

      We have changed Figure 2-figure supplement 3 to become a main figure (Figure 3) based on the recommendation of the reviewer. We agree that it would be valuable to perform immunohistochemistry combining anti-GluA2/3, anti-Homer1 and anti-Ctbp2 antibodies to see whether the “orphan” Homer1 patches house GluA2/3 that does not juxtapose synaptic ribbons. Unfortunately, as mentioned above, due to the expiration of our animal breeding and experimentation licenses we were not able to do those experiments. We have, however, performed stainings with anti-GluA4 antibodies and could not detect GluA4 signal in RTN4RL2<sup>-/-</sup> mice (Figure 3-figure supplement 1). This could potentially explain the more drastic ABR threshold elevation in RTN4RL2<sup>-/-</sup> mice compared to, e.g., GluA3 KO mice. We have now made this clearer in our discussion.

      (6) I suggest considering the use of color-blind friendly palettes for figures and graphs in this manuscript to enhance clarity and ensure that the findings are accessible to a wider audience and improve the overall effectiveness of the presentation. Please use color-blind-friendly schemes in Figure 1 and Figure 2 Supplement 3.

      Done.

      (7) Could you please explain what "XX ± Y, SD = W" means in the figure legends?

      The legends report the mean ± SEM (standard error of the mean), followed by the SD (standard deviation). In response to the reviewer's comment we have now added an explanation in the "Data analysis and statistics" subsection of the Materials and Methods.
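      As an illustration of how the three numbers in "XX ± Y, SD = W" relate to one another, the following is a minimal sketch, assuming the sample standard deviation (n − 1 denominator) and SEM = SD / √n. The function name `summarize` and the example values are hypothetical and are not taken from the manuscript.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sd, sem) for a sample of measurements.

    Assumes the sample standard deviation (n - 1 denominator) and
    SEM = SD / sqrt(n); the manuscript's exact conventions may differ.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the SD")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)   # sample standard deviation
    sem = sd / math.sqrt(n)         # standard error of the mean
    return mean, sd, sem


# Hypothetical measurements, for illustration only.
example_values = [30.0, 35.0, 40.0, 35.0]
mean, sd, sem = summarize(example_values)
print(f"{mean:.2f} ± {sem:.2f}, SD = {sd:.2f}")
```

      The SEM shrinks as the number of observations grows, whereas the SD describes the spread of the individual measurements, which is why reporting both can be informative.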

      (8) Please include information about the ear tested (left or right or both).

      Both ears were tested. Since there was no significant difference between the right and left ears, we did not consider this factor further. We will state this explicitly in the Materials and Methods section.

      Reviewer #3 (Recommendations for the authors):

      (1) Line 90: Why not show this control? It is a nice control.

      Unfortunately, our recent attempts to perform RTN4RL2 immunostaining on cryosections were unsuccessful. Therefore, we decided to remove RTN4RL2 immunostaining from Figure 1 and have adjusted the results section accordingly.

      (2) Line 94: Please provide a reference for these interactions.

      Done.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The Hedgehog (HH) protein family is important for embryonic development and adult tissue maintenance. Deregulation or even temporal imbalances in the activity of one of the main players in the HH field, sonic hedgehog (SHH), can lead to a variety of human diseases, ranging from congenital brain disorders to diverse forms of cancers. SHH activates the GLI family of transcription factors, yet the mechanisms underlying GLI activation remain poorly understood. Modification and activation of one of the main SHH signalling mediators, GLI2, depends on its localization to the tip of the primary cilium. In a previous study the lab had provided evidence that SHH activates GLI2 by stimulating its phosphorylation on conserved sites through Unc-51-like kinase 3 (ULK3) and another ULK family member, STK36 (Han et al., 2019). Recently, another ULK family member, ULK4, was identified as a modulator of the SHH pathway (Mecklenburg et al. 2021). However, the underlying mechanisms by which ULK4 enhances SHH signalling remained unknown. To address this question, the authors employed complex biochemistry-based approaches and localization studies in cell culture to examine the mode of ULK4 activity in the primary cilium in response to SHH. The study by Zhou et al. demonstrates that ULK4, in conjunction with STK36, promotes GLI2 phosphorylation and thereby SHH pathway activation. Further experiments were conducted to investigate how ULK4 interacts with SHH pathway components in the primary cilium. The authors show that ULK4 interacts with a complex formed between STK36 and GLI2 and hypothesize that ULK4 functions as a scaffold to facilitate STK36 and GLI2 interaction and thereby GLI2 phosphorylation by STK36. Furthermore, the authors provide evidence that ULK4 and STK36 co-localize with GLI2 at the ciliary tip of NIH 3T3 cells, and that ULK4 and STK36 depend on each other for their ciliary tip accumulation. 
Overall, the described ULK4-mediated mechanism of SHH pathway modulation is based on detailed and rigorous Co-IP experiments and kinase assays as well as confocal imaging localization studies. The authors used various mutated and wild-type constructs of STK36 and ULK4 to decipher the mechanisms underlying GLI2 phosphorylation at the tip of the primary cilium. These novel results on SHH pathway activation add valuable insight into the complexity of SHH pathway regulation. The data also provide possible new strategies for interfering with SHH signalling which has implications in drug development (e.g., cancer drugs).

      However, it will be necessary to explore additional model systems, besides NIH3T3, HEK293 and MEF cell cultures, to conclude on the universality of the mechanisms described in this study. Ultimately, it needs to be addressed whether ULK4 modulates SHH pathway activity in vivo. Is there evidence that genetic ablation of ULK4 in animal models leads to less efficient SHH pathway induction? It also remains to be resolved how ULK3 and ULK4 act in distinct or common manners to promote SHH signalling. Another remaining question is, whether cell type- and tissue-specific features exist, that play a role in ULK3- versus ULK4-dependent SHH pathway modulation. In particular for the studies on ciliary tip localization of factors, relevant for SHH pathway transduction, a higher temporal resolution will be needed in the future as well as a deeper insight into tissue/ cell type-specific mechanisms. These caveats, mentioned here, don't have to be addressed in new experiments for the revision of this manuscript but could be discussed.

      We agree with the reviewer that it will be important to investigate in the future the in vivo function of Ulk4 in Shh signaling, the relationship between Ulk3 and Ulk4/Stk36, and the possible cell type/tissue specificity of these two kinase systems. This will require generating single and double knockout mice and examining Hh-related phenotypes in different tissues and at different developmental stages. The precise mechanism by which Ulk4 and Stk36 are translocated to the ciliary tip is also an important and unresolved issue. We have included several paragraphs in the “discussion” section to address these outstanding questions for future study.

      Reviewer #2 (Public Review):

      The authors provide solid molecular and cellular evidence that ULK4 and STK36 not only interact, but that STK36 is targeted (transported?) to the cilium by ULK4. Their data help generate a model for ULK4 acting as a scaffold for both STK36 and its substrate, Gli2, which appear to co-localise through mutual binding to ULK4. This makes sense, given the proposed role of most pseudokinases as non-catalytic signaling hubs. There is also an important mechanistic analysis performed, in which ULK4 phosphorylation in an acidic consensus by STK36 is demonstrated using IP'd STK36 or an inactive 'AA' mutant, which suggests this phosphorylation is direct.

      The major strength of the study is the well-executed combination of logical approaches taken, including expression of various deletion and mutation constructs and the careful (but not always quantified in immunoblot) effects of depleting and adding back various components in the context of both STK36 and ULK3, which broadens the potential impact of the work. The biochemical analysis of ULK4 phosphorylation appears to be solid, and the mutational study at a particular pair of phosphorylation sites upstream of an acidic residue (notably T2023) is further strong evidence of a functional interaction between ULK4/STK36. The possibility that ULK4 requires ATP binding for these mechanisms is not approached, though would provide significant insight: for example it would be useful to ask if Lys39 in ULK4 is involved in any of these processes, because this residue is likely important for shaping the ULK4 substrate-binding site as a consequence of ATP binding; this was originally shown in PMID 24107129 and discussed more recently in PMID: 33147475 in the context of the large amount of ULK4 proteomics data released.

      The reviewer raised an interesting question of whether ATP binding to the pseudokinase domain of Ulk4 might be required for its function, i.e., by regulating the interaction with its binding partner. In a recent study (Preuss et al., 2020; PMID: 33147475), the critical Lys39 for ATP binding was converted to Arg (KR mutation). In most kinases the KR mutation affects ATP binding; however, the K39R mutation in the Ulk4 pseudokinase did not affect ATP binding, although it slightly increased ADP binding (PMID: 33147475). Another mutation made by Preuss et al. (PMID: 33147475), N239L, affected protein stability, making it impossible to determine whether this mutation affects ATP binding. Therefore, in the absence of a clear approach to perturb ATP binding without affecting the overall structure of Ulk4, it would be challenging to address whether ATP binding regulates the ability of Ulk4 to bind its substrates. Nevertheless, we discuss the possibility that ATP binding might regulate Ulk4/Stk36 interaction and Shh signaling.

      The discussion is excellent, and raises numerous important directions for future work in terms of potential transportation mechanisms of this complex. It also explains why the ULK4 pseudokinase domain is linked to an extended C-terminal region. Does AF2 predict any structural motifs in this region that might support binding to Gli2?

      The extended C-terminal domain of Ulk4 contains Arm/HEAT repeats (protein-protein interaction domains), which are predicted by AF2 to form alpha helices.

      A weakness in the study, which is most evident in Figure 1, where Ulk4 siRNA is performed in the NIH3T3 model (and effects on Shh targets and Gli2 phosphorylation assessed), is that we do not know if ULK4 protein is originally present in these cells in order to actually be depleted. Also, we are not informed if the ULK4 siRNA has an effect on the 'rescue' by HA-ULK4; perhaps the HA-ULK4 plasmid is RNAi resistant, or if not, this explains why phosphorylation of Gli2 never reaches zero? Given the important findings of this study, it would be useful for the authors to comment on this, and perhaps discuss if they have tried to evaluate endogenous levels of ULK4 (and Stk36) in these cells using antibody-based approaches, ideally in the presence and absence of Shh. The authors note early on the large number of binding partners identified for ULK4, and siRNA may unwittingly deplete some other proteins that could also be involved in ULK4 transport/stability in their cellular model.

      Due to the lack of reliable Ulk4 and Stk36 antibodies, we were unable to confirm knockdown efficiency by western blot analysis. Therefore, we relied on measuring Ulk4 and Stk36 mRNA expression by RT-qPCR to estimate the knockdown efficiency (Fig 1- figure supplement 1). We used mouse Ulk4 shRNA to carry out the knockdown experiments in NIH3T3 and MEF cells, while the human version of Ulk4 (hUlk4) was used for the rescue experiments (Fig 1- figure supplement 2; Fig. 8). We have confirmed that the mUlk4 shRNA targeting sequence is not conserved in hUlk4; therefore, the hUlk4 construct is RNAi resistant. The rescue experiments strongly argue that the effect of Ulk4 RNAi on Shh signaling is due to loss of endogenous Ulk4. This argument is further strengthened by the observation that mutations that affected Ulk4 and Stk36 ciliary tip localization also affected Shh signaling readouts such as Gli2 phosphorylation and Ptch1/Gli expression (Fig. 8).

      The sequence of ULK4 siRNAs is not included in the materials and methods as far as I can see.

      We have added the mouse Ulk4 RNAi target sequence in the revised version.

      Reviewer #3 (Public Review):

      In this manuscript, Zhou et al. demonstrate that the pseudokinase ULK4 has an important role in Hedgehog signaling by scaffolding the active kinase Stk36 and the transcription factor Gli2, enabling Gli2 to be phosphorylated and activated.

      Through nice biochemistry experiments, they show convincingly that the N-terminal pseudokinase domain of ULK4 binds Stk36 and the C-terminal Heat repeats bind Gli2.

      Lastly, they show that upon Sonic Hedgehog signaling, ULK4 localizes to the cilia and is needed to localize Stk36 and Gli2 for proper activation.

      This manuscript is very solid and methodically shows the role of ULK4 and STK36 throughout the whole paper, with well controlled experiments. The phosphomimetic and incapable mutations are very convincing as well. I think this manuscript is strong and stands as is, and there is no need for additional experiments.

      Overall, the strengths are the rigor of the methods, and the convincing case they bring for the formation of the ULK4-Gli2-Stk36 complex. There are no weaknesses noted. I think a little additional context for what is being observed in the immunofluorescence might benefit readers who are not familiar with these cell types and structures.

      We thank this reviewer for the positive comments.

      Recommendations For the Authors

      Reviewer #1 (Recommendations For The Authors):

      This elegant study has been thoroughly and thoughtfully designed and the dataset is solid. The biochemistry results are overall very convincing. Some data lack quantification and there needs to be more information on data analyses and statistics. The following suggestions and comments aim at strengthening the manuscript.

      1. Please provide quantification normalized to input for IP experiments (Figures 1 E - F; Figure 8 C). More information on data analyses and statistics should be provided and included as information in the figure legends.

      Thanks for the suggestions. We have performed the quantification and statistical analyses for Figures 1E-G and Figure 8C as requested.

      1. Did the authors investigate whether overexpressing hULK4 in the control NIH3T3 cells leads to an increase in pS230/232 (related to Figure 1E)? This would nicely support the notion of a promoting effect of ULK4 on GLI2 phosphorylation.

      We did not. We speculated that overexpressing hULK4 may not significantly promote GLI2 phosphorylation because Ulk4 is a pseudokinase and endogenous Stk36 (the kinase partner of Ulk4) is limited.

      1. The Co-IP experiments to show GLI2 activation were performed in NIH3T3 cells, whereas HEK293 cells were used for the experiments shown in Figure 2. Is there a specific reason for switching between cell lines, also for the experiments shown in Figures 3 C-I? Did the authors repeat some of the key experiments in both cell lines?

      In mammalian cells, Shh-induced activation of GLI2 depends on primary cilia (Han et al., 2019). NIH3T3 cells form primary cilia but HEK293T cells do not. Therefore, we used NIH3T3 cells to examine processes that are regulated by Shh treatment (e.g., the Shh-induced phosphorylation of GLI2 and STK36). HEK293 cells were used to map the binding domains between ULK4 and STK36/GLI2/SUFU because of their high transfection efficiency.

      1. In Figure 2 D-E the authors nicely showed that hUlk4N-HA interacted with CFP-Stk36 but not with Myc-Gli2/Fg-Sufu whereas hUlk4C-HA formed a complex with Myc-Gli2/Fg-Sufu but not with CFP-Stk36. In Figure 4E the authors showed in their Co-IP experiments that Fg-Stk36 and Myc-Gli2 form a complex independent of SHH treatment. Did the authors see some pull down of Stk36, still in complex with Gli2, using hUlk4C IP and pull down of Gli2, still in complex with Stk36, using hUlk4N IP?

      We did not test that. As we have shown in Figures 4A and 4E, knockdown of endogenous Ulk4 nearly abolished the interaction between Myc-Gli2 and Fg-Stk36, suggesting that Ulk4 is the major scaffold that brings Stk36 and Gli2 together, and that there is little if any direct interaction between Gli2 and Stk36.

      1. Another method to verify hULK4-Stk36-Gli2 complex formation (Figure 4) would be helpful. For example, proximity ligation assays, tripartite split GFP assays, or colocalization based on expansion STED immunofluorescence microscopy could be performed to temporally and spatially resolve localization of Ulk4, Stk36 and Gli2 upon SHH stimulation in the primary cilium

      Thanks for the suggestions. We think that our current study, using biochemical and cell biology approaches, has provided sufficient evidence that Ulk4, Stk36 and Gli2 form complexes. We will keep these more sophisticated methods in mind for our future work.

      1. Please provide more representative images of Ulk4, Stk36 and Gli2 localization in NIH3T3 cells or lower magnification overview images showing more than one cell (Figure 5).

      We have provided more representative images in Figure 5- figure supplement 1A-F of the revised manuscript.

      1. Confirmation of the results shown in Figure 5 in a second cell line would strengthen the data.

      We have confirmed the results in MEFs (see Figure 5- figure supplement 1G-J).

      1. Did the authors add immunofluorescence for tubulin as a ciliary base marker to ensure correct assignment of ciliary tip versus ciliary base localization for quantification experiments (Figures 5 - 8)?

      It has been well documented that Gli2 accumulates at the ciliary tip in response to Shh treatment; therefore, we used Gli2 as a marker for the ciliary tip, where both Ulk4 and Stk36 also accumulated. γ-tubulin staining could be another marker to distinguish the ciliary tip from the base; however, the antibody combination we have did not allow us to simultaneously stain γ-tubulin and acetylated tubulin (Ac-Tub).

      1. SMO localization as a further readout of SHH pathway activation might be considered to be added for some of the key results (e.g., Figure 6). Is SMO trafficking affected after depletion or overexpression of ULK4?

      Due to the lack of a workable antibody to detect endogenous Smo in our hands, we did not determine whether the trafficking of SMO is affected by depletion or overexpression of ULK4. However, we note that a recent study reported that SHH-induced ciliary SMO accumulation was impaired in Ulk4 siRNA-treated cells (Mecklenburg et al. 2021). We have included this information and its implications in the discussion section.

      1. Do the authors see ULK4 only at the ciliary tip after SHH stimulation or is there also a dynamic time-dependent localization along the ciliary shaft? The image in Figure 6E (dKO + Stk36 WT) seems to show ULK4 also in the shaft.

      Unlike Smo, which is evenly distributed along the axoneme of primary cilia, ULK4 mainly accumulates at ciliary tips upon Shh stimulation. Ulk4 is also located at low levels outside the cilia and sometimes in the ciliary shaft during its transit to the ciliary tip (e.g., see Figure 5- figure supplement 1F1-2; J1-2).

      1. Is the immunofluorescence signal for Ulk4 significantly reduced after shRNA treatment to deplete Ulk4 (Figure 6A)?

      We constructed a cell line that stably expressed ULK4 shRNA. The knockdown efficiency was determined by measuring Ulk4 mRNA expression (Fig 1- figure supplement 1). Because we were unable to obtain a reliable ULK4 antibody for immunostaining, we did not examine whether the ULK4 signal was depleted by Ulk4 shRNA.

      1. The labelled ciliary tip resembles in some cases images seen for ciliary abscission. The authors could use membrane/ciliary membrane markers to ensure "intraciliary" localization of the investigated factors.

      Thanks for the suggestion. We will try that in our future experiments.

      1. How many replicates were used in the three independent quantitative RT-PCR experiments (Figure 1 A-D)?

      We used 3 replicates in each independent quantitative RT-PCR assay.

      1. Please provide p values or statement on no significance for the comparison between Ulk3 single and Ulk3/Ulk4 double knockdown (Figure 1C) and between Stk36 single and Stk36/Ulk4 double knockdown (Figure 1D; Fig1_Figure Supplement 2).

      Thanks for the suggestion. We have added the p values or “ns” as requested.

      1. Figure legends in general are a bit short and could include some more detailed information.

      Thank you for the suggestion. We have revised the figure legends as requested.

      1. What do the asterisks represent in Figure 4 C-D?

      Thanks for the question. The asterisks in Figure 4C-D indicate the full-length STK36 and the truncated STK36N and STK36C fragments. We have included this information in the figure legend.

      1. The authors state that a previous study described ULK4 as a genetic modifier for holoprosencephaly and that this raised the possibility that ULK4 may participate in HH signal transduction. Primary ciliary localization of ULK4 in mouse neuronal tissue and SHH pathway modulation by ULK4 in cell culture have been shown by Mecklenburg et al. 2021 before. Maybe the authors could rephrase their introduction and discussion accordingly.

      Thanks for the suggestion. We have changed the introduction and discussion accordingly.

      1. Overexpression studies in heterologous systems using tagged proteins can potentially have an influence on their subcellular localization and function. Please discuss this caveat.

      We have mentioned this caveat in the “discussion” section of the revised manuscript. However, we have tried to express the transgene at low levels using the lentiviral vector containing a weak promoter to ensure that the exogenously expressed proteins are still regulated by Hh signaling. We have also confirmed that the tagged Ulk4 and Stk36 can rescue the loss of endogenous genes.

      1. More details in the Methods section should be provided on the SHH induction in NIH3T3 cells, HEK293 cells and MEFs.

      We have revised the methods section on Shh induction.

      1. ULK4 is known to have at least three isoforms that exhibit varying abundance across developmental stages in mice and humans (Lang et al., 2014) (DOI:10.1242/jcs.137604). Can the authors speculate on potential common and distinct functions of the different ULK4 isoforms on SHH pathway modulation based on their present results?

      It is interesting that Ulk4 has multiple isoforms in both mouse and human. Several short isoforms in both mouse and human lack the pseudokinase domain while one short isoform in mouse lacks the C-terminal region essential for Ulk4 ciliary tip localization. We speculate that the C-terminally deleted isoform may not have a function in the Shh pathway based on our results shown in Fig. 7 and 8 but might still have functions in other cellular processes.

      Reviewer #2 (Recommendations For The Authors):

      The paper is well written, and clear throughout, with excellent (up-to-date) citations to the field.

      We thank reviewer #2 for the positive comments.

      Reviewer #3 (Recommendations For The Authors):

      My only quibble is that the immunofluorescence images are a little confusing, especially to people outside of the field. Please include an image of the whole field and improve the captions. Is that a single cell for each cilium? Why are there so few cilia? The DAPI makes it seem like What are we looking at? Are those multiple nuclei in Figure 6? They seem a little small if that's the 5 µm scale bar.

      We provide uncropped images of Figure 5E to show the entire cells (below). We have added some context to improve the captions. Most mammalian cells, such as MEF and NIH3T3 cells, contain a single primary cilium; however, multiciliated cells do exist. The DAPI staining indicates the nuclei. The cells shown in Figure 6 each have a single nucleus (the scale bar should be 2 µm). Due to the unevenness of the DAPI signal in the nuclei, only the strong signals (puncta) are shown for individual nuclei.

      Author response image 1.

      One small typo: GLL2 instead of GLI2 on line 363

      Thanks, we have corrected this spelling mistake.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The present work establishes 14-3-3 proteins as binding partners of spastin and suggests that this binding is positively regulated by phosphorylation of spastin. The authors show evidence that 14-3-3/spastin binding prevents spastin ubiquitination and final proteasomal degradation, thus increasing the availability of spastin. The authors measured microtubule severing activity in cell lines, and axon regeneration and outgrowth as a proxy for spastin activity. By using drugs and peptides that separately inhibit 14-3-3 binding or spastin activity, they show that both proteins are necessary for axon regeneration in cell culture and in vivo models in rats.

      The following is an account of the major strengths and weaknesses of the methods and results.

      Major strengths

      -The authors performed pulldown assays on spinal cord lysates using GST-spastin, then analyzed pulldowns via mass spectrometry and found 3 peptides common to various forms of 14-3-3 proteins. In co-expression experiments in cell lines, recombinant spastin co-precipitated with all 6 forms of 14-3-3 tested.

      -By protein truncation experiments they found that the Microtubule Binding Domain of spastin contained the binding capability to 14-3-3. This domain contained a putative phosphorylation site, and substitutions that cannot be phosphorylated cannot bind to 14-3-3.

      -spastin overexpression increased neurite growth and branching, and so did the phospho null spastin. On the other hand, the phospho mimetic prevents all kinds of neurite development.

      -Overexpression of GFP-spastin shows a turn-over of about 12 hours when protein synthesis is inhibited by cycloheximide. When 14-3-3 is co-overexpressed, GFP-spastin does not show a decrease by 12 hours. When S233A is expressed, a turn-over of 9 hours is observed, indicating that the ability to be phosphorylated increases the stability of the protein.

      -In support of that notion, the phospho-mimetic S233D makes it more stable, lasting as much as the over-expression of 14-3-3.

      -Authors show that spastin can be ubiquitinated, and that in the presence of ubiquitin, spastin-MT severing activity is inhibited.

      -By combining FC-A with spastazoline, the authors claim that FC-A-increased regeneration is due to increased spastin activity. In various models of neurite outgrowth and regeneration in cell culture and in vivo, the authors show impressive results on the positive effect of FC-A on regeneration, and show that this effect is abolished when spastin is inhibited.

      Major weaknesses

      -However convincing the pull-downs of the expressed proteins, the evidence would be stronger if a co-immunoprecipitation of the endogenous proteins were included.

      We thank the reviewer for the succinct summary of the main results and strengths of our study. We appreciate the reviewer's valuable suggestion and agree that endogenous co-immunoprecipitation (co-IP) experiments in neurons are crucial for supporting our conclusions. To address this point, we cultured cortical neurons in vitro for endogenous IP experiments. The neurons were cultured in neurobasal medium supplemented with 2% B27, with cytarabine added to inhibit the proliferation of glial cells. Proteins were then extracted and subjected to immunoprecipitation using antibodies against spastin. The results, shown in Fig.1C of the revised manuscript, clearly demonstrate that 14-3-3 protein indeed interacts with spastin in neurons.

      -The impact of spastin phosphorylation on the interaction is not well established: there is no indication that the phosphomimetic (S233D) binds 14-3-3 better, and this gap is at odds with the authors' conclusion that the spastin-14-3-3 interaction is necessary for (or increases) spastin function.

      Thank you for your valuable and constructive comments. We agree with this point. To reinforce the importance of phosphorylated spastin in this binding model, we conducted additional experiments in which we transfected S233D into 293T cells and performed immunoprecipitation (Fig.2H). The results clearly demonstrate that spastin (S233D) exhibits enhanced binding to 14-3-3, indicating that phosphorylation at the S233 site is critical for this interaction. Additionally, we observed that spastin (S233D) maintains its binding to 14-3-3 even in the presence of staurosporine. These data further support and strengthen our conclusions.

      -To fully support the authors' suggestion that 14-3-3 and spastin work in the same pathway to promote regeneration, I believe that some key observations are missing.

      1-There is no evidence showing that 14-3-3 overexpression increases the total levels of spastin, not only its turnover.

      Thank you for your consideration and valuable input. We had previously found that overexpression of 14-3-3 leads to an increase in the protein levels of spastin in the absence of CHX (Fig.3E&F). Furthermore, we also observed upregulated protein levels of spastin S233D compared to the wild-type (Fig.3G). We have now included these results in the revised manuscript.

      2- There is no indication that increasing the ubiquitination of spastin decreases its levels. To suggest that proteasomal activity is affecting the levels of a protein, one would expect that proteasomal inhibition (with bortezomib or epoxomicin) would increase its levels.

      Thank you for raising this concern. We agree that this evidence is critical. Indeed, another study by our team is working to elucidate the ubiquitin-mediated degradation pathway of spastin. In addition, a previous study has shown that phosphorylation of the S233 site of spastin can affect its protein stability (Spastin recovery in hereditary spastic paraplegia by preventing neddylation-dependent degradation, doi:10.26508/lsa.202000799.). To better support our conclusions, we have added new results in Fig.3L&M. These show that the proteasome inhibitor MG132 significantly increased the protein level of spastin, whereas CHX significantly decreased it, and that the degradation of spastin was significantly hindered in the presence of both CHX and MG132. This experiment further supports the conclusion that ubiquitination of spastin reduces its protein level.

      3- Authors show that S233D increases MT severing activity, and explain that it is related to increased binding to 14-3-3. An alternative explanation is that phosphorylation at S233 by itself could increase MT severing activity. The authors could test if purified spastin S233D alone could have more potent enzymatic activity.

      We appreciate the reviewer’s consideration. After investigating the interaction between 14-3-3 and spastin, we first aimed to determine whether the S233 phosphorylation mutation of spastin influenced its microtubule-severing activity. We found that overexpression of both S233A and S233D mutants resulted in significant microtubule severing (as indicated by a significant decrease in microtubule fluorescence intensity) (Fig.S2). Furthermore, it is noteworthy that S233 is located outside the microtubule-binding domain (MTBD, 270-328 amino acids) and the AAA region (microtubule-severing region, 342-599 amino acids) of spastin. Based on our initial observations, we believe that the phosphorylation of the S233 residue in spastin does not impact its microtubule-severing function. Additionally, under the same experimental conditions, we observed that the green fluorescence intensity of GFP-spastin S233D was significantly higher than that of GFP-spastin S233A. Based on these phenomena, we speculated that phosphorylation of the S233 residue of spastin might affect its protein stability, leading us to conduct further experiments. Furthermore, we fully acknowledge the reviewer's concern; however, due to technical limitations, we were unable to perform an in vitro assay to test the microtubule-severing activity of spastin. We have provided an explanation for this consideration in the revised version.

      -Finally, I consider that there are simpler explanations for the combined effect of FC-A and spastazoline. FC-A mechanism of action can be very broad, since it will increase the binding of all 14-3-3 proteins with presumably all their substrates, hence the pathways affected can rise to the hundreds. The fact that spastazoline abolishes the FC-A effect may not be because of their direct interaction, but because spastin is a necessary component of the execution of the regeneration machinery further downstream, in line with the fact that spastazoline alone prevented outgrowth and regeneration, and in agreement with previous work showing that normal spastin activity is necessary for regeneration.

      We appreciate the considerations raised by the reviewer. It is evident that spastin is not the exclusive substrate protein for 14-3-3, and it is challenging to demonstrate that 14-3-3 promotes nerve regeneration and recovery of spinal cord injury directly through spastin in vivo. However, we have identified the importance of 14-3-3 and spastin in the process of nerve regeneration. Importantly, we have conducted supplementary experiments to support the stabilization of spastin by FC-A treatment within neurons (Fig.4M), as well as during the repair process of spinal cord injury in vivo (Fig.5D). The results showed that FC-A treatment in cortical neurons could enhance the stability of spastin protein levels, and we also demonstrated a consistent trend of upregulated protein levels of spastin and 14-3-3 following spinal cord injury. Moreover, the protein levels were significantly elevated in the FC-A group of mice. These results also support that 14-3-3 enhances spastin protein stability to promote spinal cord injury repair. The manuscript was revised accordingly.

      Reviewer #2 (Public Review):

      Summary:

      The idea of harnessing small molecules that may affect protein-protein interactions to promote axon regeneration is interesting and worthy of study. In this manuscript, Liu et al. explore a 14-3-3-spastin complex and its role in axon regeneration.

      Strengths:

      Some of the effects of FC-A on locomotor recovery after spinal cord contusion look interesting.

      Weaknesses:

      The manuscript falls short of establishing that a 14-3-3-spastin complex is important for any FC-A-dependent effects and there are several issues with data quality that make it difficult to interpret the results. Importantly, the effects of the spastin inhibitor have a major impact on neurite outgrowth, suggesting that cells simply cannot grow in the presence of the inhibitor and raising serious questions about any selectivity for FC-A-dependent growth. Aspects of the histology following spinal cord injury were not convincing.

      We sincerely appreciate the reviewer for evaluating our manuscript. Given the multitude of substrates that interact with 14-3-3, and considering spastin's indispensable role in neuroregeneration, it is indeed challenging to experimentally establish that FC-A's neuroregenerative effect is directly mediated through spastin in vivo. Therefore, we have provided additional crucial evidence regarding the changes in spastin protein levels following spinal cord injury, as well as the application of FC-A after spinal cord injury. Furthermore, we have made relevant adjustments to the uploaded images to enhance the resolution of the presented figures, as detailed in the subsequent response.

      Reviewer #3 (Public Review):

      Summary: The current manuscript claims that 14-3-3 interacts with spastin and that the 14-3-3/spastin interaction is important to regulate axon regeneration after spinal cord injury.

      Strengths:

      In its present form, this reviewer identified no clear strengths for this manuscript.

      Weaknesses:

      In general, most of the figures lack sufficient quality to allow analyses and support the authors' claims (detailed below). The legends also fail to provide enough information on the figures, which makes it hard to interpret some of them. Most of the quantifications were done based on pseudo-replication. The number of independent experiments (that should be defined as n) is not shown. The overall quality of the written text is also low and typos are too many to list. The original nature of the spinal cord injury-related experiments is unclear as the role of 14-3-3 (and spastin) in axon regeneration has been extensively explored in the past.

      We sincerely appreciate the careful consideration and rigorous evaluation provided by the reviewer. In the revised version, we have made an effort to present high-resolution figures and provide more detailed figure legends. Furthermore, we have made relevant adjustments to the statistical methods in accordance with the reviewer's suggestions. The manuscript has also undergone a thorough review and correction process to eliminate any writing-related errors. Please refer to the following response.

      To the best of our knowledge, there have been no clear reports on the efficacy of 14-3-3 in the repair of spinal cord injury. Kaplan A et al. (doi: 10.1016/j.neuron.2017.02.018) reported a reduction in die-back of the corticospinal tract following spinal cord injury using FC-A as a filler in situ in the lesion site. However, the specific effects of FC-A on spinal cord injury, such as motor function and neural reactivity, as well as the expression characteristics of 14-3-3 after spinal cord injury, have not been extensively elucidated. Additionally, prior research on spastin's role in axon regeneration primarily focused on the effects in Drosophila, and its regenerative effects in the central nervous system of adult mammals after injury have not been reported. Therefore, our study provides crucial insights into the importance of 14-3-3 and spastin in the process of spinal cord injury repair in mammals.
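      The pseudo-replication concern raised above can be illustrated with a minimal sketch. It is not the analysis used in the manuscript; the function name `collapse_to_experiment_means`, the experiment labels, and the neurite-length values are all hypothetical. The idea is only that individual cells from the same culture are averaged first, so that n counts independent experiments rather than cells.

```python
from statistics import mean


def collapse_to_experiment_means(cells_by_experiment):
    """Return one mean value per independent experiment.

    cells_by_experiment maps an experiment label to the list of
    per-cell measurements taken in that experiment (hypothetical data).
    """
    return [mean(values) for values in cells_by_experiment.values()]


# Hypothetical neurite lengths: three independent cultures, several neurons each.
control = {
    "exp1": [100.0, 110.0, 90.0],
    "exp2": [120.0, 130.0],
    "exp3": [95.0, 105.0],
}

experiment_means = collapse_to_experiment_means(control)
n = len(experiment_means)  # n is the number of experiments, not the number of neurons
```

      Any group comparison would then be run on the per-experiment means, which keeps the reported n equal to the number of independent experiments.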

      Reviewer #1 (Recommendations For The Authors):

      There are many spelling and grammar errors, please revise. Examples:

      -approach revealed14-3-3

      -We have detected different many 14-3-3 peptides

      -Line 1057 (D) 14-3-3 agnoist FC-A

      -There is a discrepancy between panel names and figure legend in Figure 4.

      -There is another discrepancy between the color coding of treatments in Figure 7. All panels show "injury" in red and FC-A in orange, but in panel E, these are swapped. This is confusing to readers.

      Thank you for the thorough and rigorous review. We have re-colored the relevant chart. The manuscript has also undergone a thorough review to eliminate any writing-related errors.

      Most images from confocal microscopy are blurred or low resolution. They should be sharper for the type of microscopy used.

      We have adjusted and re-uploaded the images with higher resolution. Additionally, we have enlarged the relevant images.

      The list of all peptides retrieved in the Mass-Spec analyses of the GST-spastin pulldown must be publicly available, according to eLife rules.

      Thank you for your suggestion. We have now uploaded the mass spectrometry data.

      To determine where the 14-3-3/spastin protein complex functions in neurons, we double stained hippocampal neurons with spastin and 14-3-3 antibody, and found that 14-3-3 was colocalized with spastin in the entire cell compartment (Figure 1C).

      Colocalization by confocal fluorescence microscopy is not evidence for protein complexes.

      While co-localization experiments may not directly demonstrate protein-protein interactions, they can still provide valuable insights into the cellular localization of the proteins and suggest potential interactions between them. Therefore, we adjusted the statement.

      Fig1F- Co-immunoprecipitation assay results confirmed that all 14-3-3 isoforms could form direct complexes with spastin.

      CoIP in cells overexpressing the proteins is not evidence that it is direct. That they can interact directly with each other can be extracted from the evidence in vitro with purified proteins.

      We agree with this and we have changed our statement accordingly.

      For a broad audience to have a better understanding, the authors have to explain their amino acid substitutions of Serine233, one mimicking phosphorylation (S233D) and the other rendering the protein unable to be phosphorylated at that position (S233A).

      We appreciate the suggestion. We have provided a more detailed explanation in the revised manuscript.

      The panel of neurons in Fig2G is mislabeled, because it is twice spastin S233A, instead of S233D.

      We apologize for this mistake and we have corrected it in the panel.

      FCA may increase the interaction of 14-3-3 with any of its substrates, including spastin. One would appreciate evidence that FCA increases the MT-severing activity of spastin, as assumed by authors

      We appreciate the reviewer’s suggestion. In this study, we overexpressed spastin to investigate its microtubule severing activity. It is important to note that overexpressing spastin significantly exceeds the normal physiological concentration of the protein. Using excessive amounts of FC-A to enhance the interaction between 14-3-3 and spastin in cells can lead to cell toxicity. Therefore, we chose to overexpress 14-3-3 instead of employing excessive FC-A.

      In Fig2F, the interaction of 14-3-3 with Spas-S233D would have been very informative.

      Thank you for the constructive suggestions from the reviewer. We have supplemented the corresponding co-immunoprecipitation experiments (Fig.).

      The functional effect of S233A and S233D does not correlate with a function of 14-3-3 in neurite outgrowth. This is because S233A does not interact with 14-3-3, however, it is as good as WT spastin... meaning that binding of 14-3-3 with spastin is not necessary...

      We appreciate the reviewer's consideration. The observed phenomenon of spastin WT and S233A promoting axon growth does not align with the physiological state within neurons. This may mask the true effects of S233A or S233D on neuronal axon growth. It is documented that the proper dosage of spastin is essential for neuronal growth and regeneration, as excessive or insufficient amounts can hinder axon growth. Excessive spastin levels can disrupt the overall cellular MTs. Therefore, spastin was moderately expressed by adjusting the transfection dosage and duration. Nevertheless, we were unable to precisely control the expression levels of spastin for both WT and S233A, which also resulted in an overexpression state compared to the physiological state. As a result, the crucial role of spastin S233 in neural growth under physiological conditions may be masked. We have addressed this issue in the revised version of our manuscript.

      In panels 3C and D it is not clear if it does contain 14-3-3.... it seems it does not... but clarify.

      We apologize for any confusion. Since there is endogenous 14-3-3 present in the cells, we utilized spastin S233A and S233D to mimic the binding pattern with 14-3-3 according to the established interaction model. This information has been clarified in the original manuscript.

      Line 217 should indicate Figure 3, not Figure 5

      We have made the corresponding corrections.

      In F3G, it is intriguing that the input blot shows a decrease in Ubiquitin proteins when there is expression of flag ubiquitin...

      We apologize for the error in our presentation. In the control group, we actually overexpressed Flag-ubiquitin and GFP instead of Flag and GFP-spastin. Additionally, to further elucidate the impact of different phosphorylation states on spastin ubiquitination and degradation, we have conducted additional ubiquitination experiments (Fig.3N), which are now included in the revised version of our manuscript.

      S233 mutations seem to affect the effective turnover of spastin, but do not seem to change the levels of the spastin protein...hence, the conclusion that 14-3-3 protects from degradation is overstated.

      We thank the reviewers for the careful review and we have revised the statement accordingly.

      The mode of action of R18 FCA should be introduced earlier in the text.

      Thank you for the reviewer's correction. We have provided a corresponding description of the effects of FC-A and R18 on the interaction between 14-3-3 and spastin in the ubiquitination experiments section of the manuscript.

      Line 296 reads: Our results revealed that levels of 14-3-3 protein remained high even at 30 DPI, indicating that 14-3-3 plays an important role in the recovery of spinal cord injury.

      This is overstated since it can well be that an upregulated protein is inhibitory.

      We thank the reviewers for their consideration and we have made adjustments accordingly.

      It is not clear if 14-3-3 prevents ubiquitination of spastin, then its levels should be higher... it is noteworthy that they did not measure its levels in nerve tissue after injury. For example, in experiments shown in Figure 5A, it would have been very useful the observation of the levels of spastin.

      We appreciate the reviewer's consideration. We have now included the assessment of spastin protein levels following spinal cord injury. Additionally, we have collected the injured spinal cord lysates in mice treated with FC-A for western blot analysis. The results revealed that the expression trend of 14-3-3 protein is largely consistent with spastin after spinal cord injury. Furthermore, the treatment with FC-A was found to enhance the expression of spastin after spinal cord injury (Fig. 5C&D).

      Panel 5G reads "nerve regeneration across the lesion site", but it actually measured NF levels, according to the legend.

      Thanks to the reviewers for the critical review. We have revised the chart accordingly.

      361 "BMS" should be explained in the results section for a better understanding of the results by non-experts.

      Thank you to the reviewers for their suggestions. We have explained this in the results section accordingly.

      Reviewer #2 (Recommendations For The Authors):

      1. The results of the mass spec and co-IP in Figure 1 are unclear.

      a) Are all of the peptides in Fig. 1A from 14-3-3 and were there only 3 14-3-3 peptides that were identified?

      The mass spectrum results did identify only three 14-3-3 peptides, and these three peptides were highly conserved across all isoforms.

      b) The blot in panel B needs to show the input band for spastin and 14-3-3 from the same gel and not spliced so that the level of enrichment can be evaluated in the co-IP.

      In response to the reviewer's comment, we now present the whole gel (Fig.1B).

      c) Further, does an IP for 14-3-3 co-precipitate spastin?

      Thank you for your feedback. Our 14-3-3 antibody is suitable for Western blot experiments and recognizes all isoforms (Pan 14-3-3, Cell Signaling Technology, Cat #8312). Unfortunately, it is not suitable for immunoprecipitation (IP) experiments. Therefore, we have employed additional approaches, namely immunoprecipitation and pull-down assays, to further investigate the interaction between 14-3-3 and spastin.

      2. It is difficult to say anything about 14-3-3-spastin co-localization in hippocampal neurons (Fig. 1C) since 14-3-3 labels the entire hippocampal neuron so any protein will co-localize.

      We appreciate the comments. The co-localization experiments have provided evidence of the relative expression of both 14-3-3 and spastin in neurons, suggesting their potential interaction within neuronal cells. We have made the necessary revisions to accurately describe the results of the co-localization experiments in the manuscript.

      To further investigate the interaction between 14-3-3 and spastin within neurons, we have conducted additional co-immunoprecipitation (Co-IP) experiments using cortical neuron lysates (Fig.1C).

      3. The molecular weight of 14-3-3 is 25-28 kDa but the band in panel 1B and in subsequent figures is below 15 kDa. Fig. 1F - the spastin band also seems to be low compared to the predicted molecular weight and other Western blot reports in the literature, so some indication of how the antibody was validated would be important.

      Apologies for the mistakes. We have carefully re-evaluated the western blot images (See Author response image 1). We have confirmed that the molecular weight of the 14-3-3 protein is approximately 33 kDa. In the case of spastin, its molecular weight is around 55-70 kDa. Additionally, the GFP-spastin fusion protein has an estimated molecular weight of approximately 90 kDa. We have conducted a thorough verification and made appropriate adjustments to the molecular weight labels in all western blot images.

      Author response image 1.

      4. Fig 1G is a co-immunoprecipitation and it is not clear what the authors mean by "direct complexes" as claimed in line 150 of the results since this does not show direct binding between 14-3-3 and spastin. None of the assays in Fig. 1 assess "direct" binding between the two proteins and the authors should be clear in their interpretation.

      We agree with the reviewer's comments and have removed the word "direct" from the text.

      5. Fig. 1D - there is no validation that staurosporine (protein kinase inhibitor, not protein kinase as per typo in Line 167) affects the phosphorylation levels of spastin.

      Thank you for your valuable comments. In our group, we have conducted another study that has confirmed the involvement of CAMKII in mediating spastin phosphorylation. Furthermore, we have found that the addition of staurosporine significantly reduces the phosphorylation levels of spastin (unpublished results). In response to the reviewer's comment, we are pleased to provide western blot experiments demonstrating the effect of staurosporine on reducing spastin phosphorylation. The phosphorylation levels of spastin were assessed using a Pan Phospho antibody (Fig.2D).

      6. Fig. 2F - it would be important to test if spastin S233D interacts more robustly with 14-3-3 and if this is insensitive to staurosporine.

      Thank you for your comments. The suggestion provided by the reviewer is highly significant for supporting our conclusion that "phosphorylation of spastin is a prerequisite for its interaction with 14-3-3." Therefore, we have conducted additional immunoprecipitation experiments to further supplement our findings (Fig.2H). The experimental results demonstrate that the binding affinity between spastin S233D and 14-3-3 is stronger compared to spastin WT.

      7. Line 179 "Next, we transfected Ser233 mutation of spastin (spastin S233A or spastin S233D) with flag tagged 14-3-3 and generated Pearson's correlation coefficients. Results revealed that spastin S233D was markedly colocalized with 14-3-3, with minimal colocalization with spastin S233A (Figure 2A-B)." Assuming the authors are referring to supplemental Figure 2, the 14-3-3 covers the entire cell thus I think measures of co-localization are uninterpretable.

      We agree with the reviewer's comment. We realize that 14-3-3θ exhibits a ubiquitous cellular distribution, which renders the measurement of its co-localization coefficients inconclusive. Therefore, we have decided to remove Supplementary Figure 2 from the manuscript.

      8. Line 189 "Consistent with earlier results, spastin promoted neurite outgrowth, as evidenced by both the length and total branches of neurite." - It is unclear what earlier results the authors are referring to. The authors should clarify how they determined the "moderate" expression level.

      We thank the reviewer for the suggestions. The "earlier results" mentioned here refer to previously published articles, and we have now added the relevant references. Existing literature indicates that an appropriate dosage of spastin is necessary for neuronal growth and regeneration, whereas both excessive and insufficient amounts of spastin are detrimental to axonal growth. Excessive spastin disrupts the overall microtubule network within cells. We controlled the plasmid transfection dosage and transfection duration to achieve moderate expression. We have provided an explanation of these details in the revised version.

      9. The effects of WT spastin and spastin S233A were similar in spite of the fact that S233A does not bind to 14-3-3, which is inconsistent with the authors' model that spastin-14-3-3 binding promotes growth. Line 191 - the authors mention that spastin S233D was toxic but I do not see any cell death measurements. I assume the bottom right panel in Fig. 2G labelled as spastin S233A is mislabeled and should be S233D.

      In response to comment 8, the transfection of both wild-type (WT) spastin and S233A mutant failed to precisely control the expression levels around the physiological concentration. Consequently, we observed an overexpression of spastin in both cases, which obscured the critical role of S233 phosphorylation in neurite outgrowth. We have addressed this issue in the revised version of the manuscript.

      10. Fig. 3. Does spastin(S233D) bind constitutively to 14-3-3? Why is spastin S233A not less stable than WT spastin based on the authors' model?

      We propose that 14-3-3 is more likely to interact with spastin S233D in a non-constitutive manner. The instability of the S233A protein is attributed to the disruption of its ubiquitination degradation process due to the absence of 14-3-3 binding.

      11. The ubiquitin blot in Fig. 3G is not convincing and not quantified.

      We acknowledge the mislabeling in our figures. In the control group, Flag-Ubiquitin was also overexpressed, and we transfected GFP as a control instead of GFP-spastin. To further enhance the reliability, we conducted additional ubiquitination experiments (Fig.3N), which revealed a significant increase in spastin (S233A) ubiquitination levels compared to the WT group, consistent with previous research findings (Spastin recovery in hereditary spastic paraplegia by preventing neddylation-dependent degradation, doi:10.26508/lsa.202000799). Additionally, we observed that the addition of R18 could partially enhance spastin ubiquitination levels, as quantitatively illustrated in the figure (Fig.3O). This result further underscores the inhibitory role of 14-3-3 in the ubiquitination degradation pathway of spastin.

      12. I do not understand how the glutamate injury fits with the narrative (Fig. 4C).

      Excessive glutamate exposure can induce severe intracellular oxidative stress reactions, leading to the disruption of physiological processes such as mitochondrial energy production. This, in turn, results in the swelling and lysis of neuronal processes, a phenomenon known as neuronal necrosis. During this state, neurite maintenance is obstructed, and neurites exhibit swelling and breakage (Glutamate-induced neuronal death: a succession of necrosis or apoptosis depending on mitochondrial function. Neuron. 1995 Oct;15(4):961-73). We have provided a more comprehensive explanation of this phenomenon in the revised version of our manuscript.

      13. Some commentary about the selectivity of spastazoline to inhibit spastin should be included - it would be helpful if the authors could explain that this is a spastin inhibitor in the manuscript. FC-A still seems to promote growth in the presence of spastazoline, suggesting that the FC-A effects are not dependent on spastin (Fig. 4E). The statistical analysis section of the materials and methods indicates that multiple groups were analyzed by one-way ANOVA. This seems unusual since the controls for cellular transfection are different than for small molecules (FC-A) and for peptides such as R18. As such, there is no vehicle control for the FC-A condition and it is difficult to assess the FC-A vs spastazoline vs FC-A + spastazoline. The authors should clarify (Fig. 4E-J).

      Thank you for the reviewer’s suggestions. In the revised version, we have provided a more detailed explanation of the specific inhibition of spastin's severing function by spastazoline.

      We observed that FC-A, in combination with spastazoline, still exhibited a certain degree of promotion in neurite growth compared to the injury group under glutamate-induced injury conditions. Evidently, spastin is not the exclusive substrate for 14-3-3, and FC-A might delay cellular oxidative stress reactions by facilitating the interaction of 14-3-3 with other substrates, such as the FOXO transcription factors as mentioned in the introduction. Nevertheless, our results still demonstrate that the addition of spastazoline significantly diminishes the promoting effect of FC-A on neurite growth, indicating that FC-A affects neuronal growth by impacting spastin.

      Furthermore, in the drug-treated groups, we overexpressed GFP to trace the morphology of neurons. Culture media were exchanged following transfection, and drugs were added during the media exchange. An equivalent amount of DMSO or ethanol was added as a control to rule out the influence of solvents on neurons.

      14. There is a good possibility that spastin is required for all axon regeneration and that there is no selectivity for the FC-A pathway and this is a major issue with the interpretation of the manuscript (Fig 4K-L).

      We acknowledge this point. Clearly, spastin is not the exclusive substrate for 14-3-3, and our experimental evidence does not establish that 14-3-3 solely promotes neuronal regeneration through spastin. Nevertheless, we have identified the significance of 14-3-3 and spastin in the process of neural regeneration. Furthermore, we conducted complementary experiments to support the stability of spastin by FC-A treatment both in vitro and in vivo. We found an enhanced protein expression in cortical neurons after FC-A treatment (Fig.4M). Also, the results indicate a consistent elevation trend in the protein levels of spastin and 14-3-3 following spinal cord injury (Fig.5C&H). Moreover, in the FC-A group of mice, there was a significant increase in spastin protein levels (Fig.5D&I). These results also support that 14-3-3 promotes spinal cord injury repair by enhancing spastin protein stability.

      15. Fig. 5C- it is unclear where the photomicrographs were taken relative to the lesion.

      We obtained tissue sections from the lesion core and the above segments for histological analysis. Given the scarcity of neural tissue at the injury center, we selected tissue slices as close as possible to the lesion core to illustrate the relationship between 14-3-3 and the injured neurons. We have provided an explanation of this in the revised version of the manuscript.

      16. The authors need to provide some evidence that the FC-A and spastazoline compounds are accessing the CNS following IP injection.

      We thank the reviewer for the suggestion. Although direct visualization evidence of FC-A and spastazoline entering the CNS is challenging to obtain, several indicators suggest drug penetration into spinal cord tissue. First, behavioral and electrophysiological experiments in vivo demonstrate that drug injections indeed affect the neural activity of mice. Second, following spinal cord injury, the blood-spinal cord barrier was disrupted at the injury site; combined with the fact that both FC-A (molecular weight: 680.82 Da) and spastazoline (molecular weight: 382.51 Da) are small-molecule drugs, this increases the likelihood of these molecules entering the injured spinal cord tissue. Furthermore, our microtubule staining results indicated that FC-A and spastazoline did influence the acetylation ratio of microtubules. These findings support drug penetration into spinal cord tissue.

      17. Some quantification of Fig. 5D would be important to support the contention that the lesion site is impacted by FC-A treatment.

      Thank you for the suggestion. We have included quantitative analysis for Figure 5D (Figure) as recommended.

      18. The NF and 5-HT staining in Fig. 5D and in Fig. 7A and B does not clearly define fibers and is not convincing.

      We appreciate the concerns. As we did not present whole nerve fibers, we employed NF and 5-HT immunoreactive fluorescence intensity, rather than axons per square millimeter, as an indicator to assess the regeneration of nerve fibers, as previously described (Baltan S, et al. J Neurosci. 2011 Mar 16;31(11):3990-9; Iwai M, et al. Stroke. 2010 May;41(5):1032-7; Wang Y, et al. Elife. 2018 Sep 12;7:e39016; Altmann C, et al. Mol Neurodegeneration. 2016 Oct 22;11(1):69).

      Our results showed that in the spinal cord injury group, there was strongly decreased NF-positive staining (with a slight increase in 5-HT). In contrast, the FC-A treatment group exhibited a significantly higher abundance of NF-positive signals (or an increased 5-HT signal) in the lesion site, which also suggests the reparative effect of FC-A on nerves. We also intend to refine our immunohistochemical methods in future experiments.

      Minor Comments:

      1. Lines 80-84. To my knowledge the only manuscripts examining the effects of spastin in axon regeneration models include the analysis in Drosophila (i.e. ref 15 and 16) and a study in sciatic nerve that reported an index of functional recovery but did not perform any histology to assess axon regeneration phenotypes. The literature should be more accurately reflected in the introduction.

      We appreciate the suggestions from the reviewer. In the revised version, we have provided further clarification on the novelty of spastin in the spinal cord injury repair process.

      2. Line 73: The meaning of the following statement needs to be clarified: "spastin has two major isoforms, namely M1 and M87, coded form different initial sites."

      We have provided additional elaboration for this statement in the revised version.

      3. Line 216: "Results indicated that GFP-spastin could be ubiquitinated, while inhibiting the binding of 14-3-3/spastin promoted spastin ubiquitination (Figure 5G)." - Should be Fig 3G

      Sorry about the mistake. We have made the corresponding changes in the revised version.

      4. Line 255: "Briefly, we established a neural injury model as previously described(31)" - the basics of the injury model need to be described in this manuscript.

      In the revised version, we have provided further elaboration on the glutamate-induced neuronal injury model.

      Reviewer #3 (Recommendations For The Authors):

      Figure 1: A- Both legend and text fail to provide detail on this specific panel.

      We have provided a more detailed and comprehensive description of the legend and results in this section.

      B- Is the contribution of non-neuronal cells for co-IPs relevant? Co-IP with isolated neuronal extracts (instead of spinal cord tissue) should be performed.

      We thank the reviewer for the suggestion. To further elucidate their interaction within neurons, cortical neurons were cultured in Neurobasal medium supplemented with 2% B27, with cytarabine added to inhibit glial cell growth, and the cells were lysed for co-IP experiments (Fig.1C). The results demonstrated the interaction between 14-3-3 and spastin within neurons.

      C- Both spastin and 14-3-3 appear to label the entire neuron with similar intensities throughout the entire cell which is rather unusual. Conditions of immunofluorescence should be improved and z-projections should be provided to support co-localization.

      Thanks for the comment. Our dual-labeling experiments indicated that 14-3-3 exhibits a characteristic whole-cell distribution. Therefore, this result cannot confirm the interaction between 14-3-3 and spastin within neurons, but it does provide evidence regarding the intracellular distribution patterns of 14-3-3 and spastin. Consequently, we added endogenous co-IP experiments in neurons to further demonstrate the direct interaction between 14-3-3 and spastin, and we have modified the wording in the revised version accordingly.

      D- xx and yy axis information is either lacking or incomplete.

      We have made the corrections to the figures.

      E- It would be useful to show the conservation between the different 14-3-3 isoforms.

      We appreciate the suggestions. We have included a conservation analysis of 14-3-3 to assist readers in better understanding these results (Fig.1F).

      Figure 2:

      D- The experiment using a general protein kinase inhibitor does not allow concluding that the specific phosphorylation of spastin is sufficient for binding to 14-3-3. An alternative phosphorylated protein might be involved in the process.

      We appreciate the reviewer's consideration. We believe this serves as a prerequisite condition to demonstrate that "14-3-3 binding to spastin requires spastin phosphorylation." In fact, another project in our group has confirmed that CAMK II can mediate spastin phosphorylation, and the addition of staurosporine significantly reduces spastin phosphorylation levels (unpublished results). Here, we provide the western blot experiment showing the decrease in spastin phosphorylation under staurosporine treatment, with phosphorylation levels detected using the Pan Phospho antibody (Fig.2D).

      H and I- Pseudo-replication. Only independent experiments should be plotted and not data on multiple cells obtained in the same experiment. Please indicate the number of independent experiments.

      We appreciate the reviewer's correction. We have now plotted the mean values of three independent experiments and revised the statistical charts accordingly.
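      The correction for pseudo-replication described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the measurement values and experiment labels are hypothetical. Each independent experiment is first collapsed to one mean, and the summary statistics are then computed on those means, so that n counts experiments rather than cells.

```python
from statistics import mean, stdev

# Hypothetical per-cell measurements (arbitrary units), grouped by
# independent experiment. These numbers are illustrative only.
cells_by_experiment = {
    "exp1": [10.0, 12.0, 14.0],
    "exp2": [20.0, 22.0],
    "exp3": [15.0, 15.0, 15.0, 15.0],
}


def experiment_means(groups):
    """Collapse each independent experiment to a single mean value."""
    return [mean(values) for values in groups.values()]


per_experiment = experiment_means(cells_by_experiment)
n = len(per_experiment)              # number of independent experiments, not cells
summary_mean = mean(per_experiment)  # mean of the experiment means
summary_sd = stdev(per_experiment)   # sample SD across experiments

print(n, per_experiment, summary_mean, round(summary_sd, 3))
```

      Any downstream statistical test would then take the per-experiment means as its input rather than the pooled single-cell values.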

      Figure 3:

      The rationale for the hypothesis that spastin S233D transfection might upregulate the expression of spastin relative to WT and spastin S233A is unclear.

      We appreciate the reviewer's consideration. We have added the relevant results, shown in Fig.3G, which demonstrate that 14-3-3 can enhance the protein levels of spastin, and that phosphomimetic spastin (S233D) exhibits a significantly increased protein level compared to wild-type spastin. These findings indicate that 14-3-3 not only inhibits the degradation of spastin but also increases its protein levels.

      I- pseudo-replication. Please plot and do statistical analysis of independent experiments.

      Thank you for the reviewer's corrections. We have made the necessary revisions.

      Figure 4: E-J: I- pseudo-replication. Please plot and do statistical analysis of independent experiments.

      Thank you for the reviewer's corrections. We have made the necessary revisions.

      Figure 5:

      B- Please show individual data points.

      Thank you for the reviewer's corrections. We have made the necessary revisions.

      D- Longitudinal images of spinal cords where spastazoline was used cannot correspond to contusion as there is a very sharp discontinuity between the rostral and caudal spinal cord tissue. A full transection seems to have occurred. Alternatively, technical problems with tissue collection/preservation might have occurred.

      Thank you for the reviewer's consideration. The sharp discontinuity observed in the spastazoline group is not due to modeling issues but rather a result of the drug's effects on the injury site. This is primarily because spastin plays a crucial role not only in neuronal development but also in mitosis. Because stromal cells proliferate very actively at the injury site, spastazoline may inhibit the proliferation of injury site-related stromal cells, thereby impeding the wound healing process following spinal cord injury and resulting in the observed discontinuous injury gap. We have revised the manuscript accordingly.

      E- Images do not have the quality to allow analysis. 5HT staining should not be considered as a clear axonal labeling is not seen. This is also the case for neurofilament staining.

      We appreciate the concerns. Because we did not present whole nerve fibers, we used NF and 5-HT immunoreactive fluorescence intensity, rather than the number of axons per square millimeter, as an indicator of nerve fiber regeneration, as previously described (Baltan S, et al. J Neurosci. 2011 Mar 16;31(11):3990-9; Iwai M, et al. Stroke. 2010 May;41(5):1032-7; Wang Y, et al. Elife. 2018 Sep 12;7:e39016; Altmann C, et al. Mol Neurodegeneration. 2016 Oct 22;11(1):69).

      Our results showed that in the spinal cord injury group, NF-positive staining was strongly decreased (with a slight increase in 5-HT). In contrast, the FC-A treatment group exhibited a significantly higher abundance of NF-positive signals (or an increased 5-HT signal) at the lesion site, which also suggests a reparative effect of FC-A on nerves. We also intend to refine our immunohistochemical methods in future experiments.

      F- Images do not allow analysis. Higher magnifications are needed.

      Thank you for the reviewer's consideration. We have now included higher-magnification images (Fig.5M) to address this concern.

      Figure 7:

      Same issues as in Figure 5.

      A- Images do not have the quality to allow analysis. 5HT staining should not be considered as a clear axonal labeling is not seen.

      B- Images do not have the quality to allow analysis. Neurofilament staining should not be considered as clear axonal labeling is not seen. MBP staining does not have a pattern consistent with myelin staining

      We appreciate the concerns. Because we did not present whole nerve fibers, we used NF and 5-HT immunoreactive fluorescence intensity, rather than the number of axons per square millimeter, as an indicator of nerve fiber regeneration, as previously described (Baltan S, et al. J Neurosci. 2011 Mar 16;31(11):3990-9; Iwai M, et al. Stroke. 2010 May;41(5):1032-7; Wang Y, et al. Elife. 2018 Sep 12;7:e39016; Altmann C, et al. Mol Neurodegeneration. 2016 Oct 22;11(1):69). In this study, sagittal slices were used. MBP covers the axonal surface, indicating its co-localization with the axons. However, because we did not present intact nerve fibers, we were unable to show the typical myelin staining pattern of MBP.

    1. Author Response:

      The following is the authors' response to the original reviews.

      We were pleased with the overall enthusiastic comments of the reviewers:

      • Reviewer #1: “This manuscript by Mahlandt, et al. presents a significant advance in the manipulation of endothelial barriers with spatiotemporal precision”

      • Reviewer #2: “The immediate and repeatable responses of barrier integrity changes upon light-on and light-off switches are fascinating and impressive.”

      • Reviewer #3: “, these molecular tools will be of broad interest to cell biologists interested in this family of GTPases.”

      We thank the reviewers for their fair and constructive comments that helped us to improve the manuscript.

      Reviewer #1 (Recommendations For The Authors):

      1) This paper is likely to attract a diverse audience. However, the order of data presented in this manuscript can be confusing or challenging to follow for the naive reader. This is because the tool characterization is split into two parts: before the barrier strength assay (selection of optogenetic platform and tool expression) and after (characterization of cell morphology with global and local optogenetic stimulation). Reorganizing the results such that the barrier strength results follows from an understanding of individual cell responses to stimulation may improve the ability of this readership to understand the factors at play in the changes in barrier strength observed when opto-RhoGEFs are activated.

      We appreciate this idea. We initially structured the paper in the proposed order but then decided that we wanted to put more focus on the barrier strength results by presenting them already in the second figure. Therefore, we prefer to keep this order of figures.

      2) While the description of the selection of iLID as the study's optogenetic platform is clear, a better job could be done motivating the need for engineering new optogenetic tools for the control of GEF recruitment. Given that iLID-based tools for GEFs of RhoA, Rac1, and Cdc42 already exist, some of which are cited in the introduction, more information on why these tools were not used would be helpful: were these tools tested in endothelial cells and found lacking?

      The original system has the domain structure DHPH-tagRFP-SspB. However, we wanted to work with an SspB-FP-GEF construct, which would allow easy exchange of the FP and the DHPH domain. This modular approach allowed us to generate and compare the mCherry, iRFP647 and HaloTag versions. We don’t want to claim that we engineered an entirely new optogenetic tool but rather that we optimized an existing one with different tags. To make this clearer, we added: ‘The membrane tag of the original iLID was changed to an optimized anchor. In addition, we modified the sequence of the domains to SspB, tag, GEF to simplify the exchange of GEF and genetically encoded tag. A set of plasmids with different fluorescent tags was created for more flexibility in co-imaging.’

      3) Comment on the reason behind using DHPH vs. DH domains for each GEF is needed.

      We have previously found (and this is supported by biochemical analysis of GEF activity) that the selected domains provide the best activity. We will add a reference and the following to the text: ‘Their catalytic active DHPH domains were used for ITSN1 and TIAM1 (Reinhard et al., 2019). In case of p63 the DH domain only was used, because the PH domain of p63 inhibits the GEF activity (Van Unen et al., 2015) (Fig. 1E).’

      4) Since multiple Rho GTPases (e.g., RhoA, RhoB, RhoC) exist and Rho is used as the name of the GTPase family, please use RhoA where applicable for clarity.

      Since the RhoGEFp63 will activate RhoA/B/C we would rather not refer to RhoA only. We will clarify this in the text: ‘Three GEFs were selected, ITSN1, TIAM1 and RhoGEFp63, which are known to specifically activate respectively Cdc42, Rac and Rho and their isoforms.’

      5) A brief comment on the use of HeLa cells for protein engineering and characterization (versus the endothelial cells motivated in the introduction) may be helpful.

      We added the following to the text: ‘HeLa cells were used for the tool optimization because of easier handling and higher transfection rate in comparison to endothelial cells.’

      Minor suggestions:

      In figure 1C, line sections showing intensity profiles before and after protein dimerization might further emphasize the change in biosensor localization.

      We are not fans of intensity profiles, as the profile depends strongly on the position of the line and it essentially turns a 2D image into 1D data for a single image. We therefore prefer to stick to the quantification shown in panel 1B (which shows data from multiple cells).

      Reviewer #2 (Recommendations For The Authors):

      1) The study has analyzed the effects of light-induced activation of the three optogenetic constructs in endothelial cells on their barrier function (electrical resistance) at high cell density and correlated the findings with the cellular overlap-producing effects on endothelial cells cultured at sparse cell density. It should be tried to show these effects at a cell density where these light-induced effects increase electrical resistance. Lifeact with different chromophores in adjacent cells might be useful.

      We had attempted to measure the overlap in a monolayer by taking advantage of the Halotag and the variety of dyes available by staining one pool of cells red with JF 552 nm and the other far red with the JF 635 nm dye. However, the cells need at least 24 h to form a monolayer and by then they had exchanged the dye and red and far red pool could not be distinguished any longer.

      Therefore, we used the Lck-mTq2-iLID construct, which already marks the plasma membrane of the cells. We created a mosaic monolayer of cells expressing mScarlet-CaaX and cells expressing Lck-mTq2-iLID + SspB-HaloTag-TIAM(DHPH). We observed an increase in the overlap between cells under this condition. The results have been added to figure 4 - figure supplement 2I&J. To the text we added:

      ‘Additionally, cell-cell membrane overlap increased about 20 %, upon photo-activation of OptoTIAM, in a mosaic expression monolayer (figure 4 - figure supplement 2I,J, Animation 22)’

      2) The authors correctly state that some reports have shown that S1P can increase endothelial barrier function in VE-cadherin independent ways and these are related to Rac and Cdc42. This was also shown for Tie-2 in vitro and even in vivo in the absence of VE-cadherin and should also be mentioned.

      We added the following to the text: ‘Not only S1P promotes endothelial barrier independent from VE-cadherin, also Tie2 can increase barrier resistance in the absence of VE-cadherin (Frye et al. 2015).’

      Since a blocking antibody against VE-cadherin was used, a negative control antibody should be tested which also binds to endothelial cells.

      To visualize the cell-cell junctions in the experiment shown in Supplemental Fig 3.1, we added a non-blocking VE-cadherin antibody that is directly labeled with ALEXA 647 and shows normal junction morphology. These experiments already give an indication that the live labeling antibody of VE-cadherin does not disturb the junction morphology. However, when we added the blocking antibody against VE-cadherin, known to interfere with the trans-interactions of VE-cadherin, a rapid disruption of the junctions is observed.

      Additionally, previous work has shown that the VE-cadherin labeling antibody does not interfere with junction dynamics and function (see Figure 2.A, Kroon et al. 2014 ‘Real-time imaging of endothelial cell-cell junctions during neutrophil transmigration under physiological flow’, jove.). We have added the figures below, showing that addition of the control IgG and VE-cadherin 55-7H1 Abs at the timepoint marked by the dotted line did not interfere with the resistance, whereas the blocking Ab drastically reduced resistance. We have added this reference to the results: ‘Previous work has shown the specific blocking effect of this antibody in comparison to the VE-cadherin (55-7H1) labeling antibody (Kroon et al., 2014).’

      Author response image 1.

      Reviewer #3 (Recommendations For The Authors):

      Additional comments for the authors:

      1) The introduction is very long and would benefit from a more concise emphasis on the information required to put the work and results in context and understand their importance.

      Comment: we appreciate the comment of the reviewer. However, we wish to introduce the topic and the tools thoroughly and therefore we chose to keep the introduction as it is.

      2) The N-terminal membrane-binding domain does not homogeneously translocate to the plasma membrane, since lck is a raft-associated kinase. Please comment on this.

      In our hands, the Lck is among the most selective and efficient tags for plasma membrane localization (https://doi.org/10.1101/160374). We do observe homogeneous translocation, but our resolution is limited to ~200 nm and so we cannot exclude that the Lck concentrates in structures smaller than 200 nm. Given the robust performance of the lck-based iLID anchor in the optogenetics experiments, we think that the Lck anchor is a good choice.

      3) Figure 1D is not very clear. What does 25 or 36% change mean? If iLID tg is conjugated to these sequences, its cytosolic localization should be reduced versus iLID alone. Is this what the graph wants to express? If so, please, label properly the ordinate axis in the graph (% of non-tagged iLID values?)

      The graph represents the recruitment efficiency of SspB to the plasma membrane for the two different membrane tags that target iLID to the plasma membrane. The recruitment efficiency was measured as the depletion of SspB-mScarlet intensity in the cytosol upon light activation, and is expressed as a percentage change.

      We added the following to the title of the graph: SspB recruitment efficiency for Plasma Membrane tagged iLID.
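      The percentage change described above can be illustrated with a short sketch. The formula is an assumption about how such a depletion value is typically computed (change relative to the pre-activation intensity), and the function name and intensity values are hypothetical, not taken from the manuscript.

```python
def cytosolic_change_percent(before, after):
    """Percentage change of cytosolic intensity relative to the
    pre-activation value. A negative result means depletion, i.e.
    recruitment of the cytosolic protein to the membrane."""
    if before <= 0:
        raise ValueError("pre-activation intensity must be positive")
    return (after - before) / before * 100.0


# Hypothetical background-corrected cytosolic intensities (arbitrary units)
change = cytosolic_change_percent(before=200.0, after=150.0)
print(change)
```

      Under this convention, a larger depletion in the cytosol corresponds to a more efficient recruitment to the membrane anchor.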

      4) Supplemental figures in the main text. Fig S1D in the text refers to data in Fig S1E and Fig S1E is supposed to be Fig S1F? (page 11).

      That is correct. The mistakes have been corrected (and this is now renamed to figure 1 - figure supplement 1E and 1F).

      5) Figure 3. Contribution of VE-cadherin. Other junctional complexes, such as tight junctions may also intervene. However, these results would also suggest that cell-substrate adhesion rather than cell-cell junctions may modulate the barrier properties, as it has been previously demonstrated for example by imatinib-mediated activation of Rac1 (Aman et al. Circulation 2012). The ECIS system used to measure TEER in the quantitative barrier function assays can modulate these measurements and discriminate between paracellular permeability (Rb) and cell-substrate adhesion (alpha). Please, provide whether the optogenetic modulation of these GTPases does indeed regulate Rb or alpha.

      ‘The measured impedance is made up of two components: capacitance and resistance. At relatively high AC frequencies (> 32,000 Hz) more current capacitively couples directly through the plasma membranes. At relatively low frequencies (≤ 4000 Hz), the current flows in the solution channels under and between adjacent endothelial cells’ (https://www.biophysics.com/whatIsECIS.php).

      Therefore, the high-frequency impedance represents cell-substrate adhesion, whereas the low-frequency impedance responds more strongly to changes in cell-cell junction connections.

      We only measured at 4000 Hz, representing the paracellular permeability. We chose a single frequency to maximize time resolution.
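      The frequency dependence mentioned above can be made concrete with a simplified sketch. It treats the membrane as an ideal capacitor, whose reactance is 1/(2πfC), and is not the full ECIS model. The capacitance value is a hypothetical placeholder chosen only for illustration.

```python
import math


def capacitive_reactance(frequency_hz, capacitance_f):
    """Magnitude of the impedance of an ideal capacitor, in ohms."""
    return 1.0 / (2.0 * math.pi * frequency_hz * capacitance_f)


# Hypothetical membrane capacitance of 10 nF (illustrative assumption)
C = 1e-8

low = capacitive_reactance(4000.0, C)    # low-frequency case
high = capacitive_reactance(32000.0, C)  # high-frequency case
ratio = low / high                       # how much easier the capacitive path is at high frequency

print(round(low, 1), round(high, 1), round(ratio, 3))
```

      Because the reactance falls in inverse proportion to frequency, the capacitive route through the membranes carries relatively more current at high frequencies, while at low frequencies more of the current is left to the paracellular route.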

      We have added this extra comment to the legend of the figure: ‘(B) Resistance of a monolayer of BOECs stably expressing Lck-mTurquoise2-iLID, solely as a control (grey), and either SspB-HaloTag-TIAM1(DHPH)(purple)/ ITSN1(DHPH) (blue) or p63RhoGEF(DH) (green) measured with ECIS at 4000 Hz, representing paracellular permeability, every 10 s.’

    1. Author response:

      The following is the authors’ response to the original reviews.

      We appreciate your comments and suggestions on our manuscript.

      In particular, we have measured the affinity between the middle tail domain of myosin-5a (Myo5a-MTD) and the actin-binding domain of melanophilin (Mlph-ABD) using microscale thermophoresis, and obtained a Kd of ~0.56 uM, which is similar to the Kd of the globular tail domain of myosin-5a (Myo5a-GTD) for the GTD-binding motif of melanophilin (Mlph-GTBM). Moreover, we have performed Western blots of the lysates of transfected cells, showing that the proteins of the dominant negative construct and the negative control were expressed at similar levels without noticeable degradation.

      We appreciate the editors’ and reviewers’ comments on how melanophilin might be regulated in binding to the exon-G of myosin-5a and to actin filaments. Phosphorylation of melanophilin by protein kinase A is one possible mechanism. We will investigate this issue in our future study.

      We also took this opportunity to correct several minor errors in the manuscript. Textual alterations can be viewed in the “tracked change” version of the manuscript. Below are the comments from the editors and the two reviewers together with our point-by-point responses.

      eLife assessment

      This study represents a useful description of a third interaction site between melanophilin and myosin-5a which is important in regulating the distribution of pigment granules in melanocytes. While much of the data forms a solid case for this interaction, the inclusion of important controls for the cellular studies and measurement of interaction affinities would have been helpful.

      Public Reviews:

      Reviewer #1 (Public Review):

      Interactions known to be important for melanosome transport include exon F and the globular tail domain (GTD) of MyoVa with Mlph. Motivated by a discrepancy between in vitro and cell culture results regarding necessary interactions for MyoVa to be recruited to the melanosome, the authors used a series of pull-down and pelleting assays experiments to identify an additional interaction that occurs between exon G of MyoVa and Mlph. This interaction is independent of and synergistic with the interaction of Mlph with exon F. However, the interaction of the actin-binding domain of Mlph can occur either with exon G or with the actin filament, but not both simultaneously. These data lead to a modified recruitment model where both exon F and exon G enhance the binding of Mlph to auto-inhibited MyoVa, and then via an unidentified switch (PKA?) the actin-binding domain of Mlph dissociates from MyoVa and interacts with the actin filament to enhance MyoVa processivity.

      The only weakness noted is that the authors could have had a more complete story if they pursued whether PKA phosphorylation/dephosphorylation of Mlph is indeed the switch for the actin-binding domain of Mlph to interact with exon G versus the actin filament.

      We thank Reviewer #1 for the careful reading of the manuscript and appreciation of the study. We agree with the Reviewer that it is important to understand how the actin-binding domain of Mlph switches its interaction between the exon-G of Myo5a and the actin filament. We would like to pursue this direction in our future research.

      Reviewer #2 (Public Review):

      The authors identify a third component in the interaction between myosin Va and melanophilin- an interaction between a 32-residue sequence encoded by exon-g in myosin Va and melanophilin's actin-binding domain. This interaction has implications for how melanosome motility may be regulated.

      While this work is largely well done and certainly publishable following needed revisions (e.g. some affinity measurements, necessary controls for the dominant negative experiments), I believe that additional work would be required to make a more compelling case. First, the study provides just one more piece to a well-developed story (the role of exon-F and the GTD in myosin Va: melanophilin (Mlph) interaction), much of which was published 20 years ago by several labs. Second, the study does not demonstrate a physiological significance for their findings other than that exon-G plays an auxiliary role in the binding of myosin Va to Mlph. For example, what dictates the choice between Mlph's actin binding domain (ABD) binding to actin or to exon-G? Is it a PTM or local actin concentration? It is unlikely to be alternative splicing as exon-G is present in all spliced isoforms of myosin Va. And what changes re melanosome dynamics in cells between these two alternatives? Similarly, the paper does not provide any in vitro evidence that binding to exon-G instead of actin affects the processivity of a Rab27a/Myosin Va/Mlph transport complex. For example, if the ABD sticks to exon-G instead of actin, does that block Mlph's ability to promote processivity through its interaction with the actin filament during transport? In summary, given that the authors did not directly test their model either in vitro or in cells, I do not think this story represents a significant conceptual advance.

      We thank Reviewer #2 for the careful reading of the manuscript and the suggestions for improving it. As suggested by the reviewer, we have measured the affinity between the middle tail domain of Myo5a (Myo5a-MTD) and Mlph-ABD (Kd ~0.562 uM), which is similar to that between the globular tail domain of Myo5a (Myo5a-GTD) and the GTBM of Mlph. In addition, we have performed additional experiments showing the integrity and the expression level of the dominant negative constructs in the transfected cells.

      We believe more extensive experiments are required to address the other questions raised by the reviewer. For example, what dictates the choice between Mlph's actin binding domain (ABD) binding to actin or to exon-G is an open question. As we proposed, phosphorylation by protein kinase A is only one possible mechanism. We would like to pursue these questions in our future research.

      Recommendations for the authors:

      The reviewing editor feels strongly that addressing some of the points raised by the reviewers would make this a more compelling manuscript. In particular, a measurement of the affinity of the relevant fragments from melanophilin and myosin-5a would indicate that the interaction might be physiologically relevant. Concerning the dominant negative experiments, the lack of effect of an expressed fragment could be that the expressed fragments were simply degraded or expressed at too low of a level to be competing. The reviewer gives guidelines on how to address this. Reviewer #2 made a point that it would be compelling if the effect of phosphorylation as suggested in the model was tested, but we all agree that this could well be the subject of a later study. In addition, the authors make a very interesting proposal for how protein kinase A could be involved in this regulation as has been suggested previously. Perhaps the use of phosphomimetic mutations could give some insight into this. Such experiments, if consistent with the proposed model would certainly raise the impact of this study. Finally, a very clear periodicity in hydrophobic amino acids is apparent in the interacting sequences of both Myo5 (yrisLykrMidLmeqLekqdktVrkLkkqLkvFakkIgeLevgqmen) and Mlph (tdeeLseMedrVamtAseVqqAeseIsdIesrIaaLra). This strongly suggests a leucine-zipper-like coiled coil, rather than an interaction mediated solely by charge. Recent and easily accessible software such as AlphaFold multimer might yield important structural insight into the binding configuration and might help rationalize the effect of the mutations herein.
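      The periodicity noted above can be checked with a short sketch. It relies on one assumption about the notation: that the capital letters in the two quoted sequences mark the hydrophobic positions the editor is referring to. In a heptad repeat, hydrophobic residues at the a and d positions are spaced alternately 3 and 4 residues apart, so the script simply lists the spacing between consecutive capitalized residues.

```python
def uppercase_positions(seq):
    """Indices of the capitalized (highlighted) residues in the sequence."""
    return [i for i, ch in enumerate(seq) if ch.isupper()]


def spacings(positions):
    """Distances between consecutive highlighted residues."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Sequences copied verbatim from the editor's comment
MYO5 = "yrisLykrMidLmeqLekqdktVrkLkkqLkvFakkIgeLevgqmen"
MLPH = "tdeeLseMedrVamtAseVqqAeseIsdIesrIaaLra"

for name, seq in (("Myo5", MYO5), ("Mlph", MLPH)):
    gaps = spacings(uppercase_positions(seq))
    print(name, gaps)
```

      Spacings drawn only from 3 and 4 (with an occasional 7, where one position in the repeat is not highlighted) are what a heptad-like pattern would produce. A pattern check of this kind is suggestive only and does not replace a structural prediction.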

      We thank the editors and the reviewers for their suggestions for improving the manuscript. We have performed several essential experiments to address the concerns raised by the reviewers.

      (1) Regarding the affinity of the relevant fragments from melanophilin and myosin-5a. We have measured the affinity between Mlph-ABD and Myo5a-MTD using MST (Kd ~562 nM) (see revised Figure 3A).

      (2) Regarding the concerns on the dominant negative experiments. We have examined the molecular sizes and expression levels of the Mlph and Myo5a constructs by Western blots. First, we show that all constructs have the correct molecular size in transfected cells (see revised Figure 6C and 7D), indicating that the inability of Myo5a or Mlph truncations to generate dilute-like phenotypes was not due to intracellular degradation of the EGFP fusion protein. Second, by correcting for the percentage of transfected cells, we show that the overall expression levels of the wild-type construct and the mutants are roughly equal. Third, we categorized the expression levels into high and low, and calculated the percentage of cells with the DN phenotype at high and low expression levels. The results are consistent with the percentage of the DN phenotype among all cells expressing the EGFP fusion protein.

      (3) Regarding the suggestion to investigate the effect of phosphorylation by protein kinase A on Mlph-ABD’s interaction with Myo5a and actin filament. We understand that it is important to elucidate the mechanism by which the actin-binding domain of Mlph switches its interaction between the exon-G of Myo5a and the actin filament. However, as we proposed, phosphorylation by protein kinase A is one possible mechanism, and more extensive experiments are required to address this question. Therefore, we would like to pursue it in our future research.

      (4) Regarding the suggestion to predict the interaction between the exon-G of myosin-5a and Mlph-ABD using AlphaFold. We have used AlphaFold multimer to predict the Myo5a-MTD/Mlph-ABD interaction. Remarkably, AlphaFold predicted that the binding of Myo5a-MTD to Mlph-ABD is mediated by an antiparallel coiled-coil formed by Myo5a (1430-1467) and Mlph (450-481), just as predicted by the editors. This prediction is also consistent with our finding that the exon-G of Myo5a interacts with Mlph-ABD. However, the predicted model cannot explain our mutagenesis results. We will pursue this point in future research. Nevertheless, we are grateful to the editors for bringing this idea to our attention, because it will help us to design experiments to investigate the nature of the Myo5a-exon-G/Mlph-ABD interaction.

      Reviewer #1 (Recommendations For The Authors):

      Specific minor comments

      Q1: In figs 6-7 an overlay between DAPI and EGFP would be helpful for the reader to see perinuclear distribution.

      As suggested, we have added the merged images of DAPI and EGFP in the revised Figure 6 and 7.

      Q2: The delta symbol in the pdf text was corrupted.

      The corrupted delta symbol has been fixed in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      Q1: Please explain in detail early in the text what exon-G is - length, position in the tail, and evidence that it is a coiled coil (CC). Of note, is it only long enough for about 4 heptad repeats? Has it been shown biochemically to form a CC? Is the CC irreversible? What would be the consequence of removing the exon-G CC on the ability of surrounding regions to bind Mlph (exon-F and the GTD)?

      We thank the reviewer for this suggestion. In the revision, we added a new paragraph (the first paragraph in the results section) and revised Figure 1A to introduce the middle tail domain and alternatively spliced exons of Myo5a.

      Exon-G is 32 amino acids in length, located at the C-terminal region of the middle tail domain, immediately before the globular tail domain. Exon-G region was predicted to form a short coiled-coil by using on-line tools (such as paircoil), and this prediction has not been tested biochemically. Moreover, we do not know whether the exon-G coiled-coil is reversible or not.

      We have not examined the effect of removing the whole exon-G on the interaction between the GTD and Mlph-GTBM. The exon-G (residues 1436-1467) and the GTD core (residues 1498-1877) are separated by a long loop of 31 residues. We therefore expect that the removing the exon-G will not affect the GTD/Mlph-GTBM interaction.

      Physically, exon-F is immediately followed by exon-G, and those two regions might interfere with each other. In our preliminary study, we found that removing the whole exon-G abolished the interaction between exon-F and Mlph-EFBD. On the other hand, removing the C-terminal half (residues 1454-1467) of exon-G had little effect on the interaction between exon-F and Mlph-EFBD (see Figure 2C). In this work, we intentionally selected the latter construct for functional analysis of the exon-G/Mlph-ABD interaction, because removing the C-terminal half of exon-G abolishes the interaction with Mlph-ABD but does not affect the exon-F/Mlph-EFBD interaction.

      Q2: Figures 1-3. While the pulldown experiments demonstrating an interaction between Mlph-ABD residues 446-571 and Myo5a-MTD are a good start, one would like to see affinity measurements to gauge the likelihood that this interaction is physiologically relevant. The same goes for the pulldown experiments demonstrating an interaction between (i) the C-terminal half of exon-G (residues 1453-1467) and the Mlph-ABD, (ii) between residues 1411-1467 (a short peptide containing exon-F and exon-G) and the Mlph-ABD, and (iii) between residues 1436-1467 (a short peptide containing exon-G) and the Mlph-ABD. This would also apply to the pulldowns in 3C-3E where versions of the proteins with charge residue changes were tested.

      We agree with the reviewer that determining the affinities between Mlph-ABD and Myo5a-MTD and their variants would help in understanding the physiological relevance of the exon-G/Mlph-ABD interaction. However, the extensive experiments suggested by the reviewer require many high-quality purified proteins, which are not trivial to obtain.

      Nevertheless, we think it is important to know the affinity between Myo5a-MTD and Mlph-ABD (both wild-type), as this parameter can be used to compare the three interactions between Myo5a and Mlph. Therefore, we measured the affinity between Myo5a-MTD and Mlph-ABD using microscale thermophoresis (MST). The dissociation constant (Kd) of Myo5a-MTD for Mlph-ABD is 0.562 ± 0.169 µM, which is similar to that between Myo5a-GTD and Mlph-GTBM (~1 µM) (Geething & Spudich (2007) JBC 282:21518). Consistent with the GST pulldown results, MST shows that deletion of the C-terminal half of exon-G (1453-1467) greatly decreases the MST signals (see revised Figure 3A).
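
      To put a Kd of this size in context, the fraction of labeled protein bound at a given partner concentration can be estimated from a 1:1 binding model. The following is a minimal sketch, not the analysis used for the MST data: it assumes 1:1 stoichiometry, and the labeled-target concentration (50 nM), the ligand concentrations, and the function name are hypothetical placeholders chosen for illustration.

```python
import math


def fraction_bound(ligand_total_um, target_total_um, kd_um):
    """Fraction of target in complex for a 1:1 binding model.

    Solves the quadratic mass-action equation for the complex concentration,
    so it remains valid when the target concentration is not negligible.
    All concentrations are in micromolar.
    """
    if target_total_um <= 0 or kd_um <= 0 or ligand_total_um < 0:
        raise ValueError("concentrations must be positive (ligand may be zero)")
    s = ligand_total_um + target_total_um + kd_um
    complex_um = (s - math.sqrt(s * s - 4.0 * ligand_total_um * target_total_um)) / 2.0
    return complex_um / target_total_um


KD_UM = 0.562      # Kd reported in the response above
TARGET_UM = 0.05   # hypothetical labeled-target concentration (assumption)

for ligand_um in (0.05, 0.5, 5.0, 50.0):  # hypothetical titration points
    bound = fraction_bound(ligand_um, TARGET_UM, KD_UM)
    print(f"{ligand_um:6.2f} uM ligand: fraction bound = {bound:.3f}")
```

      Under these assumptions, roughly half of the labeled protein is bound when the partner concentration is close to the Kd, which is the usual way such a value is read.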

      Q3: While the dominant negative (DN) approach to testing functional significance is OK, rescuing dilute/myosin Va null melanocytes with full-length myosin Va containing the various deletions would have been more convincing. Also, the authors must show (i) that the DN constructs are the correct size in transfected cells (i.e. are not degraded), and (ii) that they are expressed at roughly equal levels (either by doing Westerns and correcting for the percent of transfected cells, or by measuring total cellular fluorescence in transfected cells). Without this information, it remains possible that constructs not exhibiting a DN effect are simply degraded or poorly expressed. This applies to all the DN data in Figures 6 and 7.

      We agree with the reviewer that Myo5a null melanocytes are ideal for investigating exon-G function. Unfortunately, we do not have Myo5a null melanocytes derived from dilute mice.

      To confirm the integrity of the overexpressed proteins in the transfected cells, we performed Western blots of those proteins, including EGFP-Mlph-RBD (wild-type and two mutants) and Myo5a-Tail (wild-type and G mutant), in the lysates of the transfected cells. The Western blots show that all those proteins have the correct molecular masses, indicating no degradation of the overexpressed proteins (see revised Figures 6C and 7C). Moreover, by correcting for the percentage of transfected cells, we show that the overall expression levels per transfected cell of the wild-type construct and the mutants are roughly equal. This information is included in the revised manuscript (Lines 222-225 and 237-241).
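
      The correction for transfection efficiency mentioned above amounts to dividing the band intensity measured in the whole lysate by the fraction of cells that were transfected. A minimal sketch of that arithmetic follows; the function name and the numbers are hypothetical and are not values from the manuscript.

```python
def per_transfected_cell_expression(band_intensity, percent_transfected):
    """Scale a whole-lysate band intensity to the transfected cells only.

    band_intensity: densitometry signal from the Western blot (arbitrary units).
    percent_transfected: percentage of cells expressing the construct (0-100].
    """
    if not 0 < percent_transfected <= 100:
        raise ValueError("percent_transfected must be in (0, 100]")
    return band_intensity / (percent_transfected / 100.0)


# Hypothetical example: two constructs with different transfection efficiencies
# but the same expression level per transfected cell.
print(per_transfected_cell_expression(50.0, 25.0))
print(per_transfected_cell_expression(100.0, 50.0))
```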

      Q4: The authors scored the DN phenotype as yes/no, but it most likely varies depending on the degree of over-expression. Showing that the degree of melanosome centralization scales with the degree of overexpression, and that the correlation between expression level and phenotype varies depending on the construct, would strengthen the results.

      We agree with the reviewer’s prediction that the degree of the DN phenotype should depend on the over-expression level. We analyzed the EGFP signals of transfected cells and found very few cells with a medium expression level. Therefore, we simply categorized the expression levels as high or low, and calculated the DN phenotype in each category as shown in the table below. These results are consistent with the expectation that the degree of the DN phenotype depends on the over-expression level of the transfected constructs.
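
      The categorization described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the table: the intensity threshold separating "high" from "low" expression, the function name, and the example cells are all hypothetical.

```python
from collections import defaultdict


def percent_aggregated_by_level(cells, threshold):
    """Percentage of cells with perinuclear melanosome aggregation per expression level.

    cells: iterable of (egfp_intensity, has_perinuclear_aggregation) pairs.
    threshold: EGFP intensity at or above which a cell counts as "high" (assumption).
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [aggregated, total]
    for intensity, aggregated in cells:
        level = "high" if intensity >= threshold else "low"
        counts[level][0] += int(aggregated)
        counts[level][1] += 1
    return {level: 100.0 * agg / total for level, (agg, total) in counts.items()}


# Hypothetical cells: (EGFP intensity in arbitrary units, aggregation scored yes/no)
example_cells = [(10, False), (12, True), (90, True), (95, True)]
print(percent_aggregated_by_level(example_cells, threshold=50))
```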

      Author response table 1.

      Percentage of the EGFP-expressing cells with perinuclear aggregation of melanosomes

      Q5: The conclusion from the data in Figure 8A- "the presence of both exon-F and exon-G is insufficient for binding to the Mlph occupied by Myo5a, but sufficient for binding to the unoccupied Mlph"- should be verified by also doing the experiment in myosin Va knockdown cells.

      We agree. Unfortunately, our RNAi knockdown of Myo5a in melanocytes is not ideal, and we do not have Myo5a knockout melanocytes. We will pursue this point in the future.

      Q6: Line 213 "three Mlph-binding regions, i.e., exon-F, exon-F, and GTD (Figure 7A)" has a typo.

      This typo has been corrected.

      Q7: The authors should provide high mag insets for the images in Figure 8.

      As suggested, we have revised Figure 8 by including high mag insets for the images.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      In this manuscript by Napoli et al, the authors study the intracellular function of cytosolic S100A8/A9, a myeloid cell soluble protein that operates extracellularly as an alarmin and whose intracellular function is not well characterized. Here, the authors utilize state-of-the-art intravital microscopy to demonstrate that adhesion defects observed in cells lacking S100A8/A9 (Mrp14-/-) are not rescued by exogenous S100A8/A9, thus highlighting an intrinsic defect. Based on this result, subsequent efforts were employed to characterize the nature of those adhesion defects.

      The authors thank reviewer #1 for his/her insightful comments and suggestions. Please find our point-by-point responses below.

      (1) Ex vivo characterization of the function of S100A8/A9 in adhesion, spreading, and calcium signaling requires at least one rescue experiment to support the direct role of these proteins in the biological processes under study.

      We thank the reviewer for this comment. We agree that rescue experiments would be helpful to confirm the direct role of intracellular S100A8/A9 in adhesion, spreading, and Ca2+ signaling. Although transfection of primary cells, especially neutrophils, poses challenges due to their short half-life, we now have undertaken additional in vitro rescue experiments. Specifically, we used extracellular S100A8/A9 and coated Ibidi flow chambers with E-selectin, ICAM-1 and CXCL1 alone or alongside S100A8/A9, and measured rolling and adhesion of blood neutrophils. Our data reveal that extracellular S100A8/A9 can induce increased adhesion in WT neutrophils but fails to rescue the adhesion defect in Mrp14-/- neutrophils (Author response image 1). This result corroborates our in vivo findings, emphasizing that the observed adhesion defect is due to the lack of intracellular S100A8/A9.

      Author response image 1.

      Extracellular S100A8/A9 does not rescue the adhesion defect in Mrp14-/- neutrophils. Analysis of the number of adherent leukocytes FOV⁻¹ normalized to the WBC of WT and Mrp14-/- mice. Whole blood was harvested through a carotid artery catheter and perfused with a high-precision pump at a constant shear rate using flow chambers coated with either E-selectin, ICAM-1 and CXCL1 or E-selectin, ICAM-1, CXCL1 and S100A8/A9. [mean+SEM, n=5 mice per group, 12 (WT) and 14 (Mrp14-/-) flow chambers, 2way ANOVA, Sidak’s multiple comparison]. ns, not significant; *p≤0.05, **p≤0.01, ***p≤0.001.

      (2) There is room for improvement in the analysis of signaling pathways presented in Figures 3 H and I. Western blots and analyses are not convincing, in particular for p-Pax.

      We acknowledge the reviewer's concern regarding the clarity of the signaling pathway analysis, particularly the western blots for p-Paxillin. To address this, we have repeated the western blot experiments using murine neutrophils. Our new data confirm the defective paxillin phosphorylation upon CXCL1 stimulation and ICAM-1 binding in the absence of cytosolic S100A8/A9. We have now integrated these new findings with the original data and included the updated results in the manuscript (Figure 3I revised). These enhanced analyses provide a more robust and convincing demonstration of the signaling defects in Mrp14-/- neutrophils.

      (3) At least one western blot showing a knockdown of S100A8/A9 should be included towards the beginning of the result section.

      We appreciate the reviewer's suggestion to include a western blot demonstrating the knockout of S100A8/A9 early in the results section. In a recent publication by our group, we have already demonstrated the absence of S100A8/A9 at the protein level in Mrp14-/- neutrophils via western blotting ([1], please refer to Extended Data Fig. 1h). We agree that visual confirmation of the absence of S100A8/A9 protein is crucial for establishing the validity of our study.

      (4) The Ca2+ measurements at LFA-1 nanoclusters using the Mrp14-/- Lyz2xGCaMP5 are interesting; it is understood that the authors are correcting calcium levels by normalizing by LFA-1 cluster areas and that seems fine to me. The issue is that the total calcium signal seems decreased in Mrp14-/- cells compared to WT cells (Fig. 4E). Why is total Ca2+ low? Please discuss.

      We thank the reviewer for this insightful comment. Indeed, our observations reveal reduced overall Ca2+ levels in Mrp14-/- neutrophils compared to WT neutrophils. Initially, we noticed a general decrease in Ca2+ intensity (Author response image 2A-B) and lifetime in Mrp14-/- neutrophils (Author response image 2C-D). Further analysis indicated that these differences in Ca2+ levels are localized specifically to the LFA-1 nanocluster sites. In contrast, the cytosolic Ca2+ levels outside of the LFA-1 nanocluster areas were comparable between Mrp14-/- and WT neutrophils (Figure 4H-J). This suggests that the reduced total Ca2+ levels observed in Mrp14-/- neutrophils are primarily due to the impaired Ca2+ supply at the LFA-1 nanocluster areas. Our data support the notion that cytosolic S100A8/A9 plays a crucial role in actively supplying Ca2+ to LFA-1 nanoclusters during neutrophil crawling. In the absence of S100A8/A9, the increase in overall Ca2+ levels (summing both inside and outside LFA-1 nanocluster areas) is minimal, further highlighting the specific role of S100A8/A9 in maintaining localized Ca2+ concentrations at these crucial sites.

      Author response image 2.

      Overall Ca2+ levels in WT and Mrp14-/- neutrophils. (A) Representative confocal images of neutrophils from WT Lyz2xGCaMP5 and Mrp14-/- Lyz2xGCaMP5 mice, labeled with the Lyz2 tdTomato marker. The images illustrate overall cytosolic Ca2+ levels during neutrophil crawling on flow chambers coated with E-selectin, ICAM-1, and CXCL1 (scale bar=10μm). (B) Quantitative analysis of total cytosolic Ca2+ intensity in single cells from WT Lyz2xGCaMP5 and Mrp14-/- Lyz2xGCaMP5 neutrophils measured over three time intervals: min 0-1, 5-6 and 9-10 [mean+SEM, n=5 mice per group, 56 (WT) and 54 (Mrp14-/-) neutrophils, 2way ANOVA, Sidak’s multiple comparison]. (C) Representative traces and (D) single cell analysis of total Ca2+ lifetime over the first 5 minutes in WT Lyz2xGCaMP5 and Mrp14-/- Lyz2xGCaMP5 neutrophils crawling on E-selectin, ICAM-1, and CXCL1 coated flow chambers recorded with FLIM microscopy [mean+SEM, n=3 mice per group, 111 (WT) and 95 (Mrp14-/-) neutrophils, 2way ANOVA, Sidak’s multiple comparison]. ns, not significant; *p≤0.05, **p≤0.01, ***p≤0.001.

      (5) Even if the calcium level outside LFA-1 nanoclusters is not significant (Figure 4J), the data at min 9-10 in Figure 4J seems to be affected by a single event that may be an outlier. Additional data may be needed here.

      We appreciate the reviewer’s attention to this detail. To address the concern regarding a potential outlier in the Ca2+ level measurements at 9-10 minutes in Figure 4J, we rigorously tested the dataset using the GraphPad outlier calculator. The analysis revealed that no data point was statistically identified as an outlier. Given that the current dataset is robust and the statistical analysis confirms the integrity of the data, we believe that the results accurately reflect the biological variability observed in our experiments. Therefore, we have not added additional data points at this stage but remain open to discussing this further.
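
      For readers who wish to reproduce this kind of check, the sketch below computes the Grubbs test statistic, G = max|x_i - mean| / SD. It assumes that the outlier calculator applies Grubbs' test, which this response does not state; the function name and the example values are hypothetical and are not data from this study. The critical value for the chosen sample size and significance level must be taken from a published table and is deliberately not hard-coded here.

```python
import statistics


def grubbs_statistic(values):
    """Return (G, index) for the value farthest from the sample mean.

    G = |x_extreme - mean| / sample standard deviation.
    The point is flagged as an outlier only if G exceeds the tabulated
    critical value for len(values) and the chosen alpha.
    """
    if len(values) < 3:
        raise ValueError("Grubbs' test needs at least three values")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all values are identical")
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / sd, idx


# Hypothetical measurements with one visually extreme point
g, idx = grubbs_statistic([1.0, 2.0, 3.0, 4.0, 100.0])
print(f"G = {g:.3f} for the value at index {idx}")
```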

      (6) Finally, even though there is less calcium at LFA-1 clusters, that does not necessarily mean that "cytosolic S100A8/A9 plays an important role in Ca2+ "supply" at LFA-1 adhesion spots" as proposed. S100A8/A9 may play an indirect role in calcium availability. The analysis of the subcellular localization of S100A8/A9 at LFA-1 clusters together with calcium dynamics in stimulated WT cells would help support the authors' interpretation, which although possibly correct, seems speculative at this point.

      We thank the reviewer for this insightful comment and fully agree that additional evidence regarding the subcellular localization of S100A8/A9 would strengthen our conclusions. Although live cell imaging of intracellular S100A8/A9 was initially challenging due to technical limitations, we have now performed additional experiments to address this issue. We conducted end-point measurements where we allowed WT neutrophils to crawl on E-selectin, ICAM-1, and CXCL1 coated flow chambers for 10 minutes. Following this, we fixed and permeabilized the cells to stain intracellular S100A9, along with LFA-1 and a cell tracker for segmentation. Confocal microscopy and subsequent single-cell analysis revealed a significant enrichment of S100A8/A9 at LFA-1 positive nanocluster areas compared to the surrounding cytosol (Figure 4K and 4L, new). This finding supports our hypothesis that S100A8/A9 plays a direct role in the localized supply of Ca2+ at LFA-1 adhesion spots, thus facilitating efficient neutrophil crawling under shear stress. These new data have been included in the revised manuscript, providing stronger evidence for our proposed mechanism.

      Reviewer #2:

      Napoli et al. provide a compelling study showing the importance of cytosolic S100A8/9 in maintaining calcium levels at LFA-1 nanoclusters at the cell membrane, thus allowing the successful crawling and adherence of neutrophils under shear stress. The authors show that cytosolic S100A8/9 is responsible for retaining stable and high concentrations of calcium specifically at LFA-1 nanoclusters upon binding to ICAM-1, and imply that this process aids in facilitating actin polymerisation involved in cell shape and adherence. The authors show early on that S100A8/9 deficient neutrophils fail to extravasate successfully into the tissue, thus suggesting that targeting cytosolic S100A8/9 could be useful in settings of autoimmunity/acute inflammation where neutrophil-induced collateral damage is unwanted.

      The authors appreciate reviewer #2's insightful comments and suggestions. Below are our detailed responses:

      (1) Extravasation is shown to be a major defect of Mrp14-/- neutrophils, but the Giemsa staining in Figure 1H seems to be quite unspecific to me, as neutrophils were determined by nuclear shape and granularity. It would have perhaps been more clear to use immunofluorescence staining for neutrophils instead as seen in Supplementary Figure 1A (staining for Ly6G or other markers instead of S100A9).

      We acknowledge the reviewer's concern. However, Giemsa staining is a well-established method in hematology, histology, cytology, and bacteriology, widely recognized for its ability to distinguish leukocyte subsets based on nuclear shape and cytoplasmic characteristics. This method is extensively documented in the literature [2-5]. Its advantages are the easy morphological discrimination of leukocytes based on nuclear and cytoplasmic shape and conformation (Author response image 3).

      Author response image 3.

      Giemsa staining of extravasated leukocyte subsets. (A) Representative image of Giemsa-stained cremaster muscle tissue post-TNF stimulation. The image clearly differentiates leukocyte subsets (white arrow = neutrophils, yellow arrow = eosinophils, red arrow = monocytes). Scale bar = 50µm.

      (2) The representative image for Mrp14-/- neutrophils used in Figure 4K to demonstrate Ripley's K function seems to be very different from that shown above in Figures 4C and 4F.

      The reviewer correctly observed that the cell in Figure 4K is different from those in Figures 4C and 4F. This is intentional, as Figure 4K is meant to show a representative image that accurately reflects the overall results of the experiments. We assure the reviewer that all cells analyzed in Figures 4C and 4F were also included in the analysis for Figure 4K.
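
      For reference, Ripley's K function summarizes clustering by counting how many other points lie within a radius r of each point, scaled by the observation area. The sketch below is a naive estimator without edge correction, given only to illustrate the quantity; it is not the implementation used for Figure 4K, and the function name, coordinates, and area are hypothetical.

```python
import math


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate (no edge correction).

    K(r) = area / (n * (n - 1)) * number of ordered pairs (i, j), i != j,
    whose distance is at most r. For complete spatial randomness K(r) is
    approximately pi * r**2; larger values indicate clustering.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                pairs += 1
    return area * pairs / (n * (n - 1))


# Hypothetical cluster positions (arbitrary units) in a 10 x 10 field
example_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
k = ripley_k(example_points, radius=1.5, area=100.0)
print(f"K(1.5) = {k:.2f}; CSR expectation = {math.pi * 1.5 ** 2:.2f}")
```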

      (3) Although the authors have done well to draw a path linking cytosolic S100A8/9 to actin polymerisation and subsequently the arrest and adherence of neutrophils in vitro, the authors can be more explicit with the analysis - for example, is the F-actin co-localized with the LFA-1 nanoclusters? Does S100A8/9 localise to the membrane with LFA-1 upon stimulation? Lastly, I think it would have been very useful to close the loop on the extravasation observation with some in vitro evidence to show that neutrophils fail to extravasate under shear stress.

      We thank the reviewer for this comment and questions. 

      Concerning the co-localization of F-actin with LFA-1 nanoclusters and S100A8/9 localization: We appreciate the reviewer's interest in the co-localization between F-actin and LFA-1. Unfortunately, due to the limitations of our GCaMP5 mouse model (with neutrophils labeled with td-Tomato and eGFP for LyzM and Ca2+), we could only stain for either LFA-1 or F-actin at a time. However, in our F-actin movies, we observed that F-actin predominantly localizes at the rear of the cell, while LFA-1 is more uniformly distributed at the plasma membrane.

      Regarding S100A8/A9 localization, as mentioned in response to Reviewer 1's sixth point, we now conducted endpoint measurements. We stained neutrophils with cell tracker green CMFDA and LFA-1, allowed them to crawl on E-selectin, ICAM-1, and CXCL1-coated flow chambers, and then performed intracellular S100A9 staining after fixation and permeabilization. Our analysis shows higher S100A9 intensity at LFA-1 positive areas compared to LFA-1 negative areas (Figure 4K and 4L, new). This indicates that S100A8/A9 indeed concentrates Ca2+ at LFA-1 nanoclusters, supporting adhesion and post-arrest modification events under flow.

      Regarding the extravasation defect under shear stress: To address the reviewer's suggestion, we performed transwell migration assays under static conditions. Our results show no significant difference in transmigration between WT and Mrp14-/- neutrophils without flow, indicating that the extravasation defect in Mrp14-/- neutrophils is shear-dependent. This supports our hypothesis that S100A8/A9-mediated Ca2+ supply at LFA-1 nanoclusters is critical under flow conditions (Author response image 4).

      Author response image 4.

      Static transmigration assay. (A) Transmigration of WT and Mrp14-/- neutrophils in static transwell assays (3 µm pore size, 45 min migration time) showing spontaneous migration (PBS) or migration towards CXCL1. [mean+SEM, n=3 mice per group, 2way ANOVA, Sidak’s multiple comparison]. ns, not significant; *p≤0.05, **p≤0.01, ***p≤0.001.

      Additional References

      (1) Pruenster, M., et al., E-selectin-mediated rapid NLRP3 inflammasome activation regulates S100A8/S100A9 release from neutrophils via transient gasdermin D pore formation. Nature Immunology, 2023. 24(12): p. 2021-2031.

      (2) Kuwano, Y., et al., Rolling on E- or P-selectin induces the extended but not high-affinity conformation of LFA-1 in neutrophils. Blood, 2010. 116(4): p. 617-24.

      (3) Porse, B., Mouse Hematology – A Laboratory Manual. European Journal of Haematology, 2010. 84(6): p. 554-554.

      (4) Frommhold, D., et al., Protein C concentrate controls leukocyte recruitment during inflammation and improves survival during endotoxemia after efficient in vivo activation. Am J Pathol, 2011. 179(5): p. 2637-50.

      (5) Braach, N., et al., RAGE Controls Activation and Anti-Inflammatory Signalling of Protein C. PLOS ONE, 2014. 9(2): p. e89422.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In the presented study, the authors aim to explore the role of nociceptors in the fine particulate matter (FPM) mediated Asthma phenotype, using rodent models of allergic airway inflammation. This manuscript builds on previous studies and identifies transcriptomic reprogramming and an increased sensitivity of the jugular nodose complex (JNC) neurons, one of the major sensory ganglia for the airways, on exposure to FPM along with Ova during the challenge phase. The authors then use QX-314, a selectively permeable form of lidocaine, and TRPV1 knockouts to demonstrate that nociceptor blocking can reduce airway inflammation in their experimental setup. The authors further identify the presence of Gfra3 on the JNC neurons, a receptor for the protein Artemin, and demonstrate their sensitivity to Artemin as a ligand. They further show that alveolar macrophages release Artemin on exposure to FPM.

      We thank the reviewer for their valuable comments, which have significantly enhanced the quality of our manuscript. A point-by-point rebuttal is provided below.

      Strength

      The study builds on results available from multiple previous work and presents important results which allow insights into the mixed phenotypes of Asthma seen clinically. In addition, by identifying the role of nociceptors, they identify potential therapeutic targets which bear high translational potential.

      Weakness

      While the results presented in the study are highly relevant, there is a need for further mechanistic dissection to allow better inferences. Currently certain results seem associative. Also, certain visualisations and experimental protocols presented in the manuscript need careful assessment and interpretation. While Asthma is a chronic disease, the presented results are particularly important to explore Asthma exacerbations in response to acute exposure to air pollutants. This is relevant in today's age of increasing air pollution and increasing global travel.

      Major

      The JNC is a major group of neurons responsible for receiving sensory inputs from the airways. However, the DRG also contains nociceptors and is known to receive afference from the upper airways. An explanation of why the study was restricted to the JNC would be important.

      We acknowledge that some afferents to the upper airways do arise from the DRG, specifically in the upper thoracic segments (T1–T5). We have added a statement in the text to note this subset of nociceptive and spinally mediated pathways. However, the preponderance of evidence indicates that the majority of airway and lung afferents (70–80%, sometimes up to 90%) originate from the jugular–nodose complex (JNC). Given this large imbalance, and because our study focuses on the mechanosensory and chemosensory functions mediated primarily by the JNC, we restricted our analysis to this main vagal pathway. By contrast, DRG innervation, though functionally important for nociception and irritation-related reflexes, accounts for a smaller yet significant (~20–30%) fraction of the total afferent pool. The referenced tracing studies[1,2] support this distribution and are cited to clarify our rationale for emphasizing the JNC in our work.

      Similarly, the role of the Artemin in the study remains associative. The study results present that Artemin sensitize nociceptors to lead to an increased inflammatory response (Supplementary Figure 2), however, both upstream and downstream evidence for this inference needs to be dissected further. For instance, the evidence for the role of Artemin in the model comes from ex vivo experiments with alveolar macrophages, but not in the experimental model created. Blocking or activation experiments could be performed, along with investigating the change in the total number of nociceptors with Artemin exposure. Similarly, the downstream effects of the potential Artemin-mediated JNC stimulation should be explored in the context of this experimental setup. A detailed dissection of the mechanisms is important. Additionally, it is also important to discuss the hypothesis leading to the selection of Artemin as a target, which currently seems arbitrary.

      Our data show that i) OVA-FPM-exposed AMs secrete Artemin and that ii) exogenous recombinant Artemin can sensitize nociceptors, potentially heightening the inflammatory response. We agree with the reviewer that more upstream and downstream evidence is needed for definitive mechanistic insight. In response, we have expanded our experiments to include intravital microscopy, which demonstrates impaired motility of alveolar macrophages and neutrophils in nociceptor-ablated mice, suggesting a bidirectional crosstalk between AMs and nociceptor neurons.

      In future studies, we will perform blocking or activation studies to clarify Artemin’s in vivo effects and confirm its role in modulating airway nociceptors. We also recognize the importance of examining whether Artemin exposure alters the phenotype of these neurons and lung innervation density. As recommended, we plan targeted interventions (e.g., Artemin-neutralizing antibodies or overexpression strategies) to delineate the mechanisms by which Artemin-mediated nociceptor stimulation influences the local inflammatory environment.

      We have expanded our discussion to clarify that Artemin is a recognized growth factor known to sensitize certain sensory neurons, including those responsive to tissue injury and inflammation. This literature-based rationale guided our hypothesis that Artemin might increase nociceptor reactivity in the lung and thereby influence alveolar macrophage function. By combining ex vivo and intravital approaches, we have begun to map these interactions but agree that further in vivo studies are necessary to confirm causality, dissect signal transduction pathways, and fully validate Artemin’s contributions to AM–nociceptor crosstalk. We have revised our manuscript accordingly to highlight these limitations.

      A deeper exploration of the inflammatory parameters could be performed. The multiplex analysis of the cytokine analysis shows a reduction in certain cytokines like IL-6 and MCP (figure 3F), which needs to be discussed. Additionally, investigating the change in proportions of the different immune cell populations is important, which currently restricts the eosinophil and neutrophil counts in the BAL. This is also important as the study builds on work from Prof. Chang's group, which also identified the expansion of an invariant iNKT cell population by FPM, regulatory in nature. Adding data on airway hyperresponsiveness, if possible, would be a welcome addition, considering Asthma as the disease context.

      We thank the reviewer for highlighting the need for a more comprehensive exploration of inflammatory parameters. To address these concerns:

      (1) Cytokine Analysis: We re-ran all statistical analyses, including the CBA and ELISA assays, and confirmed that TNFα and Artemin are the only differentially expressed cytokines across experimental groups. We have expanded the Discussion to emphasize TNFα’s role in this context.

      (2) Immune Cell Profiling in BALF: Our data show that co-exposure with FPM exacerbates CD45+ cell, eosinophil, neutrophil, T-cell and monocyte infiltration. Notably, CD45+ cells and neutrophils were the only populations reduced under nociceptor neuron loss-of-function conditions (QX314-treated or TRPV1-DTA mice, Author response image 1).

      Of note, we also confirmed these data using intravital imaging and in a second line of nociceptor-ablated mice (NaV1.8DTA). We are aware of Prof. Chang’s work suggesting expansion of an invariant iNKT cell population, and we plan to examine this population in future work.

      (3) Airway Hyperresponsiveness (AHR): We recognize that adding AHR data would strengthen the asthma-related context. Unfortunately, we are not currently equipped to perform AHR measurements, but we intend to include this in future experiments to provide a more complete assessment of airway function.

      Author response image 1.

      The authors could revisit the data presented in terms of visualization. For instance, the pooled data presented in some of the figures is probably leading to a wide variation which makes interpretation more difficult. Presenting data separately for each experimental replicate might help the reader. This is also important considering the possible variation seen between experiments (for instance, in Figure 3A and 3C and 3B and 3D, the neutrophil and eosinophil panels for the same groups seem to have an almost 2-fold difference.). Similarly, in the cytokine analysis, the authors have used a common axis for depicting all cytokine values which leads to difficulties in interpretation (Figure 3F). Analysis of the RNA seq results and the DEGs could be revisited to include pathway analysis etc (Figure 2), and the supplementary information could include detailed lists of the major target genes.

      To address this query, we have completely reformatted all graphs and included both gene lists and lists of enriched pathways for all three comparisons in Supplementary Table 1. We also confirmed our flow cytometry analysis functionally by performing intravital imaging.

      The authors should also consider citing the previous experimental setup used for some particular protocols. For instance, the use of the specified protocol for OVA in a C57 background needs to be justified, as there are various protocols reported in the literature. Additionally, doses used in some experiments seem arbitrary (The FPM and Artemin exposure in Figure 4). Depicting the dose-response curve or citing previous literature for the same would be important. Similarly, different sample sizes seen in experiments should be explained, whether they are due to mortality, failure to exhibit phenotypes, or due to technical failures. The RNA seq experiment mentions only 2 biological replicates in one of the groups which should be addressed either by increasing the sample size or by replicating the experiment. Moreover, nested comparisons in experiments performed for Figure 1 need to be performed. Neurons isolated from each mouse should be maintained and analysed separately to retain biological replicates to better represent the heterogeneity.

      We appreciate the request for clarity regarding the experimental protocols and sample sizes:

      OVA Model in C57BL/6 Mice: We adapted a previously published OVA protocol in C57BL/6 mice[3-5] (PMID: 39661516), which uses two doses of sensitization to compensate for the lower Th2 response compared to BALB/c[6]. We increased the dose of OVA (100 µg) because our initial experiments produced low eosinophil infiltration. Although this dosage is on the higher side, some studies have noted local IFNγ induction in C57BL/6 mice; however, we did not detect IFNγ in our setup.

      FPM and Artemin Doses: We did not perform a full dose-response assay for FPM and Artemin but used 100 ng/mL as reported in prior literature, where TRPA1 and TRPV1 mRNA were upregulated after 18 hours of incubation[7]. This reference has been added for clarity.

      Sample Sizes and Exclusions: One control mouse was excluded from the RNA-seq experiment because a parallel PCA analysis indicated it was an outlier. This was the only exclusion in the study, and this has been indicated in the Methods section of the article.

      Nested Comparisons and Biological Replicates: We reanalyzed the relevant data with a nested one-way ANOVA and updated the figures accordingly. Measurements from neurons isolated from each mouse were first averaged to preserve biological replicates and capture potential heterogeneity, and the data were then analysed on the per-mouse averages.
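
      The per-mouse averaging described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the group labels, mouse IDs, and response values are invented, and the point is only that neuron-level measurements are collapsed to one value per mouse so that the mouse, not the neuron, is the unit of comparison.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical neuron-level records: (group, mouse_id, response).
# All values are made up for illustration only.
records = [
    ("control", "m1", 0.10), ("control", "m1", 0.30),
    ("control", "m2", 0.20), ("control", "m2", 0.40),
    ("treated", "m3", 0.60), ("treated", "m3", 0.80),
    ("treated", "m4", 0.50), ("treated", "m4", 0.70), ("treated", "m4", 0.90),
]


def per_mouse_means(rows):
    """Average neuron-level values within each mouse, grouped by condition."""
    by_mouse = defaultdict(list)
    for group, mouse, value in rows:
        by_mouse[(group, mouse)].append(value)
    by_group = defaultdict(dict)
    for (group, mouse), values in by_mouse.items():
        by_group[group][mouse] = mean(values)
    return {group: dict(mice) for group, mice in by_group.items()}


mouse_means = per_mouse_means(records)
# Each group now contributes one value per mouse (n = mice, not neurons).
n_per_group = {group: len(mice) for group, mice in mouse_means.items()}
```

      The resulting per-mouse values, rather than the individual neuron values, would then be passed to the group comparison, so that mice with more recorded neurons do not carry extra weight.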

      The manuscript should be more detailed regarding the statistics employed. Currently, there is a section mentioned in the methods section, but details of corrections employed and specific stats for specific experiments should be described. There are also some minor grammatical errors and incomplete sentences in the manuscript which should be corrected. The authors should also consider a more expansive literature review in the introduction/discussion sections.

      We have updated the figure legends and methods to include more detailed information on the specific statistical tests used for each experiment. In addition, we have fixed minor grammatical errors and incomplete sentences throughout the manuscript. Finally, we have expanded our Introduction and Discussion to include additional references and a broader literature context.

      Reviewer #2 (Public review):

      The authors sought to investigate the role of nociceptor neurons in the pathogenesis of pollution-mediated neutrophilic asthma.

      We thank the reviewer for their valuable comments, which have significantly enhanced the quality of our manuscript. A point-by-point rebuttal is provided below.

      Strength

      The authors utilize TRPV1-ablated mice to confirm the effects of intranasally administered QX-314, which was used to block sodium currents. The authors demonstrate that artemin, which is upregulated in alveolar macrophages in response to pollution, sensitizes JNC neurons, thereby increasing their responsiveness to pollution. Ablation or inactivity of nociceptor neurons prevented the pollution-induced increase in inflammation.

      Weakness

      While neutrophilic, the model used doesn't appear to truly recapitulate a Th2/Th17 phenotype. No IL-17A is visible/evident in the BALF within the model (Figure 3F). The relevance of the RNAseq dataset is unclear; none of the identified DEGs were evaluated in the context of mechanism. The authors overall achieved the aim of demonstrating that nociceptor neurons are important to the pathogenesis of pollution-exacerbated asthma. Their results support their conclusions overall, although there are ways the study findings can be strengthened. This work further evaluates how nociceptor neurons contribute to asthma pathogenesis, which is important to consider when proposing treatment strategies for undertreated asthma endotypes.

      Major

      Utilizing a different model, one using house dust mite or Alternaria alternata or similar that is able to induce a true Th2/Th17-type response and that is also more translatable to humans, for confirmation.

      We appreciate the suggestion to use additional allergen models. In a pilot study, we did observe increased Artemin in the BALF of house dust mite–treated mice, although the levels were low under our current dosing schedule (20 µg/dose daily from Day 0–4 and Day 7–9, with sacrifice on Day 10; Author response image 2). Conversely, using an Alternaria alternata model at 100 µg/dose daily from Day 0–2 (sacrificed on Day 3) did not yield a detectable increase in Artemin. We suspect these findings may reflect the specific dose and timing used. We plan to refine our protocols (e.g., longer exposures or higher doses) for HDM and/or Alternaria to better model a Th2/Th17 response and further validate our observations in a setting more translatable to human asthma.

      Author response image 2.

      Additional analysis, maybe pathway analysis on the RNAseq dataset presented in Figure 2. Unclear how these genes are relevant/how they affect functionality. At present it is acceptable to say they are transcriptionally reprogramed, but no protein evaluation is provided which would get more at function, however, the authors do show some functional data in Figure 1, so maybe this could somehow be discussed/related to Figure 2.

      We have expanded our RNA-seq analysis to include gene lists and enriched pathways for all three comparisons in Supplementary Table 1. We have also revised our discussion to align these transcriptomic changes with the functional data shown in Figure 1. While we have not yet performed protein-level validation for all identified genes, the patterns observed in our RNA-seq dataset suggest pathways potentially tied to nociceptor activation and the downstream inflammatory response. We plan to conduct targeted protein analyses in future studies to further substantiate these findings.

      Histology and localization of neutrophils/nociceptor neurons/alveolar macrophages would enhance the study findings.

      We appreciate the reviewer’s suggestion to include histological data showing the distribution of neutrophils, nociceptor neurons, and alveolar macrophages. While we have not yet performed detailed histological staining of these cell types, we have added live in-vivo intravital microscopy data (Figure 4) that illustrate impaired AM and neutrophil motility in nociceptor-ablated mice. We plan to include additional histological analyses in future studies to further localize these cells in the lung tissue.

      Minor:

      The first 3 figures are small and hard to read.

      We have enlarged Figures 1 and 3 in the revised manuscript to improve readability. We have also added the corresponding gene lists and enriched pathways to Supplementary Table 1 for clarity.

      The figures are mislabeled in the text. Figure 2 is discussed twice in two different contexts; the second mention is supposed to be labeled as Figure 2.

      We corrected the mislabeled figures in the text, ensuring that each figure is referenced accurately.

      Figure 4 isn't cited in the text. I think it is supposed to be referenced in the paragraph before the discussion starts and is currently labeled as Figure 1.

      We have updated the text to properly cite Figure 4 in the relevant paragraph before the Discussion begins, rather than labeling it as Figure 1.

      Notating which statistical analysis was used with each figure/subfigure would be beneficial. Also, it's important to notate if the data was analyzed for multiple comparisons.

      We have revised each figure/subfigure legend to specify the statistical tests used, including information on whether corrections for multiple comparisons were applied. This provides a clearer understanding of how each dataset was analyzed.

      Reviewer #3 (Public review):

      Asthma is a complex disease that includes endogenous epithelial, immune, and neural components that respond awkwardly to environmental stimuli. Small airborne particles with diameters in the range of 2.5 micrometers or less, so-called PM2.5, are generally thought to contribute to some forms of asthma. These forms of asthma may have increased numbers of neutrophils and/or eosinophils present in bronchoalveolar lavage fluid and are difficult to treat effectively as they tend to be poorly responsive to steroids. Here, Wang and colleagues build on a recent model that incorporated PM2.5 which was found to have a neutrophilic component. Wang altered the model to provide an extra kick via the incorporation of ovalbumin. Building on their prior expertise linking nociceptors and inflammation, they find that silencing TRPV1-expressing neurons either pharmacologically or genetically, abrogated inflammation and the accumulation of neutrophils. By examining bronchoalveolar lavage fluid, they found not only that levels of the number of cytokines were increased, but also that artemin, a protein that supports neuronal development and function, was elevated, which did not occur in nociceptor-ablated mice. They also found that alveolar macrophages exposed to PM2.5 particles had increased artemin transcription, suggesting a further link between pollutants, and immune and neural interactions.

      We thank the reviewer for their valuable comments, which have significantly enhanced the quality of our manuscript. A point-by-point rebuttal is provided below.

      Weakness

      There are substantial caveats that must be attached to the suggestions by the authors that targeting nociceptors might provide an approach to the treatment of neutrophilic airway inflammation in pollution-driven asthma in general and wildfire-associated respiratory problems in particular.

      These caveats include the uncertainty of the relevance of the conventional source of PM2.5, to pollution and asthma. According to the National Institute of Standards and Technology (NIST), the standard reference material (SRM) 2786 is a mix obtained from an air intake system in the Czech Republic. It is not clear exactly what is in the mix, and a recent bioRxiv preprint, https://www.biorxiv.org/content/10.1101/2023.08.18.553903v3.full.pdf reveals the presence of endotoxin. Care should thus be taken in interpreting data using particulate matter. Regarding wildfires, there is data that indicates that such exposure is toxic to macrophages. What impact might that then have on the production of cytokines, and artemin, in humans?

      We recognize the potential limitations of using SRM2786 (obtained from a Czech air-intake system) as a model for real-world PM2.5 exposure. Our rationale for choosing SRM2786 is that it is commercially available and represents a broad spectrum of ambient air pollutants, in contrast to more specialized sources like diesel exhaust particles. However, we acknowledge in the discussion the presence of endotoxin in SRM2786, as suggested by recent reports, and agree that this may influence immune responses and should be considered when interpreting our data.

      Regarding wildfire-associated exposure, we are aware that certain components of wildfire smoke can be toxic to macrophages. We do not think this plays a significant role in the current study design, as the numbers of AMs, determined by flow cytometry and intravital microscopy, are similar in OVA-exposed and OVA-FPM-exposed mice. Thus, these results rule out significant AM toxicity by FPM.

      Ultimately, while our findings suggest that modulating nociceptor activity may reduce neutrophilic inflammation, we emphasize that additional research, including different PM2.5 sources, validation of endotoxin levels, and in vivo confirmation in human-relevant models, is necessary before drawing definitive conclusions about treating pollution-driven asthma or wildfire-induced respiratory problems.

      The Introductory paragraph implies links between wildfire events, particulate exposure, and neutrophilic asthma. I am not aware of such a link having been established, in which case the paragraph needs revision. In the paragraph that begins with 'Urban pollution', it is suggested that eosinophilic asthma is treatment responsive in comparison to the neutrophilic form. That may not be the case, and these cellular components may often occur together. In much of the manuscript, there is a mismatch between the text and the figure numbers. For example, in the Results, Figure 2 should be Figure 3 some of the time, and Figure 3 is actually Figure 4, while the reference to Figure 1F-H is Figure 4H. Please check carefully.

      (a) Introduction Paragraph and Wildfire–Neutrophilic Asthma Link

      We have added references to the Introduction to support the link between wildfire smoke exposure, respiratory symptoms, and neutrophilic asthma [8-12].

      (b) Distinction Between Eosinophilic and Neutrophilic Asthma

      We recognize that eosinophilic and neutrophilic airway infiltrates can co-occur in the same individual and that treatment responsiveness can vary considerably. Our intention was to note that conventional asthma therapies (e.g., inhaled corticosteroids) are generally more effective for eosinophilic-driven disease than for neutrophilic phenotypes, but we agree that these inflammatory endotypes often overlap in clinical practice. We have revised the text in the “Urban pollution” section to acknowledge this complexity and to clarify that inflammatory cell populations in asthma are not always discrete.

      (c) Figure Numbering and Text–Figure Mismatch

      We sincerely apologize for the confusion caused by mismatched figure labels and references in the Results section. We have carefully reviewed and corrected all figure references throughout the manuscript to ensure accuracy.

      References

      (1) Kim, S. H. et al. Mapping of the Sensory Innervation of the Mouse Lung by Specific Vagal and Dorsal Root Ganglion Neuronal Subsets. eNeuro 9 (2022). https://doi.org/10.1523/ENEURO.0026-22.2022

      (2) McGovern, A. E. et al. Evidence for multiple sensory circuits in the brain arising from the respiratory system: an anterograde viral tract tracing study in rodents. Brain Struct Funct 220, 3683-3699 (2015). https://doi.org/10.1007/s00429-014-0883-9

      (3) Shen, C.-C., Wang, C.-C., Liao, M.-H. & Jan, T.-R. A single exposure to iron oxide nanoparticles attenuates antigen-specific antibody production and T-cell reactivity in ovalbumin-sensitized BALB/c mice. International journal of nanomedicine, 1229-1235 (2011).  

      (4) Delayre-Orthez, C., De Blay, F., Frossard, N. & Pons, F. Dose-dependent effects of endotoxins on allergen sensitization and challenge in the mouse. Clinical & Experimental Allergy 34, 1789-1795 (2004).  

      (5) Morokata, T., Ishikawa, J. & Yamada, T. Antigen dose defines T helper 1 and T helper 2 responses in the lungs of C57BL/6 and BALB/c mice independently of splenic responses. Immunology letters 72, 119-126 (2000).  

      (6) Li, L., Hua, L., He, Y. & Bao, Y. Differential effects of formaldehyde exposure on airway inflammation and bronchial hyperresponsiveness in BALB/c and C57BL/6 mice. PLoS One 12, e0179231 (2017).  

      (7) Ikeda-Miyagawa, Y. et al. Peripherally increased artemin is a key regulator of TRPA1/V1 expression in primary afferent neurons. Molecular Pain 11, s12990-015-0004-7 (2015).

      (8) Baan, E. J. et al. Characterization of Asthma by Age of Onset: A Multi-Database Cohort Study. J Allergy Clin Immunol Pract 10, 1825-1834 e1828 (2022). https://doi.org/10.1016/j.jaip.2022.03.019

      (9) de Nijs, S. B., Venekamp, L. N. & Bel, E. H. Adult-onset asthma: is it really different? Eur Respir Rev 22, 44-52 (2013). https://doi.org/10.1183/09059180.00007112

      (10) Gianniou, N. et al. Acute effects of smoke exposure on airway and systemic inflammation in forest firefighters. J Asthma Allergy 11, 81-88 (2018). https://doi.org/10.2147/JAA.S136417

      (11) Noah, T. L., Worden, C. P., Rebuli, M. E. & Jaspers, I. The Effects of Wildfire Smoke on Asthma and Allergy. Curr Allergy Asthma Rep 23, 375-387 (2023). https://doi.org/10.1007/s11882-023-01090-1

      (12) Wilgus, M. L. & Merchant, M. Clearing the Air: Understanding the Impact of Wildfire Smoke on Asthma and COPD. Healthcare (Basel) 12 (2024). https://doi.org/10.3390/healthcare12030307

    1. Author Response

      The following is the authors’ response to the current reviews.

      We thank the editors and reviewers for their helpful comments, which have allowed us to improve the manuscript.

      Response to reviewer 2

      We thank the reviewer for this positive feedback, which requires no further revision.

      Response to reviewer 3

      We thank the reviewer for highlighting these additional points and provide further explanations on these below.

      Firstly, we started the analysis from a baseline of year 2000 because the largest international donor (the Global Fund) uses baseline malaria levels in the period 2000-2004 as the basis of their current allocation calculations (The Global Fund, Description of the 2020-2022 Allocation Methodology, December 2019). In the paper we compare our optimal strategy to a simplified version of this method, represented by our “proportional allocation” strategy.

      Even if our simulations started in the year 2015, a direct comparison with the Global Technical Strategy for Malaria 2016-2030 would not be possible due to the different approaches taken. The GTS was developed to progress towards malaria elimination globally and set ambitious targets of at least 90% reduction in malaria case incidence and mortality rates and malaria elimination in at least 35 countries by 2030 compared to 2015. Mathematical modelling at the time suggested that 90% coverage of WHO-recommended interventions (vector control, treatment and seasonal malaria chemoprevention) would be needed to approach this target (Griffin et al. 2016, Lancet Infectious Diseases). The global annual investment requirements to meet GTS targets were estimated at US$6.4 billion by 2020 and US$8.7 billion by 2030 (Patouillard et al. 2017, BMJ Global Health). This strategy therefore considers what resources would be required to achieve a specific global target, but not the optimized allocation of resources.

      Investments into malaria control have consistently been below the estimated requirements for the GTS milestones (World Health Organization 2022, World Malaria Report 2022). In our study, we therefore take a different perspective on how limited budgets can be optimally allocated to a single intervention (insecticide-treated nets) across countries/settings to achieve the best possible outcome for two objectives that are different to the GTS milestones (either minimizing the global case burden, or minimizing both the global case burden and the number of settings not having yet reached a pre-elimination phase). As stated in the discussion, our estimate of allocating 76% of very low budgets to high-transmission settings was similar to the global investment targets estimated for the GTS, where the 20 countries with the highest burden in 2015 were estimated to require 88% of total investments (Patouillard et al. 2017, BMJ Global Health). Nevertheless, we also show that if higher budgets were available, allocating the majority to low-transmission settings co-endemic for P. falciparum and P. vivax would achieve the largest reduction in global case burden. We acknowledge the modelling of a single intervention as one of the key limitations of this analysis, but this simplification was necessary in order to perform the complex optimisation problem. Computationally it would not have been feasible to optimize across a multitude of intervention and coverage combinations.

      A further limitation raised by the reviewer is the lack of cross-species immunity between P. falciparum and P. vivax in our model. While cross-reactivity between antibodies against these two species has been observed in previous studies and the potential implications of this would be important to explore in future work, we did not include it here as little is known to date about the epidemiological interactions between different malaria parasite species (Muh et al. 2020, PLoS Neglected Tropical Diseases).

      Lastly, we did not assume that transmission was homogeneous within the four transmission settings in our study (very low, low, moderate, high); transmission dynamics were simulated separately in each country, accounting for heterogeneous mosquito bite exposure. However, results were summarised for the broader transmission settings since many other country-specific factors were not accounted for (see discussion) and the findings should not be used to inform individual country allocation decisions.


      The following is the authors’ response to the original reviews.

      Author response to peer review

      We thank the reviewers for their insightful comments, which raise several important points regarding our study. As the reviewers have recognised, we introduced a number of simplifications in order to perform this complex optimisation problem, such as by restricting the analysis to a single intervention (insecticide-treated nets) and modelling countries at a national level. Despite their clear relevance to the study, computationally it would not have been feasible to run the multitude of scenarios suggested by reviewer 1, which we recognise as a limitation. As such we agree with the assessment that this study primarily represents a thought experiment, based on substantive modelling and aggregate scenario-based analysis, to assess whether current policies are aligned with an optimal allocation strategy or whether there might be a need to consider alternative strategies. The findings are relevant primarily to global funders and should not be used to inform individual country allocation decisions, and also point to avenues for further research. This perspective also underlies our decision to start the analysis from a baseline of year 2000 as opposed to modelling the current 2023 malaria situation: the largest international donor (the Global Fund) uses baseline malaria levels in the period 2000-2004 as the basis of their allocation calculations (The Global Fund, Description of the 2020-2022 Allocation Methodology, December 2019) (1). A simplified version of this method is represented by our “proportional allocation” strategy. We have made several revisions to the manuscript to address the points raised by the reviewers, as detailed below.

      Reviewer #1 (Public Review):

      1. The authors present a back-of-the-envelope exploration of various possible resource allocation strategies for ITNs. They identify two optimal strategies based on two slightly different objective functions and compare 3 simple strategies to the outcomes of the optimal strategies and to each other. The authors consider both P. falciparum and P. vivax and explore this question at the country level, using 2000 prevalence estimates to stratify countries into 4 burden categories. This is a relevant question from a global funder perspective, though somewhat less relevant for individual countries since countries are not making decisions at the global scale.

      Thank you for this summary of the paper. We agree that our analysis is of relevance to global funders, but is not meant to inform individual country allocation decisions. In the discussion, we now state:

      p. 12 L19: “Therefore, policy decisions should additionally be based on analysis of country-specific contexts, and our findings are not informative for individual country allocation decisions.”

      1. The authors have made various simplifications to enable the identification of optimal strategies, so much so that I question what exactly was learned. It is not surprising that strategies that prioritize high-burden settings would avert more cases.

      Thank you for raising this point. Indeed, several simplifying assumptions were necessary to ensure the computational feasibility of this complex optimization problem. As a result, our study primarily represents a thought experiment to assess whether current policies are aligned with an optimal allocation strategy or whether there might be a need to consider alternative strategies. As now further outlined in the introduction, approaches to this have differed over time and it remains a relevant debate for malaria policy.

      p. 2 L22: “However, there remains a lack of consensus on how best to achieve this longer-term aspiration. Historically, large progress was made in eliminating malaria mainly in lower-transmission countries in temperate regions during the Global Malaria Eradication Program in the 1950s, with the global population at risk of malaria reducing from around 70% of the world population in 1950 to 50% in 2000 (2). Renewed commitment to malaria control in the early 2000s with the Roll Back Malaria initiative subsequently extended the focus to the highly endemic areas in sub-Saharan Africa (3).”

      We believe our findings not only confirm an “expected” outcome – that prioritizing high-burden settings would avert more cases – but also clearly illustrate various consequences of different allocation strategies that are implemented or considered in reality, which may not be so obvious. For example, we found that initially allocating a larger share of the budget to high-transmission countries could be almost optimal in terms of both reducing clinical cases and maximising the number of countries reaching pre-elimination. We also observed a trade-off between reducing burden and reducing the global population at risk (“shrinking the map”) through a focus on near-elimination settings, and we estimated the loss in burden reduction when following an elimination target.

      1. Generally, I found much of the text confusing and some concepts were barely explained, such that the logic was difficult to follow.

      Thank you for bringing this to our attention, and we regret to hear the manuscript was confusing to read. We believe that the revisions made as a result of the reviewer comments have now made the manuscript much easier to follow. We additionally passed the manuscript to a colleague to identify confusing passages, and have added a number of sentences to clarify key concepts and improve the structure.

      1. I am not sure why the authors chose to stratify countries by 2000 PfPR estimates and in essence explore a counterfactual set of resource allocation strategies rather than begin with the present and compare strategies moving forward. I would think that beginning in 2020 and modeling forward would be far more relevant, as we can't change the past. Furthermore, there was no comparison with allocations and funding decisions that were actually made between 2000 and 2020ish so the decision to begin at 2000 is rather confusing.

      Thank you for pointing this out. We have now made the rationale for this choice clearer in the manuscript. Our main reason for this was to allow comparison with the Global Fund funding allocation, which is largely based on malaria disease burden in 2000-2004. As stated in the paper, malaria prevalence estimates in the year 2000 are commonly considered to represent a “baseline” endemicity level, before large-scale implementation of interventions in the following decades. In the manuscript, the transmission-related element of the Global Fund allocation algorithm is represented in our “proportional allocation” strategy. Previously this was only mentioned in the methods, but we have now added the following in the results to address this comment of the reviewer:

      p. 6 L12: “Strategies prioritizing high- or low-transmission settings involved sequential allocation of funding to groups of countries based on their transmission intensity (from highest to lowest EIR or vice versa). The proportional allocation strategy mimics the current allocation algorithm employed by the Global Fund: budget shares are mainly distributed according to malaria disease burden in the 2000-2004 period. To allow comparison with this existing funding model, we also started allocation decisions from the year 2000.”
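
      The two kinds of strategy quoted above can be sketched as a toy calculation. This is an illustration under stated assumptions, not the authors' model: the setting names, EIR values, burden figures, and per-setting funding caps are all hypothetical, and the actual analysis evaluates allocations with a transmission model rather than fixed caps.

```python
def sequential_allocation(settings, budget, highest_first=True):
    """Fund settings one at a time in order of EIR until the budget runs out.

    `settings` maps a setting name to {"eir": float, "cap": float}, where
    "cap" is a hypothetical maximum amount that setting can absorb.
    """
    order = sorted(settings, key=lambda s: settings[s]["eir"], reverse=highest_first)
    shares = {name: 0.0 for name in settings}
    remaining = budget
    for name in order:
        spend = min(settings[name]["cap"], remaining)
        shares[name] = spend
        remaining -= spend
    return shares


def proportional_allocation(burden, budget):
    """Split the budget in proportion to each setting's baseline burden."""
    total = sum(burden.values())
    return {name: budget * cases / total for name, cases in burden.items()}


# Hypothetical settings and baseline burdens (illustrative numbers only).
settings = {
    "A": {"eir": 50.0, "cap": 60.0},
    "B": {"eir": 5.0, "cap": 30.0},
    "C": {"eir": 0.5, "cap": 30.0},
}
burden = {"A": 700.0, "B": 200.0, "C": 100.0}

high_first = sequential_allocation(settings, budget=100.0, highest_first=True)
low_first = sequential_allocation(settings, budget=100.0, highest_first=False)
proportional = proportional_allocation(burden, budget=100.0)
```

      In this toy setup, the sequential strategies differ only in the order in which settings are funded, whereas the proportional strategy gives every setting a share regardless of order.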

      The Global Fund framework additionally considers economic capacity and other specific factors, and we have now also included a direct comparison with the 2020-2022 Global Fund allocation in Supplementary Figure S12 (see Author response image 1).

      We agree that looking at allocation decisions from 2020 onward would also constitute a very interesting question. However, the high dimensionality in scenarios to consider for this would currently make it computationally infeasible to run on the global level. Not only would it have to include all interventions currently implemented and available for malaria at different levels of coverage, but also the option of scaling down existing interventions. Instead, our priority in this paper was to conduct a thought experiment including both P. falciparum and P. vivax on a large geographical scale.

      Author response image 1.

      Impact of the proportional allocation strategy and the 2020-2022 Global Fund allocation on global malaria cases (panel A) and the total population at risk of malaria (panel B) at varying budgets. Both strategies use the same algorithm for budget share allocation based on malaria disease burden in 2000-2004, but the Global Fund allocation additionally involves an economic capacity component and specific strategic priorities.

      1. I realize this is a back-of-the-envelope assessment (although it is presented to be less approximate than it is, and the title does not reveal that the only intervention strategy considered is ITNs) but the number and scope of modeling assumptions made are simply enormous. First, that modeling is done at the national scale, when transmission within countries is incredibly heterogeneous. The authors note a differential impact of ITNs at various transmission levels and I wonder how the assumption of an intermediate average PfPR vs modeling higher and lower PfPR areas separately might impact the effect of the ITNs.

      Thank you for this comment. We agree the title could be more specific and have changed this to “Resource allocation strategies for insecticide-treated bednets to achieve malaria eradication”.

      Regarding the scale of ITN allocation, it is true that allocation at a sub-national scale could affect the results. However, considering this at a national scale is most relevant for our analysis because this is the scale at which global funding allocation decisions are made in practice. A sentence explaining this has been added in the methods.

      p. 15 L8: “The analysis was conducted on the national level, since this scale also applies to funding decisions made by international donors (1).”

      Further considering different geographical scales would also require introducing other assumptions, for example about how different countries would distribute funding sub-nationally, whether specific countries would take cooperative or competitive approaches to tackle malaria within a region or in border areas, and about delays in the allocation of bednets in specific regions. These interesting questions were outside of the scope of this work, but certainly require further investigation.

      1. Second, the effect of ITNs will differ across countries due to variations in vector and human behavior and variation in insecticide resistance and susceptibility to the ITNs. The authors note this as a limitation but it is a little mind-boggling that they chose not to account for either factor since estimates are available for the historical period over which they are modeling.

      Thank you for pointing this out. We did consider this and mentioned it as a limitation. Nevertheless, the complexity of accounting for this should also be recognised; for example, there is substantial uncertainty about the precise relationship between insecticide resistance and the population-level effect of ITNs (Sherrard-Smith et al., 2022, Lancet Planetary Health) (4). Additionally, our simulations extend beyond the 2000-2023 period so further assumptions about future changes to these factors would also be required. Simplifying assumptions are inherent to all mathematical modelling studies and we consider these particular simplifications acceptable given the high-level nature of the analysis.

      1. Third, the assumption that elimination is permanent and nothing is needed to prevent resurgence is, as the authors know, a vast oversimplification. Since resources will be needed to prevent resurgence, it appears this assumption may have a substantial impact on the authors' results.

      Thank you for this comment. In the discussion, we have now expanded on this:

      p. 13 L3: “While our analysis presents allocation strategies to progress towards eradication, the results do not provide insight into allocation of funding to maintain elimination. In practice, the threat of malaria resurgence has important implications for when to scale back interventions.”

      We believe that, from a global perspective, the questions of allocating funding to achieve elimination and to maintain it can currently still be considered separately, given the long timescales involved. The cost of preventing resurgence is not known, and a major difficulty in accounting for it would be identifying the relevant timescale over which to quantify it.

      1. The decision to group all settings with EIR > 7 together as "high transmission" may perhaps be driven by WHO definitions but at a practical level this groups together countries with EIR 10 and EIR 500. Why not further subdivide this group, which makes sense from a technical perspective when thinking about optimal allocation strategies?

      Thank you for pointing this out. The WHO categories used are better interpreted in terms of the corresponding prevalence, which places countries with a prevalence of over 35% in the high transmission categories (WHO Guidelines for malaria, 31 March 2022) (5). We felt this is appropriate given that we are looking at theoretical global allocation patterns and do not aim to make recommendations for specific groups of countries or individual countries within sub-Saharan Africa that would be distinguished through the use of higher cut-offs. In our analysis, all 25 countries in the high transmission category were located in sub-Saharan Africa.

      1. The relevance of this analysis for elimination is a little questionable since no one eliminates with ITNs alone, to the best of my understanding.

      Thank you for this comment. We indeed state in the paper that ITNs alone are not sufficient to eliminate malaria. However, we still think that our analysis is relevant for elimination by taking a more theoretical perspective on reducing transmission using interventions. Starting from the 2000 baseline (or current levels) globally, large-scale transmission reductions such as those achieved by mass ITN distribution still represent the first key step on the path to malaria eradication, as shown in previous modelling work (Griffin et al., 2016, Lancet Infectious Diseases) (6). In the final phase of elimination, the WHO also recommends the addition of more targeted and reactive interventions (WHO Guidelines for malaria, 31 March 2022) (5). Our changes to the title of the article (“Resource allocation strategies for insecticide-treated bednets to achieve malaria eradication”) should now better reflect that we consider ITNs as just one necessary component to achieve malaria eradication.

      Reviewer #2 (Public Review):

      1. Schmit et al. analyze and compare different strategies for the allocation of funding for insecticide-treated nets (ITNs) to reduce the global burden of malaria. They use previously published models of Plasmodium falciparum and Plasmodium vivax malaria transmission to quantify the effect of ITN distribution on clinical malaria numbers and the population at risk. The impact of different resource allocation strategies on the reduction of malaria cases or a combination of malaria cases and achieving pre-elimination is considered to determine the optimal strategy to allocate global resources to achieve malaria eradication.

      Strengths:

      Schmit et al. use previously published models and optimization for rigorous analysis and comparison of the global impact of different funding allocation strategies for ITN distribution. This provides evidence of the effect of three different approaches: the prioritization of high-transmission settings to reduce the disease burden, the prioritization of low-transmission settings to "shrink the malaria map", and a resource allocation proportional to the disease burden.

      Thank you for providing this summary and outline of the strengths of the paper.

      1. Weaknesses:

      The analysis and optimization which provide the evidence for the conclusions and are thus the central part of this manuscript necessitate some simplifying assumptions which may have important practical implications for the allocation of resources to reduce the malaria burden. For example, seasonality, mosquito species-specific properties, stochasticity in low transmission settings, and changing population sizes were not included. Other challenges to the reduction or elimination of malaria such as resistance of parasites and mosquitoes or the spread of different mosquito species as well as other beneficial interventions such as indoor residual spraying, seasonal malaria chemoprevention, vaccinations, combinations of different interventions, or setting-specific interventions were also not included. Schmit et al. clearly state these limitations throughout their manuscript.

      The focus of this work is on ITN distribution strategies; other interventions are not considered. It also provides a global perspective, so analysis of the specific local setting (as also noted by Schmit et al.), of different interventions, and of combinations of interventions should also be taken into account for any decisions.

      Thank you for raising these points. As outlined at the beginning of our response, for computational reasons we indeed had to introduce several simplifying assumptions to perform this complex optimisation problem. As a result of these factors you highlighted, our study should primarily be interpreted as a thought experiment to assess whether current policies are aligned with an optimal allocation strategy or whether there might be a need to consider alternative strategies. The findings are relevant primarily to global funders and should not be used to inform individual country allocation decisions, which we have further clarified in the manuscript.

      1. Nonetheless, the rigorous analysis supports the authors' conclusions and provides evidence that supports the prioritization of funding of ITNs for settings with high Plasmodium falciparum transmission. Overall, this work may contribute to making evidence-based decisions regarding the optimal prioritization of funding and resources to achieve a reduction in the malaria burden.

      Thank you for this positive assessment of our work.

      Reviewer #1 (Recommendations For The Authors):

      1. L144: last paragraph, the focus on endemic equilibrium: I did not really understand this, when 39 years is mentioned later is that a different analysis? How are cases averted calculated in a time-agnostic endemic equilibrium analysis? Perhaps a little more detail here would be helpful.

      A further explanation of this has been added in the results and methods.

      p. 8 L 22: “To evaluate the robustness of the results, we conducted a sensitivity analysis on our assumption on ITN distribution efficiency. Results remained similar when assuming a linear relationship between ITN usage and distribution costs (Figure S10). While the main analysis involves a single allocation decision to minimise long-term case burden (leading to a constant ITN usage over time in each setting irrespective of subsequent changes in burden), we additionally explored an optimal strategy with dynamic re-allocation of funding every 3 years to minimise cases in the short term.”

      p. 17 L25: “To ensure computational feasibility, 39 years was used as it was the shortest time frame over which the effect of re-distribution of funding from countries having achieved elimination could be observed.”

      p. 18 L 9: “Global malaria case burden and the population at risk were compared between baseline levels in 2000 and after reaching an endemic equilibrium under each scenario for a given budget.”

      1. L148: what is proportional allocation by disease burden and how is that different from prioritizing high-transmission settings?

      Further details have been added in the text.

      p. 6 L12: “Strategies prioritizing high- or low-transmission settings involved sequential allocation of funding to groups of countries based on their transmission intensity (from highest to lowest EIR or vice versa). The proportional allocation strategy mimics the current allocation algorithm employed by the Global Fund: budget shares are mainly distributed according to malaria disease burden in the 2000-2004 period. To allow comparison with this existing funding model, we also started allocation decisions from the year 2000.”
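
      The sequential strategies described above can be illustrated with a minimal Python sketch. It is not the optimisation code used in the study: the `Setting` class, the `max_cost` cap (the budget at which a setting is assumed to gain nothing further from ITNs) and all numbers are hypothetical, chosen only to show how funding moves through settings in EIR order.

```python
from dataclasses import dataclass


@dataclass
class Setting:
    name: str
    eir: float       # entomological inoculation rate (hypothetical value)
    max_cost: float  # assumed budget at which ITN funding stops helping


def allocate_sequentially(settings, budget, high_first=True):
    """Fund settings one at a time in EIR order until the budget runs out."""
    ordered = sorted(settings, key=lambda s: s.eir, reverse=high_first)
    allocation = {}
    remaining = budget
    for setting in ordered:
        spend = min(setting.max_cost, remaining)
        allocation[setting.name] = spend
        remaining -= spend
    return allocation


# Three hypothetical settings and a budget too small to fund all of them.
settings = [
    Setting("A", eir=80.0, max_cost=50.0),
    Setting("B", eir=5.0, max_cost=30.0),
    Setting("C", eir=0.5, max_cost=40.0),
]
high = allocate_sequentially(settings, budget=70.0, high_first=True)
low = allocate_sequentially(settings, budget=70.0, high_first=False)
```

      With these illustrative values, prioritising high transmission leaves setting C unfunded, whereas prioritising low transmission leaves setting A unfunded.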

      1. L198-9: did low transmission settings get the majority of funding at intermediate and maximum budgets because they have the most population (I think so, based on Fig 1)?

      Yes, this is correct. We state in the results: “the optimized distribution of funding to minimize clinical burden depended on the available global budget and was driven by the setting-specific transmission intensity and the population at risk”.

      1. L206: what is ITN distribution efficiency? This is not explained. What is the 39-year period? Why this duration?

      Further explanations have been added in the results section, which were previously only detailed in the methods:

      p. 8 L 22: “To evaluate the robustness of the results, we conducted a sensitivity analysis on our assumption on ITN distribution efficiency. Results remained similar when assuming a linear relationship between ITN usage and distribution costs (Figure S10).”

      p. 17 L25: “To ensure computational feasibility, 39 years was used as it was the shortest time frame over which the effect of re-distribution of funding from countries having achieved elimination could be observed.”

      1. L218: what is "no intervention with a high budget"? is this a phrasing confusion?

      Yes, this has been changed.

      p. 9 L14: “We estimated that optimizing ITN allocation to minimize global clinical incidence could, at a high budget, avert 83% of clinical cases compared to no intervention.”

      1. L235-7: on comparing these results to previous work on the 20 highest-burden countries: is the definition of "high" similar enough across these studies that this is a relevant comparison?

      We believe this is reasonably comparable, as looking at the 20 highest-burden countries encompasses almost the entire high-transmission group in our work (25 countries in total), on which the comparison is made.

      1. L267-70: I didn't understand this sentence at all.

      Thanks for flagging this. The sentence referred to is: “Allocation proportional to disease burden did not achieve as great an impact as other strategies because the funding share assigned to settings was constant irrespective of the invested budget and its impact, and we did not reassign excess funding in high-transmission settings to other malaria interventions.”

      The previously mentioned added details on the proportional allocation strategy in the manuscript should now make this clearer, together with this clarification:

      p. 11 L17: “In modelling this strategy, we did not reassign excess funding in high-transmission settings to other malaria interventions, as would likely occur in practice.”

      For proportional allocation, a fixed proportion of the budget is calculated for each country based on disease burden, as described in the Global Fund allocation documentation (see Methods). However, since ITNs are the only intervention considered, this leads to a higher budget being allocated than is needed in some countries (i.e. where more funding doesn’t translate into further health gains).
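
      The point about excess funding can be made concrete with a small Python sketch. It assumes a purely burden-proportional split (the Global Fund methodology is described above only as "mainly" burden-based) and a hypothetical `useful_cap` per country beyond which further ITN funding yields no health gain; all names and numbers are illustrative.

```python
def allocate_proportionally(burden, budget):
    """Split the budget in proportion to each country's disease burden."""
    total = sum(burden.values())
    return {name: budget * cases / total for name, cases in burden.items()}


def excess_funding(allocation, useful_cap):
    """Funding above the assumed level at which ITNs stop adding benefit."""
    return {
        name: max(0.0, allocation[name] - useful_cap[name])
        for name in allocation
    }


# Hypothetical burden (cases) and hypothetical saturation budgets.
burden = {"A": 600.0, "B": 300.0, "C": 100.0}
useful_cap = {"A": 50.0, "B": 40.0, "C": 20.0}

shares = allocate_proportionally(burden, budget=100.0)
excess = excess_funding(shares, useful_cap)
```

      In this toy case country A receives more than it can usefully spend on ITNs, and because the shares are fixed that surplus is not passed on to B or C.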

      1. L339 EIR range: 80 is high at the country level but areas within countries probably went as high as 500 back in 2000. How does this affect the modeled estimates of ITN impact?

      The question of sub-national differences in transmission has been addressed in the public review comments. Briefly, we consider the national scale to be most relevant for our analysis because this is the scale at which global funding allocation decisions are made in practice. Although, as you correctly point out, the EIR affects ITN impact, it is not possible to conclude what the average effect of this would be on the country level without considering the following factors and introducing further assumptions on these: how would different countries distribute funding sub-nationally? Which countries would take cooperative or competitive approaches to tackle malaria within a region or in border areas? Would there be delays in the allocation of bednets in specific regions? These interesting questions were outside of the scope of this work, but certainly require further investigation.

      1. L347 population size constant: births and deaths are still present, is that right? Unclear from this sentence

      Yes, this is correct. Full details on the model can be found in the Supplementary Materials.

      1. L370 estimating ITN distribution required to achieve simulated population usage: is this a single relationship for all of Africa? Is it based on ITNs distributed 2:1 -> % access -> % usage? So it accounts for allocation inefficiency?

      Yes, this is represented by a single relationship for all of Africa to account for allocation inefficiency and is based on observed patterns across the continent and methodology developed in a previous publication (Bertozzi-Villa et al., 2021, Nature Communications) (7). Full details can be found in the Supplementary Materials (“Relationship between distribution and usage of insecticide-treated nets (ITNs)”, p. 21).

      1. L375: the ITN unit cost is assumed constant across countries and time (I think, it doesn't say explicitly), is this a good assumption?

      Yes, this is correct. We consider this a reasonable assumption within the scope of the paper. While delivery costs likely vary across countries, international funders usually have pooled procurement mechanisms for ITNs (The Global Fund, 2023, Pooled Procurement Mechanism Reference Pricing: Insecticide-Treated Nets).

      1. L399: "single allocation of a constant ITN usage" it is not explained what exactly this means

      Further explanations have been added in the manuscript.

      p. 8 L24: “While the main analysis involves a single allocation decision to minimise long-term case burden (leading to a constant ITN usage over time in each setting irrespective of subsequent changes in burden), we additionally explored an optimal strategy with dynamic re-allocation of funding every 3 years to minimise cases in the short term.”

      Reviewer #2 (Recommendations For The Authors):

      1. Additionally to the public comments, the only major comment is that in this reviewer's opinion, the focus on ITNs as the only intervention should be made clearer at different places in the manuscript (e.g. in the discussion lines 303-304). Otherwise, there are only some minor comments (see below).

      We have now modified the following sentence and also included this suggestion in the title (“Resource allocation strategies for insecticide-treated bednets to achieve malaria eradication”).

      p. 13 L8: “Our analysis demonstrates the most impactful allocation of a global funding portfolio for ITNs to reduce global malaria cases.”

      1. Minor comments:
      2. It may be of interest to compare the maximum budget obtained from the optimization with other estimates of required funding and actual available funding.

      Thank you for this interesting suggestion. Our maximum budget estimates are similar to the required investments projected for the WHO Global Technical Strategy: US$3.7 billion for ITNs in our analysis compared to between US$6.8 and US$10.3 billion total annual resources between 2020 and 2030, of which an estimated 55% would be required for (all) vector control (US$3.7 - US$5.7 billion) (Patouillard et al., 2017, BMJ Global Health) (8). However, it is well known that current spending is far below these requirements: total investments in malaria were estimated to be about US$3.1 billion per year in the last 5 years (World Health Organization, 2022, World Malaria Report 2022) (9).
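
      The vector-control range quoted above follows directly from the 55% share. The short Python check below only reproduces that arithmetic from the figures stated in the response; the variable names are ours.

```python
# Figures quoted in the response (US$ billion per year).
VECTOR_CONTROL_SHARE = 0.55
total_low, total_high = 6.8, 10.3

# Share of total required resources attributed to vector control.
vector_low = VECTOR_CONTROL_SHARE * total_low
vector_high = VECTOR_CONTROL_SHARE * total_high

print(f"Vector control: US${vector_low:.1f} to US${vector_high:.1f} billion")
```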

      1. Line 177: should "Figure S7" be bold?

      Yes, this has been corrected.

      1. Line 218: what does "no intervention with high budget" mean? Should this simply be "no intervention"?

      This has been changed.

      p. 9 L14: “We estimated that optimizing ITN allocation to minimize global clinical incidence could, at a high budget, avert 83% of clinical cases compared to no intervention.”

      1. In this reviewer's opinion it would be easier for the reader if the weighting term in the objective function would be added in the Materials and Methods section. The weighting could be added without extending the section substantially and the explanation in lines 390-393 may be easier to understand.

      Thank you for this suggestion. We agree and have added this in the main manuscript.

      References

      1. The Global Fund. Description of the 2020-2022 Allocation Methodology. 2019. Available from: https://www.theglobalfund.org/media/9224/fundingmodel_2020-2022allocations_methodology_en.pdf.

      2. Hay SI, Guerra CA, Tatem AJ, Noor AM, Snow RW. The global distribution and population at risk of malaria: past, present, and future. Lancet Infect Dis. 2004;4(6):327-36.

      3. Feachem RGA, Phillips AA, Hwang J, Cotter C, Wielgosz B, Greenwood BM, et al. Shrinking the malaria map: progress and prospects. The Lancet. 2010;376(9752):1566-78.

      4. Sherrard-Smith E, Winskill P, Hamlet A, Ngufor C, N'Guessan R, Guelbeogo MW, et al. Optimising the deployment of vector control tools against malaria: a data-informed modelling study. The Lancet Planetary Health. 2022;6(2):e100-e9.

      5. World Health Organization. WHO Guidelines for malaria, 31 March 2022. Geneva: World Health Organization; 2022. Contract No.: Geneva WHO/UCN/GMP/2022.01 Rev.1.

      6. Griffin JT, Bhatt S, Sinka ME, Gething PW, Lynch M, Patouillard E, et al. Potential for reduction of burden and local elimination of malaria by reducing Plasmodium falciparum malaria transmission: a mathematical modelling study. The Lancet Infectious Diseases. 2016;16(4):465-72.

      7. Bertozzi-Villa A, Bever CA, Koenker H, Weiss DJ, Vargas-Ruiz C, Nandi AK, et al. Maps and metrics of insecticide-treated net access, use, and nets-per-capita in Africa from 2000-2020. Nature Communications. 2021;12(1):3589.

      8. Patouillard E, Griffin J, Bhatt S, Ghani A, Cibulskis R. Global investment targets for malaria control and elimination between 2016 and 2030. BMJ Global Health. 2017;2(2):e000176.

      9. World Health Organization. World malaria report 2022. Geneva: World Health Organization; 2022. Report No.: 9240064893.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This work presents H3-OPT, a deep learning method that effectively combines existing techniques for the prediction of antibody structure. This work is important because the method can aid the design of antibodies, which are key tools in many research and industrial applications. The experiments for validation are solid.

      Comments to Author:

      Several points remain partially unclear, such as:

      1) Which examples constitute proper validation;

      Thank you for your kind reminder. We have revised the description of the validation experiments to identify which examples constitute proper validation. We changed the sentence “Finally, H3-OPT also shows lower Cα-RMSDs compared to AF2 or tFold-Ab for the majority of targets in an expanded benchmark dataset, including all antibody structures from CAMEO 2022” to “Finally, H3-OPT also shows lower Cα-RMSDs compared to AF2 or tFold-Ab for the majority (six of seven) of targets in an expanded benchmark dataset, including all antibody structures from CAMEO 2022” and added the following sentence to the experimental validation section of the revised manuscript: “AlphaFold2 outperformed IgFold on these targets”.

      2) What the relevance of the molecular dynamics calculations as performed is;

      Thank you for your comment, and we apologize for any confusion. The goal of our molecular dynamics calculations is to compare binding affinities, an important issue in antibody engineering, between AlphaFold2-predicted complexes and H3-OPT-predicted complexes. Molecular dynamics simulations enable the investigation of the dynamic behaviors and interactions of these complexes over time. Unlike other tools for predicting binding free energy, MM/PBSA or MM/GBSA calculations provide dynamic properties of complexes by sampling conformational space, which helps in obtaining more accurate estimates of binding free energy. In summary, our molecular dynamics calculations demonstrated that the binding free energies of H3-OPT-predicted complexes are closer to those of native complexes. We have included the following sentence in our manuscript to explain the molecular dynamics calculations: “Since affinity prediction plays a crucial role in antibody therapeutics engineering, we performed MD simulations to compare the differences in binding affinities between AF2-predicted complexes and H3-OPT-predicted complexes.”

      3) The statistics for some of the comparisons;

      Thank you for the comment. We have incorporated statistics for some of the comparisons in the revised version of our manuscript and added the following sentence in the Methods section: “We conducted two-sided t-test analyses to assess the statistical significance of differences between the various groups. Statistical significance was considered when the p-values were less than 0.05. These statistical analyses were carried out using Python 3.10 with the Scipy library (version 1.10.1).”
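
      A two-sided t-test of the kind described can be run with SciPy as sketched below. The response does not state whether the groups were treated as independent or paired, so the use of `scipy.stats.ttest_ind` (independent samples, equal variances) is an assumption, and the RMSD values are invented purely for illustration.

```python
from scipy import stats

# Hypothetical per-target Cα-RMSD values (Å), invented for illustration only.
method_a = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5]
method_b = [2.9, 4.1, 3.6, 4.8, 3.9, 3.2]

# ttest_ind is two-sided by default; an independent-samples test is assumed.
result = stats.ttest_ind(method_a, method_b)

# Apply the 0.05 significance threshold stated in the Methods sentence.
significant = result.pvalue < 0.05
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

      If the same targets are scored by both methods, a paired test such as `scipy.stats.ttest_rel` may be the more appropriate choice.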

      4) The lack of comparison with other existing methods.

      We appreciate your valuable comments and suggestions. Conducting comparisons with a broader set of existing methods can further facilitate discussions on the strengths and weaknesses of each method, as well as the accuracy of our method. In our study, we conducted a comparison of H3-OPT with many existing methods, including AlphaFold2, HelixFold-Single, ESMFold, and IgFold. We demonstrated that several protein structure prediction methods, such as ESMFold and HelixFold-Single, do not match the accuracy of AlphaFold2 in CDR-H3 prediction. Additionally, we performed a detailed comparison between H3-OPT, AlphaFold2, and IgFold (the latest antibody structure prediction method) for each target.

      We sincerely appreciate the comment and have added a comparison with OmegaFold. The results have been incorporated into the relevant sections (Fig 4a-b) of the revised manuscript.

      Author response image 1.

      Public Reviews

      Comments to Author:

      Reviewer #1 (Public Review):

      Summary:

      The authors developed a deep learning method called H3-OPT, which combines the strength of AF2 and PLM to reach better prediction accuracy of antibody CDR-H3 loops than AF2 and IgFold. These improvements will have an impact on antibody structure prediction and design.

      Strengths:

      The training data are carefully selected and clustered, the network design is simple and effective.

      The improvements include smaller average Cα RMSD, backbone RMSD, side chain RMSD, more accurate surface residues and/or SASA, and more accurate H3 loop-antigen contacts.

      The performance is validated from multiple angles.

      Weaknesses:

      1) There are very limited prediction-then-validation cases, basically just one case.

      Thanks for pointing out this issue. The number of prediction-then-validation cases is helpful to show the generalization ability of our model. However, obtaining experimental structures is both costly and labor-intensive. Furthermore, experimental validation cases only capture a limited portion of the sequence space in comparison to the broader diversity of antibody sequences.

      To address this challenge, we have collected different datasets to serve as benchmarks for evaluating the performance of H3-OPT, including our non-redundant test set and the CAMEO dataset. The introduction of these datasets allows for effective assessments of H3-OPT’s performance without biases and tackles the obstacle of limited prediction-then-validation cases.

      Reviewer #2 (Public Review):

      This work provides a new tool (H3-Opt) for the prediction of antibody and nanobody structures, based on the combination of AlphaFold2 and a pre-trained protein language model, with a focus on predicting the challenging CDR-H3 loops with higher accuracy than previously developed approaches. This task is of high value for the development of new therapeutic antibodies. The paper provides an external validation consisting of 131 sequences, with further analysis of the results by segregating the test sets into three subsets of varying difficulty and comparison with other available methods. Furthermore, the approach was validated by comparing three experimentally solved 3D structures of anti-VEGF nanobodies with the H3-Opt predictions.

      Strengths:

      The experimental design to train and validate the new approach has been clearly described, including the dataset compilation and its representative sampling into training, validation and test sets, and structure preparation. The results of the in-silico validation are quite convincing and support the authors' conclusions.

      The datasets used to train and validate the tool and the code are made available by the authors, which ensures transparency and reproducibility, and allows future benchmarking exercises with incoming new tools.

      Compared to AlphaFold2, the authors' optimization seems to produce better results for the most challenging subsets of the test set.

      Weaknesses:

      1) The scope of the binding affinity prediction using molecular dynamics is not that clearly justified in the paper.

      We sincerely appreciate your valuable comment. We have added the following sentence in our manuscript to justify the scope of the molecular dynamics calculations: “Since affinity prediction plays a crucial role in antibody therapeutics engineering, we performed MD simulations to compare the differences in binding affinities between AF2-predicted complexes and H3-OPT-predicted complexes.”

      2) Some parts of the manuscript should be clarified, particularly the ones that relate to the experimental validation of the predictions made by the reported method. It is not absolutely clear whether the experimental validation is truly a prospective validation. Since the methodological aspects of the experimental determination are not provided here, it seems that this may not be the case. This is a key aspect of the manuscript that should be described more clearly.

      Thank you for the reminder about experimental validation of our predictions. The sequence identities of the wild-type nanobody VH domain and H3 loop, when compared with the best template, are 0.816 and 0.647, respectively. As a result, these mutants exhibited low sequence similarity to our dataset, indicating the absence of prediction bias for these targets. Thus, H3-OPT outperformed IgFold on these mutants, demonstrating our model's strong generalization ability. In summary, the experimental validation actually serves as a prospective validation.

      Thank you for your comments. We have added the following sentence to provide the methodological aspects of the experimental determination: “The protein expression, purification and crystallization experiments were described previously. The proteins used in the crystallization experiments were unlabeled. Upon thawing the frozen protein on ice, we performed a centrifugation step to eliminate any potential crystal nucleus and precipitants. Subsequently, we mixed the protein at a 1:1 ratio with commercial crystal condition kits using the sitting-drop vapor diffusion method facilitated by the Protein Crystallization Screening System (TTP LabTech, mosquito). After several days of optimization, single crystals were successfully cultivated at 21°C and promptly flash-frozen in liquid nitrogen. The diffraction data from various crystals were collected at the Shanghai Synchrotron Research Facility and subsequently processed using the aquarium pipeline.”

      3) Some Figures would benefit from a clearer presentation.

      We sincerely thank you for your careful reading. Following your comments, we have made extensive modifications to make our presentation clearer and more convincing (Fig 2c-f).

      Author response image 2.

      Reviewer #3 (Public Review):

      Summary:

      The manuscript introduces a new computational framework for choosing 'the best method' according to the case for getting the best possible structural prediction for the CDR-H3 loop. The authors show their strategy improves on average the accuracy of the predictions on datasets of increasing difficulty in comparison to several state-of-the-art methods. They also show the benefits of improving the structural predictions of the CDR-H3 in the evaluation of different properties that may be relevant for drug discovery and therapeutic design.

      Strengths:

      The authors introduce a novel framework, which can be easily adapted and improved. The authors use a well-defined dataset to test their new method. A modest average accuracy gain is obtained in comparison to other state-of-the art methods for the same task while avoiding testing different prediction approaches.

      Weaknesses:

      1) The accuracy gain is mainly ascribed to easy cases, while the accuracy and precision for moderate to challenging cases are comparable to other PLM methods (see Fig. 4b and Extended Data Fig. 2). That raises the question: how likely is it to be in a moderate or challenging scenario? For example, it is not clear whether the comparison to the solved X-ray structures of anti-VEGF nanobodies represents an easy or challenging case for H3-OPT. The mutant nanobodies seem not to provide any further validation as the single mutations are very far away from the CDR-H3 loop and they do not disrupt the structure in any way. Indeed, RMSD values follow the same trend in H3-OPT and IgFold predictions (Fig. 4c). A more challenging test and interesting application could be solving the structure of a designed or mutated CDR-H3 loop.

      Thank you for your rigorous consideration. When the experimental structure is unavailable, it is difficult to determine directly whether a target is easy to predict or challenging. We have constructed our non-redundant test set so that the number of easy-to-predict targets is comparable to that of the other two groups. Due to the limited availability of experimental antibody structures, especially nanobody structures, accurately predicting CDR-H3 remains a challenge. In our manuscript, we discuss the strengths and weaknesses of AlphaFold2 and other PLM-based methods, and we introduce H3-OPT as a comprehensive solution for antibody CDR3 modeling.

      We also appreciate your comment on experimental structures. We fully agree with your opinion and attempted to solve the experimental structures of seven mutants, including two mutants (Y95F and Q118N) that are close to the CDR-H3 loop. Unfortunately, although we tried seven different reagent kits with a total of 672 crystallization conditions, we were unable to obtain crystals for these mutants. Although the mutants we successfully solved may not have significantly disrupted the structures of the CDR-H3 loops, they have still provided valuable insights into the differences between MSA-based methods and MSA-free methods (such as IgFold) for antibody structure modeling.

      We have further conducted a benchmarking study using two examples, PDB IDs 5U15 and 5U0R, both with 18 residues in CDR-H3, to evaluate H3-OPT's performance in predicting mutated H3 loops. In the first case (target 5U15), AlphaFold2 failed to provide an accurate prediction of the extended orientation of the H3 loop, resulting in a less accurate prediction (Cα-RMSD = 10.25 Å) compared to H3-OPT (Cα-RMSD = 5.56 Å). In the second case (target 5U0R, a mutant of 5U15 in the CDR3 loop), AlphaFold2 and H3-OPT achieved Cα-RMSDs of 6.10 Å and 4.25 Å, respectively. Additionally, the Cα-RMSDs of OmegaFold predictions were 8.05 Å and 9.84 Å, respectively. These findings suggest that both AlphaFold2 and OmegaFold effectively captured the mutation effects on conformations but achieved lower accuracy in predicting long CDR3 loops when compared to H3-OPT.

      2) The proposed method lacks a confidence score or a warning to help guide the users in moderate to challenging cases.

      We appreciate your suggestions and we have trained a separate module to predict confidence scores. We used the MSE loss for confidence prediction, where the label error was calculated as the Cα deviation of each residue after alignment. The inputs of this module are the same as those used for H3-OPT, and it generates a confidence score ranging from 0 to 100.
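      The training target described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the predicted and native loops have already been superposed (the alignment step is omitted), and the function names, coordinates, and module output are hypothetical. How the module's output is mapped to the 0 to 100 range is not stated in the response, so that step is left out.

```python
import math

def per_residue_ca_deviation(pred_ca, native_ca):
    """Distance (in Å) between matching Cα atoms of two pre-aligned loops."""
    if len(pred_ca) != len(native_ca):
        raise ValueError("loops must have the same number of residues")
    return [math.dist(p, n) for p, n in zip(pred_ca, native_ca)]

def mse_loss(predicted_error, label_error):
    """Mean squared error between predicted and true per-residue errors."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_error, label_error)]
    return sum(squared) / len(squared)

# Hypothetical three-residue loop, assumed to be superposed on the native structure.
pred_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 1.0, 0.0)]
native_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 2.0), (7.6, 1.0, 0.0)]

labels = per_residue_ca_deviation(pred_ca, native_ca)  # label errors per residue
loss = mse_loss([0.0, 1.5, 0.0], labels)               # hypothetical module output
```

      Here the label for each residue is simply its Cα displacement from the native position, so a module trained with this loss learns to estimate per-residue error without access to the native structure at inference time.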

      3) The fact that AF2 outperforms H3-OPT in some particular cases (e.g. Fig. 2c and Extended Data Fig. 3) raises the question: is there still room for improvements? It is not clear how sensitive H3-OPT is to the defined parameters. In the same line, benchmarking against other available prediction algorithms, such as OmegaFold, could shed light on the actual accuracy limit.

      We fully understand your concern. Many papers have suggested that PLM-based models are computationally efficient but may have unsatisfactory accuracy when high-resolution templates and MSA are available (Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies, Ruffolo, J. A. et al, 2023). However, the accuracy of AF2 decreases substantially when the MSA information is limited. Therefore, we directly retained high-confidence structures of AF2 and introduced a PSPM to improve the accuracy of the targets with long CDR-H3 loops and few sequence homologs. The improvement in mean Cα-RMSD demonstrates that there is still room for more accurate prediction of CDR-H3 loops.

      We also appreciate your kind comment on defined parameters. In fact, once a benchmark dataset is established, determining an optimal cutoff value through parameter searching can indeed further improve the performance of H3-OPT in CDR3 structure prediction. However, it is important to note that this optimal cutoff value heavily depends on the testing dataset being used. Therefore, we provide a recommended cutoff value and offer a program interface for users who wish to manually define the cutoff value based on their specific requirements. Here, we showed the average Cα-RMSDs of our test set under different confidence cutoffs and the results have been added in the text accordingly.
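      The cutoff search described above can be illustrated with a short sketch. It is not the H3-OPT implementation: the records are made-up values, template grafting is ignored, and the rule "keep the AF2 loop when its confidence reaches the cutoff, otherwise use the PSPM loop" is an assumption based on the description of the CBM in this response.

```python
from statistics import mean

# Hypothetical per-target records: AF2 confidence for the CDR-H3 loop (0-100)
# and the Cα-RMSD (Å) of the AF2 loop and of the PSPM-optimized loop.
targets = [
    {"af2_conf": 92.0, "af2_rmsd": 0.9, "pspm_rmsd": 1.4},
    {"af2_conf": 78.0, "af2_rmsd": 1.2, "pspm_rmsd": 2.0},
    {"af2_conf": 61.0, "af2_rmsd": 4.8, "pspm_rmsd": 3.1},
]

def select_rmsd(target, cutoff):
    """Keep the AF2 loop when its confidence reaches the cutoff, else use PSPM."""
    if target["af2_conf"] >= cutoff:
        return target["af2_rmsd"]
    return target["pspm_rmsd"]

def sweep_cutoffs(targets, cutoffs):
    """Return {cutoff: mean Cα-RMSD} over all targets for each candidate cutoff."""
    return {c: mean(select_rmsd(t, c) for t in targets) for c in cutoffs}

results = sweep_cutoffs(targets, [60, 70, 80, 90])
best_cutoff = min(results, key=results.get)  # cutoff with the lowest mean RMSD
```

      Because the best cutoff is chosen against a particular set of targets, it will shift with the benchmark, which is the dataset dependence noted above.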

      Author response table 1.

      We also appreciate your reminder, and we have conducted a benchmark against OmegaFold. The results have been included in the manuscript (Fig 4a-b).

      Author response image 3.

      Reviewer #1 (Recommendations For The Authors):

      1) In Fig 3a, please also compare IgFold and H3-OPT (merge Fig. S2 into Fig 3a)

      In Fig 3b, please separate Sub2 and Sub3, and add IgFold's performance.

      Thank you very much for your professional advice. We have made revisions to the figures based on your suggestions.

      Author response image 4.

      2) For the three experimentally solved structures of anti-VEGF nanobodies, what are the sequence identities of the VH domain and H3 loop, compared to the best available template? What is the length of the H3 loop? Which category (Sub1/2/3) do the targets belong to? What is the performance of AF2 or AF2-Multimer on the three targets?

      We apologize for the confusion. The sequence identities of the VH domain and H3 loop are 0.816 and 0.647, respectively, compared with the best template. The CDR-H3 lengths of these nanobodies are all 17. According to our classification strategy, these nanobodies belong to Sub1. The confidence scores of these AlphaFold2-predicted loops were all higher than 0.8, and these loops were accepted as the outputs of H3-OPT by the CBM.

      3) Is AF2-Multimer better than AF2, when using the sequences of antibody VH and antigen as input?

      Thanks for your suggestions. Many papers have benchmarked AlphaFold2-Multimer for protein complex modeling and demonstrated that its accuracy in predicting protein complexes is far from satisfactory (Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants, Rui Yin, et al., 2022). Additionally, there is no significant difference between AlphaFold2 and AlphaFold2-Multimer in antibody modeling (Structural Modeling of Nanobodies: A Benchmark of State-of-the-Art Artificial Intelligence Programs, Mario S. Valdés-Tresanco, et al., 2023).

      From the data perspective, we employed a non-redundant dataset for training and validation. Since these structures are valuable, considering the antigen sequence would reduce the size of our dataset, potentially leading to underfitting.

      4) For H3 loop grafting, I noticed that only identical target and template H3 sequences can trigger grafting (lines 348-349). How many such cases are in the test set?

      We appreciate your comment from this perspective. There are thirty targets in our database with identical CDR-H3 templates.

      Reviewer #2 (Recommendations For The Authors):

      • It is not clear to me whether the three structures apparently used as experimental confirmation of the predictions have been determined previously in this study or not. This is a key aspect, as a retrospective validation does not have the same conceptual value as a prospective, a posteriori validation. Please note that different parts of the text suggest different things in this regard "The model was validated by experimentally solving three structures of anti-VEGF nanobodies predicted by H3-OPT" is not exactly the same as "we then sought to validate H3-OPT using three experimentally determined structures of anti-VEGF nanobodies, including a wild-type (WT) and two mutant (Mut1 and Mut2) structures, that were recently deposited in protein data bank". The authors are kindly advised to make this point clear. By the way, "protein data bank" should be in upper case letters.

      We thank you for your feedback and fully understand your concerns. To validate the performance of H3-OPT, we first solved the structures of both the wild-type and mutant anti-VEGF nanobodies and submitted these structures to the Protein Data Bank. We have corrected “that were recently deposited in protein data bank” into “that were recently deposited in Protein Data Bank” in our revised manuscript.

      • It would be good to clarify the goal and importance of the binding affinity prediction, as it seems a bit disconnected from the rest of the paper. Also, it would be good to include the production MD runs as Sup. Mat.

      Thanks for your valuable comment. We have added the following sentence to our manuscript to clarify the goal and importance of the molecular dynamics calculations: “Since affinity prediction plays a crucial role in antibody therapeutics engineering, we performed MD simulations to compare the differences in binding affinities between AF2-predicted complexes and H3-OPT-predicted complexes.”. The details of the production runs are described in the Methods section.

      • Has any statistical test been performed to compare the mean Cα-RMSD values across the modeling approaches included in the benchmark exercise?

      Thanks for this kind recommendation. We conducted a statistical test to assess the performance of different modeling approaches and demonstrated significant improvements with H3-OPT compared to other methods (p<0.001). Additionally, we have trained H3-OPT with five random seeds and compared mean Cα-RMSD values with all five models of AF2. Here, we showed the average Cα-RMSDs of H3-OPT and AlphaFold2.

      Author response table 1.

      • In Fig. 2c-f, I think it would be adequate to make the ordering criterion of the data points explicit in the caption or the graph itself.

      We appreciate your comment and suggestion. We have revised the graph in the manuscript accordingly.

      Author response image 5.

      • Please revise Figure S2 caption and/or its content. It is not clear, in parts b and c, which is the performance of H3-OPT. Why weren't some other antibody-specific tools such as IgFold included in this comparison?

      Thanks for your comments. The performance of H3-OPT is not included in Figure S2. Prior to training H3-OPT, we conducted several preliminary studies, and the detailed results are available in the supplementary sections. We showed that AlphaFold2 outperformed other methods (including AI-based methods and TBM methods) and produced sub-angstrom predictions in framework regions. The comparison of IgFold with other methods was discussed in a previous work (Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies, Ruffolo, J. A. et al, 2023). That study found that IgFold largely yielded results comparable to AlphaFold2 but at lower prediction cost. Additionally, we have conducted a detailed comparison of CDR-H3 loops with IgFold in our main text.

      • It is stated that "The relative binding affinities of the antigen-antibody complexes were evaluated using the Python script...". Which Python script?

      Thank you for your comments, and we apologize for the confusion. This Python script is a module of the AMBER software. We have corrected “The relative binding affinities of the antigen-antibody complexes were evaluated using the python script” into “The relative binding affinities of the antigen-antibody complexes were evaluated using the MMPBSA module of AMBER software”.

      Reviewer #3 (Recommendations For The Authors):

      Does H3-OPT improve the AF2 score on the CDR-H3? It would be interesting to see whether grafted and PSPM loops improve the pLDDT score by using for example AF2Rank [https://doi.org/10.1103/PhysRevLett.129.238101]. That could also be a way to include a confidence score into H3-OPT.

      We are grateful for your question. The previous version of H3-OPT could not provide a confidence score for its output, so we did not know whether H3-OPT improved the AF2 score.

      We appreciate your kind recommendations and have calculated the pLDDT scores of all models predicted by H3-OPT and AF2 using AF2Rank. We showed that the average of pLDDT scores of different predicted models did not match the results of Cα-RMSD values.

      Author response table 3.

      Therefore, we have trained a separate module to predict the confidence score of the optimized CDR-H3 loops. We hope that this module can provide users with reliable guidance on whether to use predicted CDR-H3 loops.

      The test case of Nb PDB id. 8CWU is an interesting example where AF2 outperforms H3-OPT and PLMs. The top AF2 model according to ColabFold (using default options and no template [https://doi.org/10.1038/s41592-022-01488-1]) shows a remarkably good model of the CDR-H3, explaining the low Cα-RMSD in the Extended Data Fig. 3. However, the pLDDT score of the 4 tip residues (out of 12), forming the hairpin of the CDR-H3 loop, pushes down the average value below the CBM cut-off of 80. I wonder if there is a lesson to learn from that test case. How sensitive is H3-OPT to the CBM cut-off definition? Have the authors tried weighting the residue pLDDT score by some structural criteria before averaging? I guess AF2 may have less confidence in hydrophobic tip residues in exposed loops as the solvent context may not provide enough support for the pLDDT score.

      Thanks for your valuable feedback. We showed the average Cα-RMSDs of our test set under different confidence cutoffs and the results have been added in the text accordingly.

      Author response table 4.

      We greatly appreciate your comment on this point. Inspired by your suggestions, we will explore the relationship between cutoff values and structural information in related work. Your feedback is highly valuable as it will contribute to the development of our approach.

      A comparison against the new folding prediction method OmegaFold [https://doi.org/10.1101/2022.07.21.500999] is missing. OmegaFold seems to outperform AF2, ESM, and IgFold among others in predicting the CDR-H3 loop conformation (See [https://doi.org/10.3390/molecules28103991] and [https://doi.org/10.1101/2022.07.21.500999]). Indeed, prediction of anti-VEGF Nb structure (PDB WT_QF_0329, chain B in supplementary data) by OmegaFold as implemented in ColabFold [https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/omegafold.ipynb] and setting 10 cycles, renders Cα-RMSD 1.472 Å for CDR-H3 (residues 98-115).

      We appreciate your valuable suggestion. We have added the comparison against OmegaFold in our manuscript. The results have been included in the manuscript (Fig 4a-b).

      Author response image 6.

      In our test set, OmegaFold outperformed ESMFold in predicting the CDR-H3 loop conformation. However, it failed to match the accuracy of AF2, IgFold, and H3-OPT. We discussed the difference between MSA-based methods (such as AlphaFold2) and MSA-free methods (such as IgFold) in predicting CDR-H3 loops. Similarly, OmegaFold provided results comparable to those of HelixFold-Single and other MSA-free methods but still failed to match the accuracy of AlphaFold2 and H3-OPT on Sub1.

      The time-consuming step in H3-OPT is the AF2 prediction. However, most of the time is spent in modeling the mAb and Nb scaffolds, which are already very well predicted by PLMs (See Fig. 4 in [https://doi.org/10.3390/molecules28103991]). Hence, why not use e.g. OmegaFold as the first step, whose score also correlates to the RMSD values [https://doi.org/10.3390/molecules28103991]? If that fails, then use AF2 or grafting. Alternatively, use a PLM model to generate a template, remove/mask the CDR loops (at least CDR-H3), and pass it as a template to AF2 to optimize the structure with or without MSA (e.g. using AF2Rank).

      Thanks for your professional feedback. It is true that the speed of MSA searching limits the application of high-throughput structure prediction. Previous studies have demonstrated that deep learning methods perform well on framework residues. We once tried to directly predict the conformations of CDR-H3 loops using PLM-based methods, but this initial version of H3-OPT, which lacked the CBM, could not replicate the accuracy of AF2 in Sub1. Similarly, we showed that IgFold and OmegaFold also provide lower accuracy in Sub1 (average Cα-RMSDs of 1.71 Å and 1.83 Å, respectively, whereas AF2 achieved an average of 1.07 Å). Therefore, the predictions of AlphaFold2 not only produce scaffolds but also provide the highest-quality CDR-H3 loops when high-resolution templates and MSA are available.

      Thank you once again for your kind recommendation. In the current version of H3-OPT, we have highlighted the strengths of H3-OPT in combining the AF2 and PLM models in various scenarios. AF2 can provide accurate predictions for short loops with fewer than 10 amino acids, and PLM-based models show little or no improvement in such cases. In the next version of H3-OPT, as the first step, we plan to replace the AF2 models with other methods if any accurate MSA-free method becomes available in the future.

      Line 115: The statement "IgFold provided higher accuracy in Sub3" is not supported by Fig. 2a.

      We are sorry for our carelessness. We have corrected “IgFold provided higher accuracy in Sub3” into “IgFold provided higher accuracy in Sub3 (Fig. 3a)”.

      Lines 195-203: What is the statistical significance of results in Fig 5a and 5b?

      Thank you for your kind comments. The surface residues of AF2 models are significantly higher than those of H3-OPT models (p < 0.005). In Fig. 5b, H3-OPT models predicted lower values than AF2 models in terms of various surface properties, including polarity (p <0.05) and hydrophilicity (p < 0.001).

      Lines 212-213: It is not easy to compare and quantify the differences between electrostatic maps in Fig. 5d. Showing a Δmap (e.g. map_model − map_experiment) would be a better option. Additionally, there is no methodological description of how the maps were generated nor the scale of the represented potential.

      Thank you for pointing this out. We have modified the figure (Fig. 5d) according to your recommendation and added the following sentences to clarify how the surface electrostatic potential was calculated:

      “Analysis of surface electrostatic potential

      We generated two-dimensional projections of CDR-H3 loop’s surface electrostatic potential using SURFMAP v2.0.0 (based on GitHub from February 2023: commit: e0d51a10debc96775468912ccd8de01e239d1900) with default parameters. The 2D surface maps were calculated by subtracting the surface projection of H3-OPT or AF2 predicted H3 loops to their native structures.”

      Author response image 7.

      Lines 237-240 and Table 2: What is the meaning of comparing the average free energy of the whole set? Why free energies should be comparable among test cases? I think the correct way is to compare the mean pair-to-pair difference to the experimental structure. Similarly, reporting a precision in the order of 0.01 kcal/mol seems too precise for the used methodology, what is the statistical significance of the results? Were sampling issues accounted for by performing replicates or longer MDs?

      Thanks for your rigorous advice and for pointing out these issues. We have modified the comparisons of free energies across the prediction methods and corrected the precision of these results. The average binding free energy of H3-OPT complexes is lower than that of AF2-predicted complexes, but the difference is not significant (p > 0.05).

      Author response table 4.

      Comparison of binding affinities obtained from MD simulations using AF2 and H3-OPT.

      Thanks for your comments on this perspective. Longer MD simulations often achieve better convergence for the average behavior of the system, while replicates provide insights into the variability and robustness of the results. In our manuscript, each MD simulation had a length of 100 nanoseconds, with the initial 90 nanoseconds dedicated to achieving system equilibrium, which was verified by monitoring RMSD (Root Mean Square Deviation). The remaining 10 nanoseconds of each simulation were used for the calculation of free energy. This approach allowed us to balance the need for extensive sampling with the verification of system stability.

      Regarding MD simulations for CDR-H3 refinement, its successful application highly depends on the starting conformation, the force field, and the sampling strategy [https://doi.org/10.1021/acs.jctc.1c00341]. In particular, the applied plain MD seems a very limited strategy (there is not much information about the simulated times in the supplementary material). Similarly, local structure optimizations with QM methods are not expected to improve a starting conformation that is far from the experimental conformation.

      Thank you very much for your valuable feedback. We fully agree with your insights regarding the limitations of MD simulations. Before training H3-OPT, we showed the challenge of accurately predicting CDR-H3 structures. We then tried to optimize the CDR-H3 loops with computational tools, such as MD simulations and QM methods (detailed information on the MD simulations is provided in the main text). Unfortunately, these methods did not improve the accuracy of AF2-predicted CDR-H3 loops. These results showed that MD simulations and QM methods are not only time-consuming but also fail to optimize the CDR-H3 loops. Therefore, we developed H3-OPT to tackle these issues and improve the accuracy of CDR-H3 prediction for the development of antibody therapeutics.

      Text improvements

      Relevant statistical and methodological parameters are presented in a dispersed manner throughout the text. For example, the number of structures in test, training, and validation datasets is first presented in the caption of Fig. 4. Similarly, the sequence identity % to define redundancy is defined in the caption of Fig. 1a instead of lines 87-88, where authors define "we constructed a non-redundant dataset with 1286 high-resolution (<2.5 Å)". Is the sequence redundancy for the CDR-H3 or the whole mAb/Nb?

      Thank you for pointing out these issues. We have added the number of structures in each subgroup in the caption of Fig. 1a: “Clustering of the filtered, high-resolution structures yielded three datasets for training (n = 1021), validation (n = 134), and testing (n = 131).” and corrected “As data quality has large effects on prediction accuracy, we constructed a non-redundant dataset with 1286 high-resolution (<2.5 Å) antibody structures from SAbDab” into “As data quality has large effects on prediction accuracy, we constructed a non-redundant dataset (sequence identity < 0.8) with 1286 high-resolution (<2.5 Å) antibody structures from SAbDab” in the revised manuscript. The sequence redundancy applies to the whole mAb/Nb.

      The description of ablation studies is not easy to follow. For example, what does removing TGM mean in practical terms (e.g. only AF2 is used, or PSPM is applied if AF2 score < 80)? Similarly, what does removing CBM mean in practical terms (e.g. all AF2 models are optimized by PSPM, and no grafting is done)?

      Thanks for your comments and suggestions. We have corrected “d, Differences in H3-OPT accuracy without the template module. e, Differences in H3-OPT accuracy without the CBM. f, Differences in H3-OPT accuracy without the TGM.” into “d, Differences in H3-OPT accuracy without the template module. This ablation study means only PSPM is used. e, Differences in H3-OPT accuracy without the CBM. This ablation study means input loop is optimized by TGM and PSPM. f, Differences in H3-OPT accuracy without the TGM. This ablation study means input loop is optimized by CBM and PSPM.”.

      Authors should report the values in the text using the same statistical descriptor that is used in the figures to help the analysis by the reader. For example, in lines 223-224 a precision score of 0.75 for H3-OPT is reported in the text (I assume this is the average value), while the median of ~0.85 is shown in Fig. 6a.

      Thank you for your careful checks. We have corrected “After identifying the contact residues of antigens by H3-OPT, we found that H3-OPT could substantially outperform AF2 (Fig. 6a), with a precision of 0.75 and accuracy of 0.94 compared to 0.66 precision and 0.92 accuracy of AF2.” into “After identifying the contact residues of antigens by H3-OPT, we found that H3-OPT could substantially outperform AF2 (Fig. 6a), with a median precision of 0.83 and accuracy of 0.97 compared to 0.64 precision and 0.95 accuracy of AF2.” in the appropriate place in the manuscript.
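      To make the reported descriptors concrete, the sketch below computes precision and accuracy of contact-residue identification for each complex and then takes the median across complexes. The residue sets are invented for illustration, and the definitions (precision = TP / (TP + FP), accuracy = (TP + TN) / all antigen residues) are the standard ones, assumed here rather than taken from the manuscript.

```python
from statistics import median

def contact_metrics(predicted, native, all_residues):
    """Precision and accuracy of predicted contact residues against native ones."""
    predicted, native, all_residues = set(predicted), set(native), set(all_residues)
    tp = len(predicted & native)                 # contacts found in both
    fp = len(predicted - native)                 # predicted but not native
    tn = len(all_residues - predicted - native)  # correctly left out
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(all_residues)
    return precision, accuracy

# Hypothetical antigen with residues 1-20, evaluated in two complexes.
residues = range(1, 21)
per_complex = [
    contact_metrics({3, 4, 5, 9}, {3, 4, 5, 6}, residues),
    contact_metrics({10, 11}, {10, 11, 12}, residues),
]
median_precision = median(p for p, _ in per_complex)
median_accuracy = median(a for _, a in per_complex)
```

      Reporting the median of these per-complex values in the text, as in the corrected sentence, keeps the numbers directly comparable with a box plot that displays medians.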

      Minor corrections

      Lines 91-94: What do length values mean? e.g. is 0-2 Å the RMSD from the experimental structure?

      We appreciate your comment and apologize for any confusion. The RMSD value is indeed measured from the experimental structure. It evaluates the deviation of the predicted CDR-H3 loop from the native structure and also reflects the degree of prediction difficulty for AlphaFold2. We have added the following sentence in the appropriate place in the revised manuscript: “(RMSD, a measure of the difference between the predicted structure and an experimental or reference structure)”.
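      For reference, the conventional definition of Cα-RMSD over a loop of N residues is given below. This is the textbook form, offered as an assumption about the metric used; the symbols (x_i for the Cα position of residue i in the predicted and experimental structures after superposition) are introduced here for illustration only.

```latex
\mathrm{RMSD}_{C\alpha}
  = \sqrt{\frac{1}{N}\sum_{i=1}^{N}
      \left\lVert \mathbf{x}_i^{\mathrm{pred}} - \mathbf{x}_i^{\mathrm{exp}} \right\rVert^{2}}
```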

      Line 120: is the "AF2 confidence score" for the full-length or CDR-H3?

      We appreciate your valuable comment and have corrected “Interestingly, we observed that AF2 confidence score shared a strong negative correlation with Cα-RMSDs (Pearson correlation coefficient =-0.67 (Fig. 2b)” into “Interestingly, we observed that AF2 confidence score of CDR-H3 shared a strong negative correlation with Cα-RMSDs (Pearson correlation coefficient =-0.67 (Fig. 2b)” in the revised manuscript.

      Line 166: Do authors mean "Taken" instead of "Token"?

      We are really sorry for our careless mistakes. Thank you for your reminder.

      Line 258: Reference to Fig. 1 seems wrong, do authors mean Fig. 4?

      We sincerely thank the reviewer for careful reading. As suggested by the reviewer, we have corrected the “Fig. 1” into “Fig. 4”.

      Author response image 7.

      Point out which plot corresponds to AF2 and which one to H3-OPT

      Thanks for pointing out this issue. We have added the legends of this figure in the proper positions in our manuscript.

    1. Author Response

      The following is the authors’ response to the current reviews.

      We thank both reviewers for their detailed and positive assessment of our work.

      To Reviewer #2, we have now explicated the pattern -- (QXQXQX>3)4 where X>3 denotes any length of three or more residues of any composition -- in the first paragraph of the discussion.
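      As a rough illustration, the pattern can be read literally as a regular expression. This sketch rests on assumptions: X is taken to be any single residue, the loop term is taken to mean "three or more residues" as the sentence states, and the four repeats are required to be contiguous. The names and test sequences are hypothetical and not from the manuscript.

```python
import re

# Literal reading of the pattern: Q-X-Q-X-Q followed by a run of three or more
# residues of any composition, repeated four times in a row (an assumption
# about the notation, not a validated nucleation predictor).
NUCLEUS_PATTERN = re.compile(r"(?:Q.Q.Q.{3,}){4}")

def has_nucleus_pattern(sequence):
    """True if the sequence contains four consecutive QXQXQ-plus-loop units."""
    return NUCLEUS_PATTERN.search(sequence) is not None

matching = "QAQAQGGG" * 4  # hypothetical: four QXQXQ segments with 3-residue loops
too_few = "QAQAQGGG" * 3   # hypothetical: only three such units
```

      Because the loop term accepts any residues, including Q, this literal reading is permissive; it is meant only to show the shape of the pattern.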

      To Reviewer #3, we have made slight modifications to the text in the “Q zippers poison themselves” results section, to attempt to further clarify the mechanism of self-poisoning.

      Briefly, the reviewer questions whether an alternative model -- where inhibition involves non-structured rather than Q-zipper containing oligomers -- better explains the data. We provided two lines of evidence that we believe exclude this alternative model. First, we point out in the first paragraph of the “Q zippers poison themselves” section that the cells that unexpectedly lack amyloid in the high concentration regime have negligible levels of AmFRET, indicating that the inhibitory oligomers themselves occur at low concentrations regardless of the total concentration, and are therefore limited by a kinetic barrier. Second, we point out in the third paragraph of the section that the severity of amyloid inhibition with respect to concentration has a sequence dependence that matches the expectation of converging phase boundaries for crystal polymorphs -- specifically, inhibition is most severe for sequences that have a local Q density just high enough to form a Q zipper on both sides of each strand. Inhibition is relaxed for sequences having more or fewer Qs than that threshold. In contrast, disordered oligomerization is not expected to have such a dependence on the precise pattern of Qs and Ns.


      The following is the authors’ response to the original reviews.

      We are pleased that the editors find our study valuable. We find that the reviewers’ criticisms largely arise from misunderstandings inherent to the conceptually challenging nature of the topic, rather than fundamental flaws, as we will elaborate here. We are grateful for the opportunity afforded by eLife to engage reviewers in what we intend to be a constructive public dialogue.

      Response to Reviewer 1

      This review is highly critical but lacks specifics. The reviewer’s criticisms reflect a position that seems to dismiss a critical role for (or perhaps even the existence of) conformational ordering in polyQ amyloid, which is untenable.

      The reviewer states that our objective to characterize the amyloid nucleus “rests on the assertion that polyQ forms amyloid structures to the exclusion of all other forms of solids”. We do not fully agree with this assertion because our findings show that detectable aggregation is rate-limited by conformational ordering, as evident by 1) its discontinuous relationship to concentration, 2) its acceleration by a conformational template, and 3) its strict dependence on very specific sequence features that are consistent with amyloid structure but not disordered aggregation.

      We strongly disagree with the reviewer’s subjective statement that we have not critically assessed our findings and that they do not stand up to scrutiny. This statement seems to rest on the perceived contradiction of our findings with that of Crick et al. 2013. Contrary to the reviewer’s assessment, we argue here that the conclusions of Crick et al. do more to support than to refute our findings. Briefly, Crick et al. investigated the aggregation of synthetic Q30 and Q40 peptides in vitro, wherein fibrils assembled from high concentrations of peptide were demonstrated to have saturating concentrations in the low micromolar range. As explained below, this finding of a saturating concentration does not refute our results. More relevant to the present work are their findings that “oligomers” accumulated over an hours-long timespan in solutions that are subsaturated with respect to fibrils, and these oligomers themselves have (nanomolar) critical concentrations. The authors postulated that the oligomers result from liquid–liquid demixing of intrinsically disordered polyglutamine. However, phase separation by a peptide is expected to fix its concentration in both the solute and condensed phases, and, because disordered phase separation is faster than amyloid formation, the postulated explanation removes the driving force for any amyloid phase with a critical solubility greater than that of the oligomers. In place of this interpretation that truly does appear to -- in the reviewer’s words -- “contradict basic physical principles of how homopolymers self-assemble”, we interpret these oligomers as evidence of Q zipper-containing self-poisoned multimers, rounded as an inherent consequence of self-poisoning (Ungar et al., 2005), and plausibly akin to semicrystalline spherulites that have been observed in other polymer crystal and amyloid-forming systems (Crist and Schultz, 2016; Vetri and Foderà, 2015). 
      Importantly, the physical parameters governing the transition between amyloid spherulites and fibrils have been characterized in the case of insulin (Smith et al. 2012), where it was found that spherulites form at lower protein concentrations than fibrils. This mirrors the observation by Crick et al. that fibrils have a higher solubility limit than the spherical oligomers.

      Further rebuttal to the perceived incompatibility of monomeric nucleation with the existence of a critical concentration for amyloid

      We appreciate that the concept of a monomeric nucleus can superficially appear inconsistent with the fact that crystalline solids such as polyQ amyloid have a saturating concentration, but this is only true if one neglects that polyQ amyloids are polymer crystals with intramolecular ordering. The perceived discrepancy is perhaps most easily dispelled by the fact that folded proteins can form crystals, and the folded state of the protein. These crystals have critical concentrations, and the protein subunits within them each have intramolecular crystalline order (in the form of secondary structure). When placed in a subsaturated solution, the protein crystals dissolve into the constituent monomers, and yet those monomers still retain intramolecular order. Our present findings for polyQ are conceptually no different.

      To further extrapolate this simple example to polyQ, one can also draw on the now well-established phenomenon of secondary nucleation, whereby transient interactions of soluble species with ordered species lead to their own ordering (Törnquist et al., 2018). Transience is important here because it implies that intramolecular ordering can in principle propagate even in solutions that are subsaturated with respect to bulk crystallization. This is possible in the present case because the pairing of sufficiently short beta strands (equivalent to “stems” in the polymer crystal literature) will be more stable intramolecularly than intermolecularly, due to the reduced entropic penalty of the former. Our elucidation that Q zipper ordering can occur with shorter strands intramolecularly than intermolecularly (Fig. S4C-D) demonstrates this fact. It is also evident from published descriptions of single molecule “crystals” formed in sufficiently dilute solutions of sufficiently long polymers (Hong et al., 2015; Keller, 1957; Lauritzen and Hoffman, 1960).

      In suggesting that a saturating concentration for amyloid rules out monomeric nucleation, the reviewer assumes that the Q zipper-containing monomer must be stable relative to the disordered ensemble. This is not inherent to our claim. The monomeric nucleating structure need not be more stable than the disordered state, and monomers may very well be disordered at equilibrium at low concentrations. To be clear, our claim requires that the Q zipper-containing monomer is both on pathway to amyloid and less stable than all subsequent species that are on pathway to amyloid. The former requirement is supported by our extensive mutational analysis. The latter requirement is supported by our atomistic simulations showing the Q zipper-containing monomer is stabilized by dimerization (included in our 2021 preprint). Hence, requisite ordering in the nucleating monomer is stabilized by intermolecular interactions. We provide in Author response image 1 an illustration to clarify what we believe to be the discrepancy between our claim and the reviewer’s interpretation.

      Author response image 1.

      That the rate-limiting fluctuation for a crystalline phase can occur in a monomer can also be understood as a consequence of Ostwald’s rule of stages, which describes the general tendency of supersaturated solutes, including amyloid forming proteins (Chakraborty et al., 2023), to populate metastable phases en route to more stable phases (De Yoreo, 2022; Schmelzer and Abyzov, 2017). Our findings with polyQ are consistent with a general mechanism for Ostwald’s rule wherein the relative stabilities of competing polymorphs differ with the number of subunits (De Yoreo, 2022; Navrotsky, 2004). As illustrated in Fig. 6 of Navrotsky, a polymorph that is relatively stable at small particle sizes tends to give way to a polymorph that -- while initially unstable -- becomes more stable as the particles grow. The former is analogous to our early stage Q zipper composed of two short sheets with an intramolecular interface, while the latter is analogous to the later stage Q zipper composed of longer sheets with an intermolecular interface. Subunit addition stabilizes the latter more than the former, hence the initial Q zipper that is stabilized more by intra- than intermolecular interactions will mature with growth to one that is stabilized more by intermolecular interactions.

      We have added a new figure (Fig. 6) to the manuscript to illustrate qualitative features of the amyloid pathway we have deduced for polyQ.

      Rebuttal to the perceived necessity of in vitro experiments

      The overarching concern of this reviewer and reviewing editor is whether in-cell assays can inform on sequence-intrinsic properties. We understand this concern. We believe however that the relative merit of in-cell assays is largely a matter of perspective. The truly sequence-intrinsic behavior of polyQ, i.e. in a vacuum, is less informative than the “sequence-intrinsic” behaviors of interest that emerge in the presence of extraneous molecules from the appropriate biological context. In vitro experiments typically include a tiny number of these -- water, ions, and sometimes a crowding agent meant to approximate everything else. Obviously missing are the myriad quinary interactions with other proteins that collectively round out the physiological solvent. The question is what experimental context best approximates that of a living human neuron under which the pathological sequence-dependent properties of polyQ manifest. We submit that a living yeast cell comes closer to that ideal than does buffer in a test tube.

      The reviewer’s statement that our findings must be validated in vitro ignores the fact -- stressed in our introduction -- that decades of in vitro work have not yet generated definitive evidence for or against any specific nucleus model. In addition to the above, one major problem concerns the large sizes of in vitro systems that obscure the effects of primary nucleation. For example, a typical in vitro experimental volume of 1.5 ml is over one billion-fold larger than the femtoliter volume of a cell. This means that any nucleation-limited kinetics of relevant amyloid formation are lost, and any alternative amyloid polymorphs that have a kinetic growth advantage -- even if they nucleate at only a fraction of the rate of relevant amyloid -- will tend to dominate the system (Buell, 2017). Novel approaches are clearly needed to address these problems. We present such an approach, stretch it to the limit (as the reviewer notes) across multiple complementary experiments, and arrive at a novel finding that is fully and uniquely consistent with all of our own data as well as the collective prior literature.
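
      The scale argument can be checked with simple arithmetic. The following Python sketch compares the 1.5 ml volume quoted above with the volume of a single cell. The 50 fl figure is an illustrative assumption for a yeast cell and is not taken from the manuscript; the function name is likewise hypothetical.

```python
# Compare a typical in vitro reaction volume with the volume of a single cell.
# ASSUMPTION: the cell volumes used below are illustrative; only the 1.5 ml
# figure and the "femtoliter" scale come from the text.

ML_IN_LITERS = 1e-3
FL_IN_LITERS = 1e-15


def fold_difference(tube_volume_ml: float, cell_volume_fl: float) -> float:
    """Return how many times larger the tube volume is than the cell volume."""
    return (tube_volume_ml * ML_IN_LITERS) / (cell_volume_fl * FL_IN_LITERS)


if __name__ == "__main__":
    # 1 fl is the bare "femtoliter" scale; 50 fl is a hypothetical yeast cell.
    for cell_fl in (1.0, 50.0):
        ratio = fold_difference(1.5, cell_fl)
        print(f"1.5 ml vs {cell_fl:g} fl cell: {ratio:.1e}-fold")
```

      Under either assumed cell volume the ratio exceeds one billion, in line with the statement above.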

      That the preceding considerations are collectively essential to understand relevant amyloid behavior is evident from recent cryoEM studies showing that in vitro-generated amyloid structures generally differ from those in patients (Arseni et al., 2022; Bansal et al., 2021; Radamaker et al., 2021; Schmidt et al., 2019; Schweighauser et al., 2020; Yang et al., 2022). This is highly relevant to the present discourse because each amyloid structure is thought to emanate from a different nucleating structure. This means that in vitro experiments have broadly missed the mark in terms of the relevant thermodynamic parameters that govern disease onset and progression. Note that the rules laid out via our studies are not only consistent with structural features of polyQ amyloid in cells, but also (as described in the discussion) explain why the endogenous structure of a physiologically relevant Q zipper amyloid differs from that of polyQ.

      A recent collaboration between the Morimoto and Knowles groups (Sinnige et al.) investigated the kinetics of aggregation by Q40-YFP expressed in C. elegans body wall muscle cells, using quantitative approaches that have been well established for in vitro amyloid-forming systems of the type favored by the reviewer. They calculate a reaction order of just 1.6, slightly higher than what would be expected for a monomeric nucleus but nevertheless fully consistent with our own conclusions when one accounts for the following two aspects of their approach. First, the polyQ tract in their construct is flanked by short poly-Histidine tracts on both sides. These charges very likely disfavor monomeric nucleation because all possible configurations of a four-stranded bundle position the beginning and end of the Q tract in close proximity, and Q40 is only just long enough to achieve monomeric nucleation in the absence of such destabilization. Second, the protein is fused to YFP, a weak homodimer (Landgraf et al., 2012; Snapp et al., 2003). With these two considerations, our model -- which was generated from polyQ tracts lacking flanking charges or an oligomeric fusion -- predicts that amyloid nucleation by their construct will occur more frequently as a dimer than a monomer. Indeed, their observed reaction order of 1.6 supports a predominantly dimeric nucleus. Like us and others, Sinnige et al. did not observe phase separation prior to amyloid formation. This is important because it not only argues against nucleation occurring in a condensate, it also suggests that the reaction order they calculated has not been limited by the concentration-buffering effect of phase separation.
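
      For readers less familiar with reaction orders, the idea can be sketched as follows. In the simple picture used above, if the nucleation rate scales as rate = k·c^n, then n is the slope of log(rate) against log(concentration), with n near 1 expected for a monomeric nucleus and n near 2 for a dimeric one. The Python sketch below recovers n from synthetic data generated with n = 1.6. It does not use data from Sinnige et al., it is not their fitting procedure, and the function name is hypothetical.

```python
import math


def reaction_order(concentrations, rates):
    """Least-squares slope of log(rate) versus log(concentration)."""
    xs = [math.log(c) for c in concentrations]
    ys = [math.log(r) for r in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # SYNTHETIC data: rate = k * c**1.6, with arbitrary k and arbitrary units.
    k, true_order = 0.01, 1.6
    concs = [1.0, 2.0, 4.0, 8.0, 16.0]
    rates = [k * c ** true_order for c in concs]
    print(f"fitted reaction order: {reaction_order(concs, rates):.2f}")
```

      A fitted value between 1 and 2, such as the 1.6 reported, is what one would expect from a mixture of monomeric and dimeric nucleation events, which is the interpretation offered above.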

      While we agree that our conclusions rest heavily on DAmFRET data (for good reason), we do provide supporting evidence from molecular dynamics simulations, SDD-AGE, and microscopy.

      To summarize, given the extreme limitations of in vitro experiments in this field, the breadth of our current study, and supporting findings from another lab using rigorous quantitative approaches, we feel that our claims are justified without in vitro data.

      Rebuttals to other critiques

      We do not deny that flanking domains can modulate the kinetics and stability of polyQ amyloid. However, as stated and referenced in the introduction, they do not appear to change the core structure. We have also added a paragraph concerning flanking domains to the discussion, and acknowledged that “the extent to which our findings will translate in these different contexts remains to be determined.” Nevertheless, that the intrinsic behavior of the polyQ tract itself is central to pathology is evident from the fact that the nine pathologic polyQ proteins have similar length thresholds despite different functions, flanking domains, interaction partners, and expression levels.

      The reviewer states that we found nucleation potential to require 60 Qs in a row. Our data are collectively consistent with nucleation occurring at and above approximately 36 Qs, a point repeated in the paper. The reviewer may be referring to our statement, “Sixty residues proved to be the optimum length to observe both the pre- and post-nucleated states of polyQ in single experiments”. The purpose of this statement is simply to describe the practical consideration that led us to use 60 Qs for the bulk of our assays. We do appreciate that the fraction of AmFRET-positive cells is very low for lengths just above the threshold, especially Q40. It is nevertheless highly significant (p = 0.004 in [PIN+] cells, one-tailed t-test), and we have modified the figure and text to clarify this.

      The reviewer characterizes self-poisoning as the hallmark of crystallization from polymer melts, which would be problematic for our conclusions if self-poisoning were limited to this non-physiological context. In fact the term was first used to describe crystallization from solution (Organ et al., 1989), wherein the phenomenon is more pronounced (Ungar et al., 2005).

      Response to Reviewer 2

      We thank the reviewer for their detailed and helpful critique.

      The reviewer correctly notes that the majority of our manipulations were conducted with 60-residue long tracts (which corresponds to disease onset in early adulthood), and this length facilitates intramolecular nucleation. However, we also analyzed a length series of polyQ spanning the pathological threshold, as well as a synthetic sequence designed explicitly to test the model nucleus structure with a tract shorter than the pathological threshold, and both experiments corroborate our findings.

      The reviewer mentions “several caveats” that come with our result, but their subsequent elaboration suggests they are to be interpreted more as considerations than caveats. We agree that increasing sequence complexity will tend to increase homogeneity, but this is exactly the motivation of our approach. We explicitly set out to determine the minimal complexity sequence sufficient to specify the nucleating conformation, which we ultimately identified in terms of secondary and tertiary structure. We do not specify which parts of a long polyQ tract correspond to which parts of the structure, because, as the reviewer points out, they can occur at many places. Hence, depending on the length of the polyQ tract, the nucleus we describe may have any length of sequence connecting the strand elements. We do not think that the effects of N-residue placement can be interpreted as a confounding influence on hairpin position because the striking even-odd pattern we observe implicates the sides of beta strands rather than the lengths. Moreover, we observe this pattern regardless of the residue used (Gly, Ser, Ala, and His in addition to Asn).

      We thank the reviewer for noting the novelty and plausibility of the self-poisoning connection. We would like to elaborate on our finding that self-poisoning inhibits nucleation (in addition to elongation), as this will be confusing to many readers. While self-poisoning is claimed to inhibit primary nucleation in the polymer crystal literature (Ungar et al., 2005; Zhang et al., 2018), the semantics of “nucleation” in this context warrant clarification. Technically, the same structure can be considered a nucleus in one context but not in another. The Q zipper monomer, even if it is rate-limiting for amyloid formation at low concentrations (and is therefore the “nucleus”), is not necessarily rate-limiting when self-poisoned at high concentrations. Whether it comprises the nucleus in this case depends on the rates of Q zipper formation relative to subunit addition to the poisoned state. If the latter happens more slowly than Q zipper formation de novo, it can be said that self-poisoning inhibits nucleation, regardless of whether the Q zipper formed. We suspect this to be the mechanism by which preemptive oligomerization blocks nucleation in the case of polyQ, though other mechanisms may be possible.

      We believe the revised text also now incorporates the remaining suggestions of this reviewer, with two exceptions. 1) We retain the phrase “hidden pattern”, because we believe our data argue for a nucleus whose formation requires that Qs occur in a pattern that we now elaborate as (QXQXQX>3)4 where X>3 denotes any length of three or more residues of any composition. In amyloids formed from long polyQ molecules, the nucleus will involve any subset of 12 Qs that match this pattern. 2) We decided not to re-order the manuscript to discuss self-poisoning after establishing the monomer nucleus (even though we agree that doing so would improve the logical flow) because the interpretation of the data with respect to self-poisoning helps to establish critical strand lengths, and self-poisoning creates an anomaly in the DAmFRET data that is difficult to ignore. We add text clarifying that high local concentrations “effectively shifts the rate-limiting step to the growth of a higher order relatively-disordered species”.
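
      The pattern notation can be made concrete with a regular expression. The Python sketch below is a literal transcription of (QXQXQX>3)4, assuming that X stands for any single residue and that the trailing X>3 after the fourth repeat is kept as written. It only tests whether a sequence contains the pattern and makes no prediction about nucleation; the function and constant names are hypothetical.

```python
import re

# Literal transcription of (QXQXQX>3)4:
#   "."     -> X, any single residue (including Q)
#   ".{3,}" -> X>3, three or more residues of any composition
#   "{4}"   -> the unit repeats four times, giving 12 patterned Qs
# ASSUMPTION: the trailing X>3 after the fourth repeat is kept as written.
NUCLEUS_PATTERN = re.compile(r"(?:Q.Q.Q.{3,}){4}")


def has_nucleus_pattern(sequence: str) -> bool:
    """Return True if any stretch of the sequence satisfies the pattern."""
    return NUCLEUS_PATTERN.search(sequence.upper()) is not None


if __name__ == "__main__":
    examples = {
        "Q60": "Q" * 60,            # uninterrupted polyQ
        "Q20": "Q" * 20,            # too short to contain four repeats
        "(QNN)x20": "QNN" * 20,     # Qs spaced two apart never give Q.Q.Q
    }
    for name, seq in examples.items():
        print(name, has_nucleus_pattern(seq))
```

      Because "." also matches Q, an uninterrupted polyQ tract of sufficient length contains many overlapping matches, consistent with the statement that the nucleus can involve any subset of 12 Qs that fit the pattern.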

      Response to Reviewer 3

      We thank the reviewer for their helpful comments.

      We opted to retain Figures 1A and B because we think they are important for comprehending the subject and objectives of the study. We modified the former to make it clearer. We have also elaborated on DAmFRET as it is a relatively new approach that may be unfamiliar to many readers. Beyond this, we refer the reviewer and readers to our cited prior work describing the theory and interpretation of DAmFRET. Note that the y-axes of DAmFRET plots are not raw FRET but rather “AmFRET”, a ratio of FRET to total expression level. As explained thoroughly in our cited prior work, the discontinuity of AmFRET with expression level indicates that the high-AmFRET population formed via a disorder-to-order transition. When the query protein is predicted to be intrinsically disordered, the discontinuous transition to high AmFRET invariably (among hundreds of proteins tested in prior published and unpublished work) signifies amyloid formation as corroborated by SDD-AGE and tinctorial assays.

      When performed using standard flow cytometry as in the present study, every AmFRET measurement corresponds to a cell-wide average, and hence does not directly inform on the distribution of the protein between different stoichiometric species. As there is only one fluorophore per protein molecule, monomeric nuclei have no signal. DAmFRET can distinguish cells expressing monomers from stable dimers from higher order oligomers (see e.g. Venkatesan et al. 2019), and we are therefore quite confident that AmFRET values of zero correspond to cells in which a vast majority of the respective protein is not in homo-oligomeric species (i.e. is monomeric or in hetero-complexes with endogenous proteins). The exact value of AmFRET, even for species with the same stoichiometry, will depend both on the effect of their respective geometries on the proximity of mEos3.1 fluorophores, and on the fraction of protein molecules in the species. Hence, we only attempt to interpret the plateau values of AmFRET (where the fraction of protein in an assembled state approaches unity) as directly informing on structure, as we did in Fig. S3A.

      We believe that AmFRET decreases with longer polyQ because the mass fraction of fluorophore decreases in the aggregate, simply because the extra polypeptide takes up volume in the aggregate.

      Yes, the fraction of positive cells in a discontinuous DAmFRET plot does increase with time. However, given the more laborious data collection and derivation of nucleation kinetics in a system with ongoing translation, especially across hundreds of experiments with other variables, ours is a snapshot measurement to approximately derive the relative contributions of intra- and intermolecular fluctuations to the nucleation barrier, rather than the barrier’s magnitude.

      We have revised the tautological statement by removing “non-amyloid containing”.

      Concerning the correlation of our data with the pathological length threshold -- as we state in the first results section, “Our data recapitulated the pathologic threshold -- Q lengths 35 and shorter lacked AmFRET, indicating a failure to aggregate or even appreciably oligomerize, while Q lengths 40 and longer did acquire AmFRET in a length and concentration-dependent manner”. Hence, most of our experiments were conducted with 60Q not because it resembles the pathological threshold, but rather because it was most convenient for DAmFRET experiments.

      Self-poisoning is a widely observed and heavily studied phenomenon in polymer crystal physics, though it seems not yet to have entered the lexicon of amyloid biologists. We were new to this concept before it emerged as an extremely parsimonious explanation for our results. As described in the text, two pieces of evidence exclude the alternative mechanism suggested by the reviewer -- that non-structured oligomers form and subsequently engage and inhibit the template. Specifically, 1) inhibition occurs without any detectable FRET, even at high total protein concentration, indicating the species do not form in a concentration-dependent manner that would be expected of disordered oligomers; and 2) inhibition itself has strict sequence requirements that match those of Q zippers. Hence our data collectively suggest that inhibition is a consequence of the deposition of partially ordered molecules onto the templating surface.

      We have softened the subheading and text of the relevant section in the discussion to more clearly indicate the speculative nature of our statements concerning the possible role of self-poisoned oligomers in toxicity.

      We stand by our statement 'that kinetically arrested aggregates emerge from the same nucleating event responsible for amyloid formation', as this follows directly from self-poisoning.

      Regarding the arguments for lateral and axial growth, we agree that the data are indirect. However, that polyQ forms lamellar amyloids both in vitro and in vivo is now established, so we do not feel it necessary to rigorously show that here. Nevertheless, we need to include this section primarily because it introduces the fact that ordering in polyQ amyloid occurs in the lateral as well as axial dimensions, and the onset of lateral ordering (lamellar growth) explains the very different behaviors of QU and QB sequences apparent on the DAmFRET plots. Ultimately, the two dimensions of growth are important to understand self-poisoning and maturation of the short nucleating zipper to amyloid.

      References

      Arseni D, Hasegawa M, Murzin AG, Kametani F, Arai M, Yoshida M, Ryskeldi-Falcon B. 2022. Structure of pathological TDP-43 filaments from ALS with FTLD. Nature 601:139–143. doi:10.1038/s41586-021-04199-3

      Bansal A, Schmidt M, Rennegarbe M, Haupt C, Liberta F, Stecher S, Puscalau-Girtu I, Biedermann A, Fändrich M. 2021. AA amyloid fibrils from diseased tissue are structurally different from in vitro formed SAA fibrils. Nat Commun 12:1013. doi:10.1038/s41467-021-21129-z

      Buell AK. 2017. The Nucleation of Protein Aggregates - From Crystals to Amyloid Fibrils. Int Rev Cell Mol Biol 329:187–226. doi:10.1016/bs.ircmb.2016.08.014

      Chakraborty D, Straub JE, Thirumalai D. 2023. Energy landscapes of Aβ monomers are sculpted in accordance with Ostwald’s rule of stages. Sci Adv 9:eadd6921. doi:10.1126/sciadv.add6921

      Crist B, Schultz JM. 2016. Polymer spherulites: A critical review. Prog Polym Sci 56:1–63. doi:10.1016/j.progpolymsci.2015.11.006

      De Yoreo JJ. 2022. Casting a bright light on Ostwald’s rule of stages. Proc Natl Acad Sci USA 119. doi:10.1073/pnas.2121661119

      Hong Y, Yuan S, Li Z, Ke Y, Nozaki K, Miyoshi T. 2015. Three-Dimensional Conformation of Folded Polymers in Single Crystals. Phys Rev Lett 115:168301. doi:10.1103/PhysRevLett.115.168301

      Keller A. 1957. A note on single crystals in polymers: Evidence for a folded chain configuration. Philosophical Magazine 2:1171–1175. doi:10.1080/14786435708242746

      Landgraf D, Okumus B, Chien P, Baker TA, Paulsson J. 2012. Segregation of molecules at cell division reveals native protein localization. Nat Methods 9:480–482. doi:10.1038/nmeth.1955

      Lauritzen JI, Hoffman JD. 1960. Theory of Formation of Polymer Crystals with Folded Chains in Dilute Solution. J Res Natl Bur Stand A Phys Chem 64A:73–102. doi:10.6028/jres.064A.007

      Navrotsky A. 2004. Energetic clues to pathways to biomineralization: precursors, clusters, and nanoparticles. Proc Natl Acad Sci USA 101:12096–12101. doi:10.1073/pnas.0404778101

      Ohhashi Y, Ito K, Toyama BH, Weissman JS, Tanaka M. 2010. Differences in prion strain conformations result from non-native interactions in a nucleus. Nat Chem Biol 6:225–230. doi:10.1038/nchembio.306

      Organ SJ, Ungar G, Keller A. 1989. Rate minimum in solution crystallization of long paraffins. Macromolecules 22:1995–2000. doi:10.1021/ma00194a078

      Radamaker L, Baur J, Huhn S, Haupt C, Hegenbart U, Schönland S, Bansal A, Schmidt M, Fändrich M. 2021. Cryo-EM reveals structural breaks in a patient-derived amyloid fibril from systemic AL amyloidosis. Nat Commun 12:875. doi:10.1038/s41467-021-21126-2

      Sahoo B, Singer D, Kodali R, Zuchner T, Wetzel R. 2014. Aggregation behavior of chemically synthesized, full-length huntingtin exon1. Biochemistry 53:3897–3907. doi:10.1021/bi500300c

      Schmelzer JWP, Abyzov AS. 2017. How do crystals nucleate and grow: Ostwald’s rule of stages and beyond. In: Šesták J, Hubík P, Mareš JJ, editors. Thermal Physics and Thermal Analysis, Hot Topics in Thermal Analysis and Calorimetry. Cham: Springer International Publishing. pp. 195–211. doi:10.1007/978-3-319-45899-1_9

      Schmidt M, Wiese S, Adak V, Engler J, Agarwal S, Fritz G, Westermark P, Zacharias M, Fändrich M. 2019. Cryo-EM structure of a transthyretin-derived amyloid fibril from a patient with hereditary ATTR amyloidosis. Nat Commun 10:5008. doi:10.1038/s41467-019-13038-z

      Schweighauser M, Shi Y, Tarutani A, Kametani F, Murzin AG, Ghetti B, Matsubara T, Tomita T, Ando T, Hasegawa K, Murayama S, Yoshida M, Hasegawa M, Scheres SHW, Goedert M. 2020. Structures of α-synuclein filaments from multiple system atrophy. Nature 585:464–469. doi:10.1038/s41586-020-2317-6

      Snapp EL, Hegde RS, Francolini M, Lombardo F, Colombo S, Pedrazzini E, Borgese N, Lippincott-Schwartz J. 2003. Formation of stacked ER cisternae by low affinity protein interactions. J Cell Biol 163:257–269. doi:10.1083/jcb.200306020

      Törnquist M, Michaels TCT, Sanagavarapu K, Yang X, Meisl G, Cohen SIA, Knowles TPJ, Linse S. 2018. Secondary nucleation in amyloid formation. Chem Commun 54:8667–8684. doi:10.1039/c8cc02204f

      Ungar G, Putra EGR, de Silva DSM, Shcherbina MA, Waddon AJ. 2005. The Effect of Self-Poisoning on Crystal Morphology and Growth Rates. In: Allegra G, editor. Interphases and Mesophases in Polymer Crystallization I, Advances in Polymer Science. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 45–87. doi:10.1007/b107232

      Vetri V, Foderà V. 2015. The route to protein aggregate superstructures: Particulates and amyloid-like spherulites. FEBS Lett 589:2448–2463. doi:10.1016/j.febslet.2015.07.006

      Wild EJ, Boggio R, Langbehn D, Robertson N, Haider S, Miller JRC, Zetterberg H, Leavitt BR, Kuhn R, Tabrizi SJ, Macdonald D, Weiss A. 2015. Quantification of mutant huntingtin protein in cerebrospinal fluid from Huntington’s disease patients. The Journal of Clinical Investigation.

      Yang Y, Arseni D, Zhang W, Huang M, Lövestam S, Schweighauser M, Kotecha A, Murzin AG, Peak-Chew SY, Macdonald J, Lavenir I, Garringer HJ, Gelpi E, Newell KL, Kovacs GG, Vidal R, Ghetti B, Ryskeldi-Falcon B, Scheres SHW, Goedert M. 2022. Cryo-EM structures of amyloid-β 42 filaments from human brains. Science 375:167–172. doi:10.1126/science.abm7285

      Zhang X, Zhang W, Wagener KB, Boz E, Alamo RG. 2018. Effect of Self-Poisoning on Crystallization Kinetics of Dimorphic Precision Polyethylenes with Bromine. Macromolecules 51:1386–1397. doi:10.1021/acs.macromol.7b02745

    2. Author Response

      eLife assessment

      In this valuable study, the authors investigate the mechanism of amyloid nucleation in a cellular system using their novel ratiometric measurements and uncover interesting insights regarding the role of polyglutamine length and the sequence features of glutamine-rich regions on amyloid formation. Overall, the problem is significant and being able to assess nucleation in cells is of considerable relevance. The data, as presented and analyzed, are currently still incomplete. The specific claims would be stronger if based on in vitro measurements that avoid the intricacies of specific cellular systems and that are more suitable for assessing sequence-intrinsic properties.

      We are pleased that the editors find our study valuable. We find that the reviewers’ criticisms largely arise from misunderstandings inherent to the conceptually challenging nature of the topic, rather than fundamental flaws, as we will elaborate here. We are grateful for the opportunity afforded by eLife to engage reviewers in a constructive public dialogue.

      Reviewer #1 (Public Review):

      The authors take on the challenge of defining the core nucleus for amyloid formation by polyglutamine tracts. This rests on the assertion that polyQ forms amyloid structures to the exclusion of all other forms of solids. Using their unique assay, deployed in yeast, the authors attempt to infer the size of the nucleus that templates amyloid formation by polyQ. Further, through a series of sequence titrations, all studied using a single type of assay, the authors converge on an assertion stating that a single polyQ molecule is the nucleus for amyloid formation, that 12-residues make up the core of the nucleus, that it takes ca. 60 Qs in a row to unmask this nucleation potential, and that polyQ amyloid formation belongs to the same universality class as self-poisoned crystallization, which is the hallmark of crystallization from polymer melts formed by large, high molecular weight synthetic polymers. Unfortunately, the authors have decided to lean in hard on their assertions without a critical assessment of whether their findings stand up to scrutiny. If their findings are truly an intrinsic property of polyQ molecules, then their findings should be reconstituted in vitro. Unfortunately, careful and rigorous experiments in vitro show that there is a threshold concentration for forming fibrillar solids. This threshold concentration depends on the flanking sequence context on temperature and on solution conditions. The existence of a threshold concentration defies the expectation of a monomer nucleus. The findings disagree with in vitro data presented by Crick et al., and ignored by the authors. Please see: https://doi.org/10.1073/pnas.1320626110. These reports present data from very different assays, the importance of which was underscored first by Regina Murphy and colleagues. The work of Crick et al., provides a detailed thermodynamic framework - see the SI Appendix. 
      This framework dovetails with theory and simulations of Zhang and Muthukumar, which explains exactly how a system like polyQ might work (https://doi.org/10.1063/1.3050295). The picture one paints is radically different from what the authors converge upon. One is inclined to lean toward data that are gleaned using multiple methods in vitro because the test tube does not have all the confounding effects of a cellular milieu, especially when it comes to focusing on sequence-intrinsic conformational transitions of a protein. In addition to concerns about the limitations of the DAmFRET method, which, based on the work of the authors in their collaborative paper by Posey et al., are being stretched to the limit, there is the real possibility that the cellular milieu, unique to the system being studied, is enabling transitions that are not necessarily intrinsic to the sequence alone. A nod in this direction is the work of Marc Diamond, which showed that having stabilized the amyloid form of Tau through coacervation, there is a large barrier that limits the loss of amyloid-like structure for Tau. There may well be something similar going on with the polyQ system. If the authors could show that their data are achievable in vitro without anything but physiological buffers, one would have more confidence in a model that appears to contradict basic physical principles of how homopolymers self-assemble. Absent such additional evidence, numerous statements seem to be too strong. There are also several claims that are difficult to understand or appreciate.

      Rebuttal to the perceived necessity of in vitro experiments

      The overarching concern of this reviewer and reviewing editor is whether in-cell assays can inform on sequence-intrinsic properties. We understand this concern. We believe however that the relative merit of in-cell assays is largely a matter of perspective. The truly sequence-intrinsic behavior of polyQ, i.e. in a vacuum, is less informative than the “sequence-intrinsic” behaviors of interest that emerge in the presence of extraneous molecules from the appropriate biological context. In vitro experiments typically include a tiny number of these -- water, ions, and sometimes a crowding agent meant to approximate everything else. Obviously missing are the myriad quinary interactions with other proteins that collectively round out the physiological solvent. The question is what experimental context best approximates that of a living human neuron under which the pathological sequence-dependent properties of polyQ manifest. We submit that a living yeast cell comes closer to that ideal than does buffer in a test tube.

      The reviewer’s statement that our findings must be validated in vitro ignores the fact -- stressed in our introduction -- that decades of in vitro work have not yet generated definitive evidence for or against any specific nucleus model. In addition to the above, one major problem concerns the large sizes of in vitro systems that obscure the effects of primary nucleation. For example, a typical in vitro experimental volume of 1.5 ml is over one billion-fold larger than the femtoliter volume of a cell. This means that any nucleation-limited kinetics of relevant amyloid formation are lost, and any alternative amyloid polymorphs that have a kinetic growth advantage -- even if they nucleate at only a fraction of the rate of relevant amyloid -- will tend to dominate the system (Buell, 2017). Novel approaches are clearly needed to address these problems. We present such an approach, stretch it to the limit (as the reviewer notes) across multiple complementary experiments, and arrive at a novel finding that is fully and uniquely consistent with all of our own data as well as the collective prior literature.

      That the preceding considerations are collectively essential to understand relevant amyloid behavior is evident from recent cryoEM studies showing that in vitro-generated amyloid structures generally differ from those in patients (Arseni et al., 2022; Bansal et al., 2021; Radamaker et al., 2021; Schmidt et al., 2019; Schweighauser et al., 2020; Yang et al., 2022). This is highly relevant to the present discourse because each amyloid structure is thought to emanate from a different nucleating structure. This means that in vitro experiments have broadly missed the mark in terms of the relevant thermodynamic parameters that govern disease onset and progression. Note that the rules laid out via our studies are not only consistent with structural features of polyQ amyloid in cells, but also (as described in the discussion) explain why the endogenous structure of a physiologically relevant Q zipper amyloid differs from that of polyQ.

      A recent collaboration between the Morimoto and Knowles groups (Sinnige et al.) investigated the kinetics of aggregation by Q40-YFP expressed in C. elegans body wall muscle cells, using quantitative approaches that have been well established for in vitro amyloid-forming systems of the type favored by the reviewer. They calculate a reaction order of just 1.6, slightly higher than what would be expected for a monomeric nucleus but nevertheless fully consistent with our own conclusions when one accounts for the following two aspects of their approach. First, the polyQ tract in their construct is flanked by short poly-Histidine tracts on both sides. These charges very likely disfavor monomeric nucleation because all possible configurations of a four-stranded bundle position the beginning and end of the Q tract in close proximity, and Q40 is only just long enough to achieve monomeric nucleation in the absence of such destabilization. Second, the protein is fused to YFP, a weak homodimer (Landgraf et al., 2012; Snapp et al., 2003). With these two considerations, our model -- which was generated from polyQ tracts lacking flanking charges or an oligomeric fusion -- predicts that amyloid nucleation by their construct will occur more frequently as a dimer than a monomer. Indeed, their observed reaction order of 1.6 supports a predominantly dimeric nucleus. Like us and others, Sinnige et al. did not observe phase separation prior to amyloid formation. This is important because it not only argues against nucleation occurring in a condensate, it also suggests that the reaction order they calculated has not been limited by the concentration-buffering effect of phase separation.
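
      To illustrate how a reaction order between 1 and 2 can arise, the toy model below assumes that nucleation proceeds through two parallel channels, one monomeric (rate proportional to c) and one dimeric (rate proportional to c squared). This is a hedged sketch and not the kinetic analysis performed by Sinnige et al.; the rate constants `k1` and `k2` and both function names are hypothetical. Under this assumption, the apparent order equals one plus the fraction of nucleation flux carried by the dimeric channel, so an order of 1.6 would correspond to roughly 60% dimeric nucleation.

```python
def apparent_order(c: float, k1: float, k2: float) -> float:
    """Apparent reaction order d ln(rate) / d ln(c) for rate = k1*c + k2*c**2.

    k1 and k2 are hypothetical rate constants for the monomeric and
    dimeric nucleation channels of this toy model.
    """
    mono = k1 * c          # flux through the monomeric channel
    dimer = k2 * c ** 2    # flux through the dimeric channel
    if mono + dimer <= 0:
        raise ValueError("total nucleation rate must be positive")
    return (mono + 2.0 * dimer) / (mono + dimer)


def dimer_flux_fraction(order: float) -> float:
    """Fraction of nucleation flux through the dimeric channel in this toy model."""
    if not 1.0 <= order <= 2.0:
        raise ValueError("the two-channel model only spans orders from 1 to 2")
    return order - 1.0


if __name__ == "__main__":
    # Assumed constants chosen so that the dimeric channel carries 60% of the flux.
    print(apparent_order(c=1.0, k1=1.0, k2=1.5))
    print(dimer_flux_fraction(1.6))
```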

      While we agree that our conclusions rest heavily on DAmFRET data (for good reason), we do provide supporting evidence from molecular dynamics simulations, SDD-AGE, and microscopy.

      To summarize, given the extreme limitations of in vitro experiments in this field, the breadth of our current study, and supporting findings from another lab using rigorous quantitative approaches, we feel that our claims are justified without in vitro data.

      Rebuttal to the perceived incompatibility of monomeric nucleation with the existence of a critical concentration for amyloid

      We appreciate that the concept of a monomeric nucleus can superficially appear inconsistent with the fact that crystalline solids such as polyQ amyloid have a saturating concentration, but this is only true if one neglects that polyQ amyloids are polymer crystals with intramolecular ordering. The perceived discrepancy is perhaps most easily dispelled by protein crystallography. Folded proteins form crystals. These crystals have critical concentrations, and the protein subunits within them each have intramolecular crystalline order (in the form of secondary structure). To extrapolate these familiar examples to our present finding with polyQ, one need only appreciate the now well-established phenomenon of secondary nucleation, whereby transient interactions of soluble species with the ordered species lead to their own ordering (Törnquist et al., 2018). Transience is important here because it implies that intramolecular ordering can in principle propagate even in solutions that are subsaturated with respect to bulk crystallization. This is possible in the present case because the pairing of sufficiently short beta strands (equivalent to “stems” in the polymer crystal literature) will be more stable intramolecularly than intermolecularly, due to the reduced entropic penalty of the former. Our elucidation that Q zipper ordering can occur with shorter strands intramolecularly than intermolecularly (Fig. S4C-D) demonstrates this fact. It is also evident from published descriptions of single molecule “crystals” formed in sufficiently dilute solutions of sufficiently long polymers (Hong et al., 2015; Keller, 1957; Lauritzen and Hoffman, 1960).

      In suggesting that a saturating concentration for amyloid rules out monomeric nucleation, the reviewer assumes that the Q zipper-containing monomer must be stable relative to the disordered ensemble. This is not inherent to our claim and in fact opposes the definition of a nucleus. The monomeric nucleating structure need not be more stable than the disordered state, and monomers may very well be disordered at equilibrium at low concentrations. To be clear, our claim requires that the Q zipper-containing monomer is both on pathway to amyloid and less stable than all subsequent species that are on pathway to amyloid. The former requirement is supported by our extensive mutational analysis. The latter requirement is supported by our atomistic simulations showing the Q zipper-containing monomer is stabilized by dimerization (see our 2021 preprint). Hence, requisite ordering in the nucleating monomer is stabilized by intermolecular interactions. We provide in Author response image 1 an illustration to clarify what we believe to be the discrepancy between our claim and the reviewer’s interpretation.

      Author response image 1.

      That the rate-limiting fluctuation for a crystalline phase can occur in a monomer can also be understood as a consequence of Ostwald’s rule of stages, which describes the general tendency of supersaturated solutes, including amyloid forming proteins (Chakraborty et al., 2023), to populate metastable phases en route to more stable phases (De Yoreo, 2022; Schmelzer and Abyzov, 2017). Our findings with polyQ are consistent with a general mechanism for Ostwald’s rule wherein the relative stabilities of competing polymorphs differ with the number of subunits (De Yoreo, 2022; Navrotsky, 2004). As illustrated in Fig. 6 of Navrotsky, a polymorph that is relatively stable at small particle sizes tends to give way to a polymorph that -- while initially unstable -- becomes more stable as the particles grow. The former is analogous to our early stage Q zipper composed of two short sheets with an intramolecular interface, while the latter is analogous to the later stage Q zipper composed of longer sheets with an intermolecular interface. Subunit addition stabilizes the latter more than the former, hence the initial Q zipper that is stabilized more by intra- than intermolecular interactions will mature with growth to one that is stabilized more by intermolecular interactions.

      We apologize to the Pappu group for neglecting to cite Crick et al. 2013 in the current preprint. Contrary to the reviewer’s assessment, however, we find that the conclusions of this valuable study do more to support than to refute our findings. Briefly, Crick et al. investigated the aggregation of synthetic Q30 and Q40 peptides in vitro, wherein fibrils assembled from high concentrations of peptide were demonstrated to have saturating concentrations in the low micromolar range. As explained above, this finding of a saturating concentration does not refute our results. More relevant to the present work are their findings that “oligomers” accumulated over an hours-long timespan in solutions that are subsaturated with respect to fibrils, and these oligomers themselves have (nanomolar) critical concentrations. The authors postulated that the oligomers result from liquid–liquid demixing of intrinsically disordered polyglutamine. However, phase separation by a peptide is expected to fix its concentration in both the solute and condensed phases, and, because disordered phase separation is inherently faster than amyloid formation, the postulated explanation removes the driving force for any amyloid phase with a critical solubility greater than that of the oligomers. In place of this interpretation that truly does appear to -- in the reviewer’s words -- “contradict basic physical principles of how homopolymers self-assemble”, we interpret these oligomers as evidence of our Q zipper-containing self-poisoned multimers, rounded as an inherent consequence of self-poisoning (Ungar et al., 2005), and likely akin to semicrystalline spherulites that have been observed in other polymer crystal and amyloid-forming systems (Crist and Schultz, 2016; Vetri and Foderà, 2015). That Crick et al. also observed the formation of a relatively labile amyloid phase when the reactions were started with 50 µM peptide is unsurprising in light of the aforementioned kinetic advantage that large reaction volumes can confer to labile polymorphs, and that high concentrations (in this case, orders of magnitude higher than the likely physiological concentration of polyQ (Wild et al., 2015)) can favor the formation of labile amyloid polymorphs (Ohhashi et al., 2010). Indeed, a contemporaneous study by the Wetzel group using very similar peptide constructs and polyQ lengths -- but beginning with lower concentrations -- found that the relevant saturating concentrations for amyloid lie below their limit of detection of 100 nM (Sahoo et al., 2014).

      Rebuttals to other critiques

      The reviewer states that we found nucleation potential to require 60 Qs in a row. Our data are collectively consistent with nucleation occurring at and above approximately 36 Qs, a point repeated in the paper. The reviewer may be referring to our statement, “Sixty residues proved to be the optimum length to observe both the pre- and post-nucleated states of polyQ in single experiments”. The purpose of this statement is simply to describe the practical consideration that led us to use 60 Qs for the bulk of our assays. We do appreciate that the fraction of AmFRET-positive cells is very low for lengths just above the threshold, especially Q40. The fractions are nevertheless highly significant (p = 0.004 in [PIN+] cells, one-tailed t-test), and we will modify the figure and text to clarify this.

      The reviewer characterizes self-poisoning as the hallmark of crystallization from polymer melts, which would be problematic for our conclusions if self-poisoning were limited to this non-physiological context. In fact the term was first used to describe crystallization from solution (Organ et al., 1989), wherein the phenomenon is more pronounced (Ungar et al., 2005).

      Reviewer #2 (Public Review):

      Numerous neurodegenerative diseases are thought to be driven by the aggregation of proteins into insoluble filaments known as "amyloids". Despite decades of research, the mechanism by which proteins convert from the soluble to insoluble state is poorly understood. In particular, the initial nucleation step has proven especially elusive to both experiments and simulation. This is because the critical nucleus is thermodynamically unstable, and therefore occurs too infrequently to directly observe. Furthermore, after nucleation much faster processes like growth and secondary nucleation dominate the kinetics, which makes it difficult to isolate the effects of the initial nucleation event. In this work Kandola et al. attempt to surmount these obstacles using individual yeast cells as microscopic reaction vessels. The large number of cells, and their small size, provides the statistics to separate the cells into pre- and post-nucleation populations, allowing them to obtain nucleation rates under physiological conditions. By systematically introducing mutations into the amyloid-forming polyglutamine core of huntingtin protein, they deduce the probable structure of the amyloid nucleus. This work shows that, despite the complexity of the cellular environment, the seemingly random effects of mutations can be understood with a relatively simple physical model. Furthermore, their model shows how amyloid nucleation and growth differ in significant ways, which provides testable hypotheses for probing how different steps in the aggregation pathway may lead to neurotoxicity.

      In this study Kandola et al. probe the nucleation barrier by observing a bimodal distribution of cells that contain aggregates; the cells containing aggregates have had a stochastic fluctuation allowing the proteins to surmount the barrier, while those without aggregates have yet to have a fluctuation of suitable size. The authors confirm this interpretation with the selective manipulation of the PIN gene, which provides an amyloid template that allows the system to skip the nucleation event.

      In simple systems lacking internal degrees of freedom (i.e., colloids or rigid molecules) the nucleation barrier comes from a significant entropic cost that comes from bringing molecules together. In large aggregates this entropic cost is balanced by attractive interactions between the particles, but small clusters are unable to form the extensive network of stabilizing contacts present in the larger aggregates. Therefore, the initial steps in nucleation incur an entropic cost without compensating attractive interactions (this imbalance can be described as a surface tension). When internal degrees of freedom are present, such as the conformational states of a polypeptide chain, there is an additional contribution to the barrier coming from the loss of conformational entropy required to adopt the aggregation-prone state(s). In such systems the clustering and conformational processes do not necessarily coincide, and a major challenge in studying nucleation is to separate out these two contributions to the free energy barrier. Surprisingly, Kandola et al. find that the critical nucleus occurs within a single molecule. This means that the largest contribution to the barrier comes from the conformational entropy cost of adopting the beta-sheet state. Once this state is attained, additional molecules can be recruited with a much lower free energy barrier.

      There are several caveats that come with this result. First, the height of the nucleation barrier(s) comes from the relative strength of the entropic costs compared to the binding affinities. This balance determines how large a nascent nucleus must grow before it can form interactions comparable to a mature aggregate. In amyloid nuclei the first three beta strands form immature contacts consisting of either side chain or backbone contacts, whereas the fourth strand is the first that is able to form both kinds of contacts (as in a mature fibril). This study used relatively long polypeptides of 60 amino acids. This is greater than the 20-40 amino acids found in amyloid-forming molecules like ABeta or IAPP. As a result, Kandola et al.'s molecules are able to fold enough times to create four beta strands and generate mature contacts intramolecularly. The authors make the plausible claim that these intramolecular folds explain the well-known length threshold (L~35) observed in polyQ diseases. The intramolecular folds reduce the importance of clustering multiple molecules together and increase the importance of the conformational states. Similarly, manipulating the sequence or molecular concentrations will be expected to manipulate the relative magnitude of the binding affinities and the clustering entropy, which will shift the relative heights of the entropic barriers.

      The reviewer correctly notes that the majority of our manipulations were conducted with 60-residue long tracts (which corresponds to disease onset in early adulthood), and this length facilitates intramolecular nucleation. However, we also analyzed a length series of polyQ spanning the pathological threshold, as well as a synthetic sequence designed explicitly to test the model nucleus structure with a tract shorter than the pathological threshold, and both experiments corroborate our findings.

      The authors make an important point that the structure of the nucleus does not necessarily resemble that of the mature fibril. They find that the critical nucleus has a serpentine structure that is required by the need to form four beta strands to get the first mature contacts. However, this structure comes at a cost because residues in the hairpins cannot form strong backbone or zipper interactions. Mature fibrils offer a beta sheet template that allows incoming molecules to form mature contacts immediately. Thus, it is expected that the role of the serpentine nucleus is to template a more extended beta sheet structure that is found in mature fibrils.

      A second caveat of this work is the striking homogeneity of the nucleus structure they describe. This homogeneity is likely to be somewhat illusory. Homopolymers, like polyglutamine, have a discrete translational symmetry, which implies that the hairpins needed to form multiple beta sheets can occur at many places along the sequence. The asparagine residues introduced by the authors place limitations on where the hairpins can occur, and should be expected to increase structural homogeneity. Furthermore, the authors demonstrate that polyglutamine chains close to the minimum length of ~35 will have strict limitations on where the folds must occur in order to attain the required four beta strands.

      We are unsure how to interpret the above statements as a caveat. We agree that increasing sequence complexity will tend to increase homogeneity, but this is exactly the motivation of our approach. We explicitly set out to determine the minimal complexity sequence sufficient to specify the nucleating conformation, which we ultimately identified in terms of secondary and tertiary structure. We do not specify which parts of a long polyQ tract correspond to which parts of the structure, because, as the reviewer points out, they can occur at many places. Hence, depending on the length of the polyQ tract, the nucleus we describe may have any length of sequence connecting the strand elements. We do not think that the effects of N-residue placement can be interpreted as a confounding influence on hairpin position because the striking even-odd pattern we observe implicates the sides of beta strands rather than the lengths. Moreover, we observe this pattern regardless of the residue used (Gly, Ser, Ala, and His in addition to Asn).

      A novel result of this work is the observation of multiple concentration regimes in the nucleation rate. Specifically, they report a plateau-like regime at intermediate concentrations in which the nucleation rate is insensitive to protein concentration. The authors attribute this effect to the "self-poisoning" phenomenon observed in growth of some crystals. This is a valid comparison because the homogeneity observed in NMR and crystallography structures of mature fibrils resembles that of a one-dimensional crystal. Furthermore, the typical elongation rate of amyloid fibrils (on the order of one molecule per second) is many orders of magnitude slower than the molecular collision rate (by factors of 10^6 or more), implying that the search for the beta-sheet state is very slow. This slow conformational search implies the presence of deep kinetic traps that would be prone to poisoning phenomena. However, the observation of poisoning during nucleation is striking, particularly in consideration of the expected disorder and concentration sensitivity of the nucleus. Kandola et al.'s structural model of an ordered, intramolecular nucleus explains why the internal states responsible for poisoning are relevant in nucleation.

      We thank the reviewer for noting the novelty and plausibility of the self-poisoning connection. We would like to elaborate on our finding that self-poisoning inhibits nucleation (in addition to elongation), as this could prove confusing to some readers. While self-poisoning is claimed to inhibit primary nucleation in the polymer crystal literature (Ungar et al., 2005; Zhang et al., 2018), the semantics of “nucleation” in this context warrants clarification. Technically, the same structure can be considered a nucleus in one context but not in another. The Q zipper monomer, even if it is rate-limiting for amyloid formation at low concentrations (and is therefore the “nucleus”), is not necessarily rate-limiting when self-poisoned at high concentrations. Whether it comprises the nucleus in this case depends on the rates of Q zipper formation relative to subunit addition to the poisoned state. If the latter happens slower than Q zipper formation de novo, it can be said that self-poisoning inhibits nucleation, regardless of whether the Q zipper formed. We suspect this to be the mechanism by which preemptive oligomerization blocks nucleation in the case of polyQ, though other mechanisms may be possible.

      To achieve these results the authors used a novel approach involving a systematic series of simple sequences. This is significant because, while individual experiments showed seemingly random behavior, the randomness resolved into clear trends with the systematic approach. These trends provided clues to build a model and guide further experiments.

      Reviewer #3 (Public Review):

      Kandola et al. explore the important and difficult question regarding the initiating event that triggers (nucleates) amyloid fibril growth in glutamine-rich domains. The researchers use a fluorescence technique that they developed, DAmFRET, in a yeast system where they can manipulate the expression level over several orders of magnitude, and they can control the length of the polyglutamine domain as well as the insertion of interfering non-glutamine residues. Using flow cytometry, they can interrogate each of these yeast 'reactors' to test for self-assembly, as detected by FRET.

      In the introduction, the authors provide a fairly thorough yet succinct review of the relevant literature into the mechanisms of polyglutamine-mediated aggregation over the last two decades. The presentation as well as the illustrations in Figure 1A and 1B are difficult to understand, and unfortunately, there is no clear description of the experimental technique that would allow the reader to connect the hypothetical illustrations to the measurement outcomes. The authors do not explain what the FRET signal specifically indicates or what its intensity is correlated to. FRET measures distance between donor and acceptor, but can it be reliably taken as an indicator of a specific beta-sheet conformation and of amyloid? Does the signal increase with both nucleation and with elongation, and is the signal intensity the same if, e.g., there were 5 aggregates of 10 monomers each versus 50 monomeric nuclei? Is there a reason why the AmFRET signal intensity decreases at longer Q even though the number of cells with positive signal increases? Does the number of positive cells increase with time? The authors state later that 'non-amyloid containing cells lacked AmFRET altogether', but this seems to be a tautology - isn't the lack of AmFRET taken as a proof of lack of amyloid? Overall, a clearer description of the experimental method and what is actually measured (and validation of the quantitative interpretation of the FRET signal) would greatly assist the reader in understanding and interpreting the data.

      We believe the difficulty in understanding the illustrations in Figure 1A and 1B is inherent to the subject. We agree that elaborating how DAmFRET works would help the reader, and will add a few sentences to this end. Beyond this, we refer the reviewer and readers to our cited prior work describing the theory and interpretation of DAmFRET. Note that the y-axes of DAmFRET plots are not raw FRET but rather “AmFRET”, a ratio of FRET to total expression level. As explained thoroughly in our cited prior work, the discontinuity of AmFRET with expression level indicates that the high-AmFRET population formed via a disorder-to-order transition. When the query protein is predicted to be intrinsically disordered, the discontinuous transition to high AmFRET invariably (among hundreds of proteins tested in prior published and unpublished work) signifies amyloid formation as corroborated by SDD-AGE and tinctorial assays.
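
      As a rough illustration of the ratio described above, the sketch below computes a per-cell AmFRET value as FRET signal divided by total expression signal, and then the fraction of cells exceeding a cutoff. It is a simplified stand-in for the published DAmFRET pipeline rather than a reproduction of it: the function names, the tuple layout, and the cutoff value are all hypothetical.

```python
from typing import Iterable, Tuple


def amfret(fret_signal: float, expression_signal: float) -> float:
    """Ratio of FRET signal to total expression signal for one cell."""
    if expression_signal <= 0:
        raise ValueError("expression signal must be positive")
    return fret_signal / expression_signal


def fraction_positive(cells: Iterable[Tuple[float, float]], cutoff: float) -> float:
    """Fraction of cells whose AmFRET exceeds a (hypothetical) cutoff.

    Each cell is a (fret_signal, expression_signal) pair; every value is a
    cell-wide average, as in standard flow cytometry.
    """
    values = [amfret(fret, expr) for fret, expr in cells]
    if not values:
        raise ValueError("no cells supplied")
    return sum(value > cutoff for value in values) / len(values)


if __name__ == "__main__":
    # Made-up measurements in arbitrary units, for illustration only.
    example_cells = [(0.0, 100.0), (50.0, 100.0), (1.0, 100.0), (80.0, 100.0)]
    print(fraction_positive(example_cells, cutoff=0.25))
```

      Because the ratio normalizes by expression level, two cells with different amounts of protein but the same assembled fraction and geometry would be expected to give similar values in this simplified picture.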

      When performed using standard flow cytometry as in the present study, every AmFRET measurement corresponds to a cell-wide average, and hence does not directly inform on the distribution of the protein between different stoichiometric species. As there is only one fluorophore per protein molecule, monomeric nuclei have no signal. DAmFRET can distinguish cells expressing monomers from stable dimers from higher order oligomers (see e.g. Venkatesan et al. 2019), and we are therefore quite confident that AmFRET values of zero correspond to cells in which a vast majority of the respective protein is not in homo-oligomeric species (i.e. is monomeric or in hetero-complexes with endogenous proteins). The exact value of AmFRET, even for species with the same stoichiometry, will depend both on the effect of their respective geometries on the proximity of mEos3.1 fluorophores, and on the fraction of protein molecules in the species. Hence, we only attempt to interpret the plateau values of AmFRET (where the fraction of protein in an assembled state approaches unity) as directly informing on structure, as we did in Fig. S3A.

      We believe that AmFRET decreases with longer polyQ because the mass fraction of fluorophore decreases in the aggregate, simply because the extra polypeptide takes up volume in the aggregate.

      Yes, the fraction of positive cells in a discontinuous DAmFRET plot does increase with time. However, given the more laborious data collection and derivation of nucleation kinetics in a system with ongoing translation, especially across hundreds of experiments with other variables, ours is a snapshot measurement to approximately derive the relative contributions of intra- and intermolecular fluctuations to the nucleation barrier, rather than the barrier’s magnitude.
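
      The time dependence mentioned above can be pictured with a deliberately simple model. The sketch below assumes that every cell nucleates independently at a constant rate, so the fraction of positive cells follows 1 - exp(-rate × time). This ignores ongoing translation and changing concentration, which is exactly the complication noted in the text, and the rate value and function name are hypothetical.

```python
import math


def fraction_nucleated(rate_per_cell_per_hour: float, hours: float) -> float:
    """Expected fraction of cells that have nucleated by a given time.

    Assumes a constant, independent nucleation rate per cell (a toy model;
    real cells have ongoing translation and changing concentrations).
    """
    if rate_per_cell_per_hour < 0 or hours < 0:
        raise ValueError("rate and time must be non-negative")
    return 1.0 - math.exp(-rate_per_cell_per_hour * hours)


if __name__ == "__main__":
    # An assumed rate of 0.1 nucleation events per cell per hour.
    for t in (1, 5, 10, 20):
        print(f"{t:2d} h: {fraction_nucleated(0.1, t):.3f}")
```

      A single snapshot therefore samples one point on such a curve, which is sufficient for comparing variants measured at the same time but not for extracting the absolute barrier height.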

      We will revise the tautological statement by removing “non-amyloid containing”.

      The authors demonstrate that their assay shows that the fraction of cells with AmFRET signal increases strongly with an increase in polyQ length, with a 'threshold' around 50-60 glutamines. This roughly correlates with the Q-length dependence of disease. The experiments in which asparagine or other amino acids are inserted at variable positions in the glutamine repeat are creative and thorough, and the data along with the simulations provide compelling support for the proposed Q zipper model. The experiments shown in Figure 5 are strongly supportive of a model where formation of the beta-sheet nucleus is within a monomer. This is a potentially important result, as there are conflicting data in the literature as to whether the nucleus in polyQ is a monomer.

      We thank the reviewer for these comments. We wish to clarify one important point, however, concerning the correlation of our data with the pathological length threshold. As we state in the first results section, “Our data recapitulated the pathologic threshold -- Q lengths 35 and shorter lacked AmFRET, indicating a failure to aggregate or even appreciably oligomerize, while Q lengths 40 and longer did acquire AmFRET in a length and concentration-dependent manner”. Hence, most of our experiments were conducted with 60Q not because it resembles the pathological threshold, but rather because it was most convenient for DAmFRET experiments.

      I did not find the argument, that their data shows the Q zipper grows in two dimensions, compelling; there are more direct experimental methods to answer this question. I was also confused by the section that Q zippers poison themselves. It would be easier for the reader to follow if the authors first presented their results without interpretation. The data seem more consistent with an argument that, at high concentrations, non-structured polyQ oligomers form which interfere with elongation into structured amyloid assemblies - but such oligomers would not be zippers.

      Self-poisoning is a widely observed and heavily studied phenomenon in polymer crystal physics, though it seems not yet to have entered the lexicon of amyloid biologists. We were new to this concept before it emerged as an extremely parsimonious explanation for our results. As described in the text, two pieces of evidence exclude the alternative mechanism suggested by the reviewer -- that non-structured oligomers form and subsequently engage and inhibit the template. Specifically, 1) inhibition occurs without any detectable FRET, even at high total protein concentration, indicating the species do not form in a concentration-dependent manner that would be expected of disordered oligomers; and 2) inhibition itself has strict sequence requirements that match those of Q zippers. Hence our data collectively suggest that inhibition is a consequence of the deposition of partially ordered molecules onto the templating surface.

      Although some speculation or hypothesizing is perfectly appropriate in the discussion, overall the authors stretch this beyond what can be supported by the results. A couple of examples: The conclusion that toxicity arises from 'self-poisoned polymer crystals' is not warranted, as there is no relevant data presented in this manuscript. The authors refer to findings 'that kinetically arrested aggregates emerge from the same nucleating event responsible for amyloid formation', but I cannot recall any evidence for this statement in the results section.

      We restricted any mention of toxicity to the introduction and a section in the discussion that is not worded as conclusive. Nevertheless, we will soften the subheading and text of the relevant section in the discussion to more clearly indicate the speculative nature of the statements.

      We stand by our statement 'that kinetically arrested aggregates emerge from the same nucleating event responsible for amyloid formation', as this follows directly from self-poisoning.

      Bibliography

      Arseni D, Hasegawa M, Murzin AG, Kametani F, Arai M, Yoshida M, Ryskeldi-Falcon B. 2022. Structure of pathological TDP-43 filaments from ALS with FTLD. Nature 601:139–143. doi:10.1038/s41586-021-04199-3

      Bansal A, Schmidt M, Rennegarbe M, Haupt C, Liberta F, Stecher S, Puscalau-Girtu I, Biedermann A, Fändrich M. 2021. AA amyloid fibrils from diseased tissue are structurally different from in vitro formed SAA fibrils. Nat Commun 12:1013. doi:10.1038/s41467-021-21129-z

      Buell AK. 2017. The Nucleation of Protein Aggregates - From Crystals to Amyloid Fibrils. Int Rev Cell Mol Biol 329:187–226. doi:10.1016/bs.ircmb.2016.08.014

      Chakraborty D, Straub JE, Thirumalai D. 2023. Energy landscapes of Aβ monomers are sculpted in accordance with Ostwald’s rule of stages. Sci Adv 9:eadd6921. doi:10.1126/sciadv.add6921

      Crist B, Schultz JM. 2016. Polymer spherulites: A critical review. Prog Polym Sci 56:1–63. doi:10.1016/j.progpolymsci.2015.11.006

      De Yoreo JJ. 2022. Casting a bright light on Ostwald’s rule of stages. Proc Natl Acad Sci USA 119. doi:10.1073/pnas.2121661119

      Hong Y, Yuan S, Li Z, Ke Y, Nozaki K, Miyoshi T. 2015. Three-Dimensional Conformation of Folded Polymers in Single Crystals. Phys Rev Lett 115:168301. doi:10.1103/PhysRevLett.115.168301

      Keller A. 1957. A note on single crystals in polymers: Evidence for a folded chain configuration. Philosophical Magazine 2:1171–1175. doi:10.1080/14786435708242746

      Landgraf D, Okumus B, Chien P, Baker TA, Paulsson J. 2012. Segregation of molecules at cell division reveals native protein localization. Nat Methods 9:480–482. doi:10.1038/nmeth.1955

      Lauritzen JI, Hoffman JD. 1960. Theory of Formation of Polymer Crystals with Folded Chains in Dilute Solution. J Res Natl Bur Stand A Phys Chem 64A:73–102. doi:10.6028/jres.064A.007

      Navrotsky A. 2004. Energetic clues to pathways to biomineralization: precursors, clusters, and nanoparticles. Proc Natl Acad Sci USA 101:12096–12101. doi:10.1073/pnas.0404778101

      Ohhashi Y, Ito K, Toyama BH, Weissman JS, Tanaka M. 2010. Differences in prion strain conformations result from non-native interactions in a nucleus. Nat Chem Biol 6:225–230. doi:10.1038/nchembio.306

      Organ SJ, Ungar G, Keller A. 1989. Rate minimum in solution crystallization of long paraffins. Macromolecules 22:1995–2000. doi:10.1021/ma00194a078

      Radamaker L, Baur J, Huhn S, Haupt C, Hegenbart U, Schönland S, Bansal A, Schmidt M, Fändrich M. 2021. Cryo-EM reveals structural breaks in a patient-derived amyloid fibril from systemic AL amyloidosis. Nat Commun 12:875. doi:10.1038/s41467-021-21126-2

      Sahoo B, Singer D, Kodali R, Zuchner T, Wetzel R. 2014. Aggregation behavior of chemically synthesized, full-length huntingtin exon1. Biochemistry 53:3897–3907. doi:10.1021/bi500300c

      Schmelzer JWP, Abyzov AS. 2017. How do crystals nucleate and grow: Ostwald’s rule of stages and beyond. In: Šesták J, Hubík P, Mareš JJ, editors. Thermal Physics and Thermal Analysis, Hot Topics in Thermal Analysis and Calorimetry. Cham: Springer International Publishing. pp. 195–211. doi:10.1007/978-3-319-45899-1_9

      Schmidt M, Wiese S, Adak V, Engler J, Agarwal S, Fritz G, Westermark P, Zacharias M, Fändrich M. 2019. Cryo-EM structure of a transthyretin-derived amyloid fibril from a patient with hereditary ATTR amyloidosis. Nat Commun 10:5008. doi:10.1038/s41467-019-13038-z

      Schweighauser M, Shi Y, Tarutani A, Kametani F, Murzin AG, Ghetti B, Matsubara T, Tomita T, Ando T, Hasegawa K, Murayama S, Yoshida M, Hasegawa M, Scheres SHW, Goedert M. 2020. Structures of α-synuclein filaments from multiple system atrophy. Nature 585:464–469. doi:10.1038/s41586-020-2317-6

      Snapp EL, Hegde RS, Francolini M, Lombardo F, Colombo S, Pedrazzini E, Borgese N, Lippincott-Schwartz J. 2003. Formation of stacked ER cisternae by low affinity protein interactions. J Cell Biol 163:257–269. doi:10.1083/jcb.200306020

      Törnquist M, Michaels TCT, Sanagavarapu K, Yang X, Meisl G, Cohen SIA, Knowles TPJ, Linse S. 2018. Secondary nucleation in amyloid formation. Chem Commun 54:8667–8684. doi:10.1039/c8cc02204f

      Ungar G, Putra EGR, de Silva DSM, Shcherbina MA, Waddon AJ. 2005. The Effect of Self-Poisoning on Crystal Morphology and Growth Rates. In: Allegra G, editor. Interphases and Mesophases in Polymer Crystallization I, Advances in Polymer Science. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 45–87. doi:10.1007/b107232

      Vetri V, Foderà V. 2015. The route to protein aggregate superstructures: Particulates and amyloid-like spherulites. FEBS Lett 589:2448–2463. doi:10.1016/j.febslet.2015.07.006

      Wild EJ, Boggio R, Langbehn D, Robertson N, Haider S, Miller JRC, Zetterberg H, Leavitt BR, Kuhn R, Tabrizi SJ, Macdonald D, Weiss A. 2015. Quantification of mutant huntingtin protein in cerebrospinal fluid from Huntington’s disease patients. The Journal of Clinical Investigation.

      Yang Y, Arseni D, Zhang W, Huang M, Lövestam S, Schweighauser M, Kotecha A, Murzin AG, Peak-Chew SY, Macdonald J, Lavenir I, Garringer HJ, Gelpi E, Newell KL, Kovacs GG, Vidal R, Ghetti B, Ryskeldi-Falcon B, Scheres SHW, Goedert M. 2022. Cryo-EM structures of amyloid-β 42 filaments from human brains. Science 375:167–172. doi:10.1126/science.abm7285

      Zhang X, Zhang W, Wagener KB, Boz E, Alamo RG. 2018. Effect of Self-Poisoning on Crystallization Kinetics of Dimorphic Precision Polyethylenes with Bromine. Macromolecules 51:1386–1397. doi:10.1021/acs.macromol.7b02745

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer 1:

      (1) In Figure 1, it is curious that the authors only chose E. coli and Staphylococcus sciuri to test the induction of Chi3l1. What about other bacteria? Why does only E. coli but not Staphylococcus sciuri induce Chi3l1 production? It does not prove that the gut microbiome induces the expression of Chi3l1. If it is the effect of LPS, does it trigger a cell death response or inflammatory responses that are known to induce Chi3l1 production? What is the role of peptidoglycan in this experiment? Also, it is recommended to change WT to SPF in the figure and text, as no genetic manipulation was involved in this figure.

      Thank you for your valuable feedback and insightful suggestions. In our study, we tried to identify bacteria from murine gut contents and feces using 16S sequencing. However, only E. coli and Staphylococcus sciuri were identified (Figure 1D). Consequently, our experiments were limited to these two bacterial strains. While we have not tested other bacteria, our data suggest that not all bacteria can induce the expression of Chi3l1. Given that E. coli is Gram-negative and Staphylococcus sciuri is Gram-positive, we hypothesized that the difference in their ability to induce Chi3l1 expression might be due to variations between Gram-negative and Gram-positive bacteria, such as the presence of lipopolysaccharides (LPS).

      To test this hypothesis, we used LPS to induce Chi3l1 expression. Consistent with our hypothesis, LPS successfully induced Chi3l1 expression (Figure 1F&G). Additionally, we observed that Chi3l1 expression is significantly upregulated in specific pathogen-free (SPF) mice compared to germ-free mice (Figure 1A), demonstrating that the gut microbiome induces the expression of Chi3l1.

      Although we have not examined cell death or inflammatory responses, the protective role of Chi3l1 shown in Figure 5 suggests that any such responses would be mild and negligible. Regarding the role of peptidoglycan in the induction of Chi3l1 expression in DLD-1 cells, we have not yet explored this aspect. However, we agree with your suggestion that it would be worthwhile to investigate this in future experiments.

      We have also made the suggested modifications to the labeling (Figure 1A) and the clarification in the revised manuscript accordingly (page 3, Line 95-96; Line 102-106).

      Thank you again for your constructive feedback.

      (2) In Figure 2, the binding between Chi3l1 and PGN needs better characterization, regarding the affinity and how it compares with the binding between Chi3l1 and chitin. More importantly, it is unclear how this interaction could facilitate the colonization of gram-positive bacteria.

      Thank you for your insightful suggestions. We have performed the suggested experiments and included the results in the revised manuscript (Figure 2E-G, page 3-4, Line 132-146).

      Our results indicate that Chi3l1 interacts with PGN in a dose-dependent manner (Figure 2E). In contrast, the binding between Chi3l1 and chitin did not exhibit dose dependency (Figure 2E). These findings suggest a specific and distinct binding mechanism for Chi3l1 with PGN compared to chitin.

      We conducted DLD-1 cell-bacteria adhesion experiments, using GlmM mutant (PGN synthesis mutant) and K12 (wild-type) bacteria to test their adhesion capabilities. The results showed that the adhesion of the GlmM mutant to cells was significantly decreased (Figure 2F). Additionally, after knocking down Chi3l1 in DLD-1 cells, we observed decreased bacterial adhesion (Figure 2G). These findings suggest that the interaction between Chi3l1 and PGN plays a crucial role in bacterial adhesion.

      (3) In Figure 3, the abundance of Firmicutes and other Gram-positive species is lower in the knockout mice. What is the rationale for choosing Lactobacillus in the following transfer experiments?

      We appreciate your thorough review. Among the Gram-positive bacteria that we have sequenced and analyzed, Lactobacillus occupies the largest proportion. Given the significant presence and established benefits of Lactobacillus, we chose it for the subsequent transfer experiments to leverage its known properties and availability, thereby ensuring the robustness and reproducibility of our findings. This choice is supported by the study referenced below.

      Lamas B, Richard ML, Leducq V, Pham HP, Michel ML, Da Costa G, Bridonneau C, Jegou S, Hoffmann TW, Natividad JM, Brot L, Taleb S, Couturier-Maillard A, Nion-Larmurier I, Merabtene F, Seksik P, Bourrier A, Cosnes J, Ryffel B, Beaugerie L, Launay JM, Langella P, Xavier RJ, Sokol H. CARD9 impacts colitis by altering gut microbiota metabolism of tryptophan into aryl hydrocarbon receptor ligands. Nat Med. 2016 Jun;22(6):598-605. doi: 10.1038/nm.4102. Epub 2016 May 9. PMID: 27158904; PMCID: PMC5087285.

      (4) FDAA-labeled E. faecalis colonization is decreased in the knockouts. Is it specific for E. faecalis, or it is generally true for all gram-positive bacteria? What about the colonization of gram-negative bacteria?

      Thank you for your insightful suggestions. We have investigated the colonization of a Gram-negative bacterium, OP50-mCherry (a strain of E. coli that expresses mCherry), and included the results in the updated manuscript (Supplementary Figure 3B, page 5, Line 197-200). We performed rectal injection of both wild-type and Chil1-/- mice with OP50-mCherry, and found that Chil1-/- mice had much higher colonization of E. coli compared to wild-type mice.

      (5) In Figure 5, the fact that FMT did not completely rescue the phenotype may point to the role of host cells in the processes. The reason that lactobacillus transfer did completely rescue the phenotypes could be due to the overwhelming protective role of lactobacillus itself, as the experiments were missing villin-cre mice transferred with lactobacillus.

      Thank you for your valuable feedback and thorough review. In our study, pretreatment with antibiotics in mice to eliminate gut microbiota demonstrated that IEC∆Chil1 mice exhibited a milder colitis phenotype (Supplementary Figure 4). This suggests that Chi3l1-expressing host cells are likely to play a detrimental role in colitis. Consequently, the failure of FMT to completely rescue the phenotype is likely due to the incomplete preservation of bacteria in the feces during the transfer experiment.

      We agree with your assessment of the protective role of lactobacillus. This also explains the significant difference in colitis phenotype between Villin-cre and IEC∆Chil1 mice (Figure 5B-E), as lactobacillus levels are significantly lower in IEC∆Chil1 mice (Figure 4F). Given the severity of colitis in Villin-cre mice at 7 days post-DSS, even if lactobacillus were transferred back to these mice, it is unlikely to result in a significant improvement.

      (6) Conflicting literature demonstrating the detrimental roles of Chi3l1 in mouse IBD model needs to be acknowledged and discussed.

      Thank you for your insightful suggestions and we have included additional discussions in the revised manuscript (page 6-7, Line 258-274).

      Reviewer #2 (Public Review):

      (1) Images are of great quality but lack proper quantification and statistical analysis. Statements such as "substantial increase of Chi3l1 expression in SPF mice" (Fig.1A), "reduced levels of Firmicutes in the colon lumen of IEC ∆ Chil1" (Fig.3F), "Chil1-/- had much lower colonization of E.faecalis" (Fig.4G), or "deletion of Chi3l1 significantly reduced mucus layer thickness" (Supplemental Figure 3A-B) are subjective. Since many conclusions were based on imaging data, the authors must provide reliable measures for comparison between conditions, as long as possible, such as fluorescence intensity, area, density, etc, as well as plots and statistical analysis.

      Thank you for your insightful suggestions. We have performed the suggested statistical analysis on most of the figures and included the analysis in the revised manuscript (Figure 1A, Figure 3E&F, Supplementary Figure 3B&C). Given the large quantity of dietary fiber intertwined with bacteria, it is challenging to make a reliable quantification of bacteria in Figure 4G. However, it is easy to distinguish bacteria from dietary fiber under the microscope. We have exclusively analyzed gut sections from six mice in each group, and the results are consistent between the two groups.

      (2) In the fecal/Lactobacillus transplantation experiments, oral gavage of Lactobacillus to IECChil1 mice ameliorated the colitis phenotype, by preventing colon length reduction, weight loss, and colon inflammation. These findings seem to go against the notion that Chi3l1 is necessary for the colonization of Lactobacillus in the intestinal mucosa. The authors could speculate on how Lactobacillus administration is still beneficial in the absence of Chi3l1. Perhaps, additional data showing the localization of the orally administered bacteria in the gut of Chi3l1 deficient mice would clarify whether Lactobacillus are more successfully colonizing other regions of the gut, but not the mucus layer. Alternatively, later time points of 2% DSS challenge, after Lactobacillus transplantation, would suggest whether the gut colonization by Lactobacillus and therefore the milder colitis phenotype, is sustained for longer periods in the absence of Chi3l1.

      Thank you for your thorough review and insightful suggestions. Since we pretreated mice with antibiotics, the intestinal mucus layer is likely damaged according to a previous study (PMID: 37097253). Therefore, gavaged Lactobacillus cannot colonize in the mucus layer. Moreover, existing studies have shown that the protective effect of Lactobacillus is mainly derived from its metabolites or thallus components, rather than the living bacteria itself (PMID: 36419205, PMID: 27516254).

      Zhan M, Liang X, Chen J, Yang X, Han Y, Zhao C, Xiao J, Cao Y, Xiao H, Song M. Dietary 5-demethylnobiletin prevents antibiotic-associated dysbiosis of gut microbiota and damage to the colonic barrier. Food Funct. 2023 May 11;14(9):4414-4429. doi: 10.1039/d3fo00516j. PMID: 37097253.

      Montgomery TL, Eckstrom K, Lile KH, Caldwell S, Heney ER, Lahue KG, D'Alessandro A, Wargo MJ, Krementsov DN. Lactobacillus reuteri tryptophan metabolism promotes host susceptibility to CNS autoimmunity. Microbiome. 2022 Nov 23;10(1):198. doi: 10.1186/s40168-022-01408-7. PMID: 36419205.

      Piermaría J, Bengoechea C, Abraham AG, Guerrero A. Shear and extensional properties of kefiran. Carbohydr Polym. 2016 Nov 5;152:97-104. doi: 10.1016/j.carbpol.2016.06.067. Epub 2016 Jun 23. PMID: 27516254.

      Reviewer #3 (Public Review):

      The claim that mucus-associated Ch3l1 controls colonization of beneficial Gram-positive species within the mucus is not conclusive. The study should take into account recent discoveries on the nature of mucus in the colon, namely its mobile fecal association and complex structure based on two distinct mucus barrier layers coming from proximal and distal parts of the colon (PMID: ). This impacts the interpretation of how and where Ch3l1 is expressed and gets into the mucus to promote colonization. It also impacts their conclusions because the authors compare fecal vs. tissue mucus, but most of the mucus would be attached to the feces. Of the mucus that was claimed to be isolated from the WT and IEC Ch3l1 KO, this was not biochemically verified. Such verification (e.g. through Western blot) would increase confidence in the data presented. Further, the study relies upon relative microbial profiling, which can mask absolute numbers, making the claim of reduced overall Gram-positive species in mice lacking Ch3l1 unproven. It would be beneficial to show more quantitative approaches (e.g. Quantitative Microbial Profiling, QMP) to provide more definitive conclusions on the impact of Ch3l1 loss on Gram+ microbes.

      You raise an excellent point about the data interpretation, and we appreciate your insightful suggestions. We have included the discussion regarding the recent discoveries in the revised manuscript (page 7-8, Line 304-312). According to the recent discovery, the mucus in the proximal colon forms a primary encapsulation barrier around fecal material, while the mucus in the distal colon forms a secondary barrier. Our findings indicate that Chi3l1 is expressed throughout the entire colon, including the proximal, middle, and distal sections (see Author response image 1 below; note that the Chi3l1 detection in colon presented in the manuscript is from the middle section). This suggests that Chi3l1 likely promotes bacterial colonization across the entire colon. Despite most mucus being expelled with feces, the constant production of mucus and the minimal presence of Chi3l1 in feces (Figure 4C) indicate that Chi3l1 continuously plays a role in promoting the colonization of microbiota.

      Author response image 1.

      Chi3l1 is expressed in the proximal and distal colon. Immunofluorescence staining on proximal and distal colon sections to detect Chi3l1 (red) expression. Nuclei were detected with DAPI (blue). Scale bars, 50 μm.

      Given the isolation method of the mucus layer, we followed the paper titled "The Antibacterial Lectin RegIIIγ Promotes the Spatial Segregation of Microbiota and Host in the Intestine" (PMID: 21998396). Although we did not find a suitable marker representative of the mucus layer for western blotting, we performed protein mass spectrometry on the isolated mucus layers and analyzed the data by comparing it with established research ("Proteomic Analyses of the Two Mucus Layers of the Colon Barrier Reveal That Their Main Component, the Muc2 Mucin, Is Strongly Bound to the Fcgbp Protein," PMID: 19432394). Our data showed a high degree of overlap with the proteins identified in established studies (see Author response image 2 below).

      Author response image 2.

      Comparison of mucus layer proteins identified by mass spectrometry between our team and the Hansson team. Mucus layer proteins identified by mass spectrometry by our team and by the Hansson team (PMID: 19432394) are compared.

      Due to a lack of expertise, it has been challenging for us to perform reliable QMP experiments. However, since QMP involves qPCR combined with bacterial sequencing, we conducted 16S rRNA sequencing and confirmed the quantity of certain bacteria by qPCR (revised manuscript, Figure 3B, H, Figure 4E, F, Supplementary Figure 3A). Therefore, our data is reliable to some extent.

      Other weaknesses lie in the execution of the aims, leaving many claims incompletely substantiated. For example, much of the imaging data is challenging for the reader to interpret due to it being unfocused, too low of magnification, not including the correct control, and not comparing the same regions of tissues among different in vivo study groups. Statistical rigor could be better demonstrated, particularly when making claims based on imaging data. These are often presented as single images without any statistics (i.e. analysis of multiple images and biological replicates). These images include the LTA signal differences, FISH images, Enterococcus colonization, and mucus thickness.

      Thank you for your thorough review and insightful suggestions. We have performed the recommended statistical analysis on most of the figures and included the analysis in the revised manuscript (Figure 1A, Figure 3E&F, Supplementary Figure 3B&C). We have also added arrows in Figure 2B to make the figure easier to understand. Additionally, we repeated some key experiments to show the same regions of tissues among different groups. We will upload higher resolution figures during the revision. Thank you again for your constructive feedback.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      It is recommended to change WT to SPF in the figure and text, as no genetic manipulation was involved in Figure 1.

      Thank you for your insightful suggestion. We have also made the suggested modifications to the labeling (revised manuscript, Figure 1A).

      Reviewer #2 (Recommendations For The Authors):

      The manuscript is well-written, but it would benefit from a critical reading to correct some typos and small grammar issues. Histological and IF images would be more informative if they contained arrows and labels guiding the reader's attention to what the authors want to show. More details about the structures shown in the figures should be included in the legends.

      Thank you for your thorough review and insightful suggestions. We have revised the manuscript to correct noticeable typos and grammar issues. Arrows have been added to Figure 2A&B to make the figures easier to understand. Additionally, we have included a detailed description of the structural similarities and differences between chitin and peptidoglycan in the figure legend (revised manuscript, page 19, line 730-733).

      Minor points:

      • Page 1, line 36: Please correct "mice models" to "mouse models".

      Thank you for your insightful suggestion and we have made the suggested correction in the revised manuscript (page 1, line 41).

      • Page 3, line 110: "by comparing the structure of chitin with that of peptidoglycan (PGN), a component of bacterial cells walls, we observed that they have similar structures (Fig.2A)". Although both structures are shown side-by-side, no similarities are mentioned or highlighted in the text, figure, or legend.

      Thank you for your insightful suggestion and we have included a detailed description of the structural similarities and differences between chitin and peptidoglycan in the figure legend (revised manuscript, page 19, line 730-733).

      • Fig.5C and Fig.5G: y axis brings "weight (%)". I believe the authors mean "weight change (%)"?

      We agree with your suggestion and have corrected the labeling accordingly (revised manuscript, Figure 5C and G).

      • Page 8: Genotyping method is described as a protocol. Please modify it.

      Thank you for your constructive suggestion. We have modified the genotyping method in the revised manuscript (page 8, line 339-349).

      • Please expand on the term "scaffold model" used in the abstract and discussion.

      Thank you for your thorough review. In this model, Chi3l1 acts as a key component of the scaffold. By binding to bacterial cell wall components like peptidoglycan, Chi3l1 helps anchor and organize bacteria within the mucus layer. This interaction facilitates the colonization of beneficial bacteria such as Lactobacillus, which are important for gut health. We have included a fuller description of the scaffold model in the revised manuscript (page 6, line 248-250).

      • Discussion session often recapitulates results description, which makes the text repetitive.

      Thank you for your constructive suggestion. We have removed unnecessary descriptions of results from the Discussion section of the revised manuscript.

      Reviewer #3 (Recommendations For The Authors):

      Major comments

      (1) Figure 1A. The staining is very faint, and hard to see. The reader cannot be certain those are Ch311-positive cells. Higher Mag is needed.

      Thank you for your insightful suggestion. We have included higher-resolution images in the revised manuscript (Figure 1A).

      (2) The mucus is produced largely by the proximal colon, is adherent to the feces, and mobile with the feces (PMID: 33093110). Therefore it is important to determine where the Ch311 is being expressed to be released into the lumen. Further Ch3l1 expression studies are needed to be done in both proximal and distal colon.

      Thank you for your thorough review and insightful suggestions. We have addressed this part in our public review. Additionally, we agree with your suggestions and will conduct further studies on Chi3l1 expression in both the proximal and distal colon.

      (3) Figure 1B. The image is out of focus for the Ileum, and the DAPI signal needs to be brought up for the colon. Which part of the colon is this? The UEA1+ cells do not really look like goblet cells. A better image with clearer goblet cells is needed.

      Thank you for your constructive suggestions. In the revised manuscript, we have included higher-resolution images (Figure 1B). The middle colon (approximately 3 to 4 cm distal from the cecum) was harvested for staining. In addition to UEA-1, we utilized anti-MUC2 antibody to label goblet cells in this colon segment (see Author response image 3 below). The patterns of goblet cells identified by UEA-1 or MUC2 antibodies are similar. The UEA-1-positive cells shown in Figure 1B are presumed to be goblet cells.

      Author response image 3.

      Goblet Cell Distribution in the Middle Colon. Goblet cells in the middle segment of the colon (approximately 3 to 4 cm distal from the cecum) were detected using immunofluorescence with antibodies against UEA-1 (green) and MUC2 (red). Scale bar=50μm. Representative images are shown from three mice individually stained for each antibody.

      (4) Figure 1G. There needs to be some counterstain or contrast imaging to show evidence that cells are present in the untreated sample.

      Thank you for your insightful suggestions. We have annotated the cells present in the untreated sample based on the overexposure in the revised manuscript (Figure 1G).

      (5) Figure 3B. Is this absolute quantification? How were the data normalized to allow comparison of microbial loads?

      Thank you for your thorough review. Figure 3B presents absolute quantification data based on the methodology described in the paper titled "The Antibacterial Lectin RegIIIγ Promotes the Spatial Segregation of Microbiota and Host in the Intestine" (PMID: 21998396). Briefly, we amplified a short segment (179 bp) of the 16S rRNA gene using conserved 16S rRNA-specific primers and OP50 (a strain of E. coli) as the template. After gel extraction and concentration measurement, the PCR products were diluted to gradient concentrations (0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48 pg/µl). These gradient concentrations were used as templates for qPCR to generate a standard curve based on Ct values and bacterial concentration. The standard curve is used to calculate bacterial concentration in the samples. The data presented in Figure 3B represent the weight of bacteria/milligram sample, calculated as (bacterial concentration x bacterial volume) / (weight of feces or gut content).
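      The standard-curve calculation described above can be illustrated with a minimal sketch. It is not the authors' analysis code: the dilution series is the one quoted in the response, but the Ct values are synthetic (generated from a hypothetical slope and intercept), the function names are illustrative, and "volume" is assumed to mean the volume of the sample in which the concentration was measured.

```python
import math

# Dilution series quoted in the response (pg/µl).
STANDARDS_PG_PER_UL = [0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48]


def fit_standard_curve(concentrations, ct_values):
    """Least-squares fit of Ct = slope * ln(concentration) + intercept."""
    xs = [math.log(c) for c in concentrations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_ct(ct, slope, intercept):
    """Invert the standard curve to recover a concentration in pg/µl."""
    return math.exp((ct - intercept) / slope)


def bacteria_per_mg(concentration_pg_per_ul, volume_ul, sample_weight_mg):
    """(concentration x volume) / sample weight, giving pg per mg of sample."""
    return concentration_pg_per_ul * volume_ul / sample_weight_mg


# Synthetic Ct values from a made-up curve, for illustration only;
# real Ct values would come from the qPCR run on the standards.
HYPOTHETICAL_SLOPE, HYPOTHETICAL_INTERCEPT = -1.5, 14.0
synthetic_ct = [
    HYPOTHETICAL_SLOPE * math.log(c) + HYPOTHETICAL_INTERCEPT
    for c in STANDARDS_PG_PER_UL
]
slope, intercept = fit_standard_curve(STANDARDS_PG_PER_UL, synthetic_ct)
```

      Fitting Ct against the natural logarithm of concentration matches the logarithmic form of a qPCR standard curve; with measured Ct values for the eight standards, the same two functions would give the curve and the per-sample concentrations.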

      (6) Figure 3D. The major case is made for a dramatic reduction in Gram+ species, but Figure 1D does not show a dramatic change. Is this difference significant?

      Thank you for your thorough review. We are not certain that we have understood your question correctly. There was no significant difference in Figure 3D. The conclusion that Gram+ species are dramatically reduced is based on the LTA staining, Firmicutes FISH, individual species comparisons between WT and KO mice, and bacterial qPCR results taken together (Figure 3E-H).

      (7) Figures 3E and 3F. These stainings are alone not convincing of reduced Gram+ in the KOs. Some stats are required for these images. An independent complementary method is also needed to quantify these with statistics since this data is so central to the study's conclusions.

      Thank you for your constructive suggestions. We have included statistical analysis in the revised manuscript (Figure 3E and F). Given the large quantity of dietary fiber intertwined with bacteria, it is challenging to make a reliable quantification of bacteria in Figure 3E. However, it is easy to distinguish bacteria from dietary fiber under the microscope. We have exclusively analyzed gut sections from six mice in each group, and the results are consistent with the Firmicutes FISH results. A complementary method, bacterial qPCR, has been employed to quantify these differences (Figure 4E, F). Due to a lack of expertise, it has been challenging for us to perform reliable QMP experiments.

      (8) Figure 3G. To make quantitative conclusions, the authors need to do quantitative microbial profiling (QMP) of the microbiota. Relative abundance masks absolute numbers, which could be increased. There are qPCR-based QMP platforms the authors could use (PMIDs: 31940382, 33763385).

      Thank you for your constructive suggestions. Due to a lack of expertise, it has been challenging for us to perform reliable QMP experiments. However, since QMP involves qPCR combined with bacterial sequencing, we conducted 16S rRNA sequencing and confirmed the quantity of certain bacteria by qPCR (revised manuscript, Figure 3B, H, Figure 4E, F, Supplementary Figure 3A). In addition to the original bacterial qPCR data presented in the manuscript, we included another bacterial genus, Turicibacter. Consistent with the 16S rRNA sequencing analysis data, qPCR results showed that Turicibacter was more abundant in IECΔChil1 mice than in Villin-cre mice (revised manuscript, Supplementary Figure 3A, page 4, line 171-173). Therefore, our data is reliable to some extent.

      (9) Figure 4B. The data nicely shows Ch3l1 in mucus. However, no data supports the authors' main claim Ch3h1 binds Gram-positive bacteria in situ. Dual staining of Ch3l1 with Firmicutes probe would be supportive to show this interaction is happening in vivo.

      You raise an excellent point, and we agree with your suggestion that we should confirm Chi3l1 binding to Gram-positive bacteria in situ. During the study, we attempted dual staining of Chi3l1 with a universal bacterial 16S FISH probe several times, but we were unsuccessful. Despite various optimizations of the protocol, we were only able to detect bacteria, not Chi3l1. It appears that the antibody is not suitable for this method.

      (10) Figures 4D - F. Because mucus is associated with feces (PMID: ), the data with feces likely contains both Muc2/mucus and Feces. Therefore, it is unclear what the "mucus" is referring to in these figures. To support the authors' conclusions, there needs to be some validation that mucus was purified in the assays. This must be confirmed at a minimum by PAS staining on SDS PAGE gel (should be very high molecular weight) or Western blot with UEA lectin.

      Thank you for your insightful suggestions. As mentioned in the public review, the mucus layer was isolated following the protocol described in the paper titled "The Antibacterial Lectin RegIIIγ Promotes the Spatial Segregation of Microbiota and Host in the Intestine" (PMID: 21998396). Briefly, after harvesting the middle colon from the mice, we cut open the colon longitudinally. After removing the gut contents, the lumen was vigorously rinsed in PBS while holding one end with forceps. The pellet obtained after centrifuging the rinsate was used as our mucus sample. Fresh feces were collected immediately after the mice defecated in a new, empty cage. We performed Western blot analysis to detect UEA lectin but were unsuccessful.

      However, as noted in the public review, we conducted protein mass spectrometry on the isolated mucus layers and analyzed the data by comparing it with established research ("Proteomic Analyses of the Two Mucus Layers of the Colon Barrier Reveal That Their Main Component, the Muc2 Mucin, Is Strongly Bound to the Fcgbp Protein," PMID: 19432394). Our data showed a high degree of overlap with the proteins identified in these established studies.

      (11) Figure 4E/F: The units of measurement are in pg/cm2, implying picogram per area. Can the authors please explain what this unit is referring to?

      We are grateful for your thorough review. The unit pg/cm² represents picograms per square centimeter. Figures 4E and 4F present absolute quantification data based on the methodology described in the paper titled "The Antibacterial Lectin RegIIIγ Promotes the Spatial Segregation of Microbiota and Host in the Intestine" (PMID: 21998396). Briefly, we harvested a 3 × 0.5 cm section of colon and a 9 × 0.4 cm section of ileum. We then collected the mucus layer as previously described (response to question 10). We measured bacterial concentration as described in the response to question 5 using the equation y = -1.53 ln(x) + 13.581, where x represents the bacterial concentration and y represents the Ct value. After obtaining the bacterial concentration, we multiplied it by the volume of the rinsate and divided it by the area to obtain the pg/cm² values used in the figures.
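      As a worked illustration of the unit conversion described above, the following sketch inverts the quoted equation and normalizes by tissue area. It is a minimal example under stated assumptions, not the authors' code: the concentration is assumed to be in pg/µl (as in the dilution series of response 5), the rinsate volume is a free parameter that is not given in the response, and the function names are hypothetical.

```python
import math

# Coefficients of the standard curve quoted in the response:
# Ct = SLOPE * ln(concentration) + INTERCEPT
SLOPE = -1.53
INTERCEPT = 13.581

# Tissue areas from the quoted section dimensions (cm x cm).
COLON_AREA_CM2 = 3 * 0.5
ILEUM_AREA_CM2 = 9 * 0.4


def ct_to_concentration(ct):
    """Invert the quoted curve; result assumed to be in pg/µl."""
    return math.exp((ct - INTERCEPT) / SLOPE)


def bacterial_load_pg_per_cm2(ct, rinsate_volume_ul, area_cm2):
    """Concentration x rinsate volume / tissue area, in pg/cm^2."""
    return ct_to_concentration(ct) * rinsate_volume_ul / area_cm2
```

      For example, `bacterial_load_pg_per_cm2(ct, rinsate_volume_ul, COLON_AREA_CM2)` would give the colon value for a measured Ct and a known rinsate volume; a lower Ct corresponds to a higher load because the slope of the curve is negative.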

      (12) Figure 5E. Normal tissues appear to be from different colon regions from colitis tissues: the "Normal" looks like the proximal colon, while "Colitis" looks like the Distal colon. They cannot be directly compared.

      Thank you for your insightful suggestion. We have now included the updated image in the revised manuscript as Figure 5E to compare the same region of the colons.

      (13) Similarly, in Figure 5I it appears different colon regions are being compared between groups: Proximal colon in the bottom panels, and distal in the top panels. Since the proximal colon is less damaged by DSS, this data could be misleading.

      Thank you for your insightful suggestion. We have now included the updated image in the revised manuscript as Figure 5I to compare the same region of the colons.

      (14) In the DSS studies, are the VillinCre and IEC Chit3l1 mice co-housed littermates?

      Thank you for your insightful suggestion. In the DSS studies, the Villin-Cre and IECΔChil1 mice are not co-housed littermates. However, they are derived from the same lineage and are housed in the same rack within the same room of the animal facility.

      (15) Supplementary Figure 3: Mucus thickness images; are they representative? Stats are needed on multiple mice to support the claim that the mucus is thinner.

      Thank you for your insightful suggestion. The images are representative of 4 mice per group. We have now included the statistical analysis in Supplementary Figure 3C and 3D of the revised manuscript.

      Minor

      (1) Introduction: Reference to "mucosal layer": "Mucosal" and "Mucus" are different things. "Mucosal" refers to the epithelium, lamina propria, and muscularis mucosa. "Mucus" refers to the secreted mucus gel, the focus of the authors' study. Therefore, the statement "mucosal layer" is not proper. "Mucosal layer" should be changed to "mucus layer."

      Thank you for your constructive suggestion; we have learned a lot from it. We have replaced “mucosal layer” with “mucus layer” in the revised manuscript.

      (2) Line 366 and related lines: Feces cannot be "dissolved". "Resuspended" is a better term.

      Thank you for your constructive suggestion. We have changed “dissolved” to “resuspended” in the revised manuscript.

      (3) Lines 36-37 and 43-44 are redundant to each other.

      Thank you for your constructive suggestion. We have removed lines 36-37 in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1:

      Summary:

      The authors study age-related changes in the excitability and firing properties of sympathetic neurons, which they ascribe to age-related changes in the expression of KCNQ (Kv7, "M-type") K+ currents in rodent sympathetic neurons, whose regulation by GPCRs has been most thoroughly studied for over 40 years.

      Strengths:

      The strengths include the rigor of the current-clamp and voltage-clamp experiments and the lovely, crisp presentation of the data. The separation of neurons into tonic, phasic and adapting classes is also interesting, and informative. The ability to successfully isolate and dissociate peripheral ganglia from such older animals is also quite rare and commendable! There is much useful detail here.

      Thank you for recognizing the effort we put into presenting the data and analyzing the neuronal populations. I also believe the ability to isolate neurons from old animals is worth communicating to the scientific community.

      Weaknesses:

      Where the manuscript becomes less compelling is in the rapamycin section, which does not provide much in the way of mechanistic insights. As such, the effect is more of an epi-phenomenon of unclear insight, and the authors cannot ascribe a signaling mechanism to it that is supported by data. Thus, this latter part rather undermines the overall impact and central advance of the manuscript. The problem is exacerbated by the controversial and anecdotal nature of the entire mTor/aging field, some of whose findings have very unfortunately had to be recently retracted.

      I would strongly recommend to the authors that they end the manuscript with their analysis of the role of M current/KCNQ channels in the numerous age-related changes in sympathetic neuron function that they elegantly report, and save the rapamycin, and possible mTor action, for a separate line of inquiry that the authors could develop in a more thorough and scholarly way.

      Whereas the description of the data are very nice and useful, the manuscript does not provide much in the way of mechanistic insights. As such, the effect is more of an epi-phenomenon of unclear insight, and the authors cannot ascribe changes in signaling mechanisms, such as that of M1 mAChRs to the phenomena that is supported by data.

      I appreciate the new comment. We had agreed that our rapamycin experiments did not allow us to ascribe the mechanism to the mTOR signaling pathway. The new comment mentions M1 mAChR signaling as another potential signaling mechanism. Our work centered on determining whether aging altered the function of sympathetic motor neurons and defining the mechanism. We presented evidence showing that the mechanism is a reduction of the M-current. We did not attempt to identify the signaling mechanism linking aging to a reduction in M-current. Therefore, we agree with the reviewer that we do not provide further details on the mechanism and that this remains an open question. However, I find it harsh to say that “the effect is more of an epiphenomenon of unclear insight”. How could we possibly test whether the effect of aging on the excitability of these neurons arises only as a secondary effect or is not causal? How could we test for sufficiency and necessity of aging? How could we modify the state of aging to test for causality? We would have to reverse aging and show that the effect on the excitability is gone. And that is exactly what we tried to do with the rapamycin experiment.

      Reviewer #1 (Recommendations For The Authors):

      (1) The significance values greater than p < 0.05 do not add anything and distract focus from the results that are meaningful. Fig. 5 is a good example. What does p = 0.7 mean? Or p = 0.6? Does this help the reader with useful information?

      I thank Reviewer 1 for raising this question. We have attempted different versions of how we report p values, as we want to make sure to address rigor and transparency in reporting data. As corresponding author, I favor reporting p values for all statistical comparisons. To help the reader identify what we considered statistically significant, we color-coded the p values, with red for p-value<0.05 and black for p-value>0.05. As a reader, seeing a p-value=0.7 allows me to know that the authors performed an analysis comparing these conditions and found the means not to be different. Not presenting the p-value makes me wonder whether the authors even analyzed those groups. In other words, I value the ability to analyze the data with all p-values visible more than I value not being distracted by non-significant p-values. This is just my preference.

      (2) Fig. 1 is not informative and should be removed.

      I thank Reviewer 1 for the suggestion. In previous drafts of the manuscript, this figure was included only as a panel. However, we decided it was better to guide the reader into the scope of our work. This is part of our scientific style and, therefore, we prefer to keep the figure.

      (3) The emphasis on a particular muscarinic agonist favored by many ion channel physiologists, oxotremorine, is not meaningful (lines 192, 198). The important point is stimulation of muscarinic AChRs, which physiologically are stimulated by acetylcholine. The particular muscarinic agonist used is unimportant. Unless mandated by eLife, "cholinergic type 1 muscarinic receptors" are usually referred to as M1 mAChRs, or even better is "Gq-coupled M1 mAChRs." I don't think that Kruse and Whitten, 2021 were the first to demonstrate the increase in excitability of sympathetic neurons from stimulation of M1 mAChRs. Please try and cite in a more scholarly fashion.

      A) I have modified lines 192 and 198, removing the mention of oxotremorine.

      B) I have modified the nomenclature used to refer to cholinergic type 1 muscarinic receptors.

      C) I cited references on the role of M current in sympathetic motor neuron excitability. I also removed the reference (Kruse and Whitten, 2021) that referred only to the temporal correlation between the decrease of KCNQ current and excitability.

      (4) The authors may want to use the term "M current" (after defining it) as the current produced by KCNQ2&3-containing channels in sympathetic neurons, and reserve "KCNQ" or "Kv7" currents as those made by cloned KCNQ/Kv7 channels in heterologous systems. A reason for this is to exclude currents KCNQ1-containing channels, which most definitely do not contribute to the "KCNQ" current in these cells. I am not mandating this, but rather suggesting it to conform with the literature.

      Thank you for the suggestion. I have modified the text to use the term M current. I maintain the use of KCNQ only when referring to KCNQ channel, such as in the section describing the abundance of KCNQ2.

      (5) The section in the text on "Aging reduces KCNQ current" is confusing. Can the authors describe their results and their interpretation more directly?

      I am not sure I understand the request. I assumed points 5 and 6 are related and decided to answer point 6.

      (6) Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case?? What about KCNQ3? It would be very enlightening if the authors would just quantify the ratio of KCNQ2:KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves (see Shapiro et al., JNS, 2000; Selyanko et al., J. Physiol., Hadley et al., Br. J. Pharm., 2001 and a great many more). It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry.

      A. Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case?? Our interpretation is that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 channels. We do not claim that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our working hypothesis is that the reduction in M current is caused by changes in trafficking, degradation, posttranslational modifications, or cofactors for KCNQ2 or KCNQ3 channels. We have modified the description in the results section to clarify this concept.

      B. What about KCNQ3? Unfortunately, we did not find an antibody to detect KCNQ3 channels. I have added a sentence to state this.

      C. KCNQ2:KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves. This is a great idea. Thank you for the suggestion. Is this a necessary experiment for the acceptance of this manuscript?

      D. It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry. Reviewer 1 is correct. We did not assess for differences in the suppression of M current by mAChR activation. We do not see the connection of this experiment with the scope of the current investigation.

      (7) Why do the authors use linopirdine instead of XE-991? Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error?

      A. Why do the authors use linopirdine instead of XE-991? After validation of KCNQ2/3 inhibition by Linopirdine, we found the effect on membrane potential recordings to be reproducible. Linopirdine has also been reported to be reversible. We wanted to assess reversibility on the excitability of young neurons. We did not find the effect to be reversible. We performed experiments applying XE-991 while recording the membrane potential. XE-991 did not show a clear effect. I was not surprised by this. It is very likely that the pharmacological inhibition of one channel leads to the activation of other channel types. This is highlighted in the work by Kimm, Khaliq, and Bean, 2015. “Further experiments revealed that inhibiting either BK or Kv2 alone leads to recruitment of additional current through the other channel type during the action potential as a consequence of changes in spike shape.” In fact, it was quite remarkable that the aged and young phenotypes were mimicked by targeting KCNQ pharmacologically.

      B. Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so. I have added a sentence to point out that linopirdine is less potent than XE-991. It reads: “We want to point out that linopirdine is less potent than XE-991 and that it has been reported to activate TRPV1 channels (Neacsu and Babes, 2010). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.”

      C. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error? Thank you for pointing this out. I have added information for both retigabine and linopirdine in the Methods section; both were missing.

      (8) Can the authors use a more scientific explanation of RTG action than "activating KCNQ channels?" For instance, RTG induces both a negative shift in the voltage dependence of activation and a voltage-independent increase in the open probability, both of which differ in detail between KCNQ2 and KCNQ3 subunits. The authors are free to use these exact words. Thus, the degree of "activation" is very dependent upon voltage at any voltages negative to the saturating voltages for channel activation.

      I have modified the text to reflect your suggestion.
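
      The reviewer's point that the degree of activation depends on voltage can be illustrated with a simple Boltzmann activation curve. The sketch below is an illustration under stated assumptions, not a model fitted to our data: the half-activation voltages, slope factor, and maximal open probabilities are hypothetical values chosen only to represent a negative shift in voltage dependence combined with a voltage-independent increase in maximal open probability.

```python
import math


def open_probability(v_mv, v_half_mv, slope_mv, p_max):
    """Boltzmann activation curve: p_max / (1 + exp((V_half - V) / k))."""
    return p_max / (1.0 + math.exp((v_half_mv - v_mv) / slope_mv))


# Hypothetical parameters, for illustration only (not measured values).
CONTROL = {"v_half_mv": -30.0, "slope_mv": 10.0, "p_max": 0.30}
WITH_DRUG = {"v_half_mv": -50.0, "slope_mv": 10.0, "p_max": 0.45}


def fold_activation(v_mv):
    """Ratio of open probability with the drug to control at one voltage."""
    return open_probability(v_mv, **WITH_DRUG) / open_probability(v_mv, **CONTROL)


if __name__ == "__main__":
    for v in (-80.0, -60.0, -40.0, -20.0, 0.0, 40.0):
        print(f"{v:6.0f} mV  fold activation = {fold_activation(v):.2f}")
```

      In this sketch the fold activation is largest at negative voltages and approaches the ratio of the two maximal open probabilities at saturating voltages, which is why "activation" cannot be summarized by a single number.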

      (9) Methods: did the authors really use "poly-l-lysine-coated coverslips?" Almost all investigators use poly-D-lysine as a coating for mammalian tissue-culture cells and more substantial coatings such as poly-D-lysine + laminin or rat-tail collagen for peripheral neurons, to allow firm attachment to the coverslip.

      That is correct. We used poly-L-lysine-coated coverslips. Sympathetic motor neurons do not adhere to poly-D-Lysine.

      (10) As a suggestion, sampling M-type/KCNQ/Kv7 current at 2 kHz is not advised, as this is far faster than the gating kinetics of the channels. Were the signals filtered?

      That is correct. Currents were sampled at 2 kHz. Data were low-pass filtered at 3 kHz. Our conditions are not far from what is reported by others. Some sample at 10 kHz and even 50 kHz. Others do not report the sampling frequency.

      Reviewer #2:

      Weaknesses:

      None, the revised version of the manuscript has addressed all my concerns.

      I am glad we were able to satisfy previous concerns.

      Reviewer #3:

      The main weakness is that this study is a descriptive tabulation of changes in the electrophysiology of neurons in culture, and the effects shown are correlative rather than establishing causality.

      Allow me to clarify our previous responses and determine how this aligns with your concerns. In the previous revision, Reviewer 3 wrote: “It is difficult to know from the data presented whether the changes in KCNQ channels are in fact directly responsible for the observed changes in membrane excitability,” and suggested the “use of blockers and activators to provide greater relevance.” I assumed these comments were the main concern and that doing such experiments was enough to satisfy the criticism. It is discouraging to see that our experiments did not address the reviewer's concern that the work is correlative.

      If Reviewer 3 is referring to establishing causality between aging and a reduction in M current, I would like to emphasize that such an endeavor is complicated, as there is no clear experiment to resolve that issue. Our best attempt was to reverse aging with rapamycin, but the recommendation was to remove those experiments.

      … but the specifics of the effects and relevance to intact preparations are unclear. Additional experiments in slice cultures would provide greater significance on the potential relevance of the findings for intact preparations.

      I apologize for missing this point in the previous revision. The proposed experiments would require an upright microscope coupled to an electrophysiology rig. Unfortunately, I do not have the equipment to do these experiments.

      Summary of recommendations from the three reviewers:

      Please make corrections as suggested by reviewer 1 to improve the manuscript. Specifically, reviewer 1 suggests making changes to p values in Figure 5,

      It is not clear what the suggested changes are. The comment from Reviewer 1 says: The significance values greater than p < 0.05 do not add anything and distract focus from the results that are meaningful. If the suggested change is to remove p values > 0.05, I have explained my rationale for keeping those values. If the Journal has a specific format on how to report p-values, I will be happy to make appropriate changes.

      and the importance of citing original scholarly works related to effects of increase in excitability of sympathetic neurons by M1 receptors, and the terminology for M currents and KCNQ currents. These changes will improve the manuscript and are strongly recommended.

      I cited original papers on that area, and changed the terminology for M current. I kept KCNQ when referring to the channel protein or abundance.

      The section dealing with Aging Reduces KCNQ currents seems to contain a lot of extraneous information especially in the last part of the long paragraph and this section should be rewritten for improved clarity… and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates.

      A. I removed extraneous information in that section. It now reads: Previous work by our group and others demonstrated that cholinergic stimulation leads to a decrease in M current and increases the excitability of sympathetic motor neurons at young ages \cite{RN67,RN68,RN69,RN71, RN72, RN73, RN74, RN75}. The molecular determinants of the M current are channels formed by KCNQ2 and KCNQ3 in these neurons \cite{RN76, RN77, RN70}. Thus, Figure 6A shows a voltage response (measured in current-clamp mode) and a consecutive M current recording (measured in voltage-clamp mode) in the same neuron upon stimulation of cholinergic type 1 muscarinic receptors. It illustrates the temporal correlation between the decrease of M current with the increase in excitability and firing of APs upon activation with oxotremorine. This strong dependence led us to hypothesize that aging decreases M current, leading to a depolarized RMP and hyperexcitability (Figure 6B). For these experiments, we measured the RMP and evoked activity using perforated patch, followed by the amplitude of M current using a whole-cell voltage clamp in the same cell. We also measured the membrane capacitance as a proxy for cell size. Interestingly, M current density was smaller by 29% in middle age (7.5 ± 0.7 pA/pF) and by 55% in old (4.8 ± 0.7 pA/pF) compared to young (10.6 ± 1.5 pA/pF) neurons (Figure 6C-D). The average capacitance was similar in young (30.8 ± 2.2 pF), middle-aged (27.4 ± 1.2 pF), and old (28.8 ± 2.3 pF) neurons (Figure 6E), suggesting that aging is not associated with changes in cell size of sympathetic motor neurons, and supporting the hypothesis that aging alters the levels of M current. Next, we tested the effect on the abundance of the channels mediating M current. Contrary to our expectation, we observed that KCNQ2 protein levels were 1.5 ± 0.1-fold higher in old compared to young neurons (Figure 6F-G). Unfortunately, we did not find an antibody to detect consistently KCNQ3 channels. We concluded that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 protein.
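
      As a check on the arithmetic in the passage above, the reported percentage reductions follow directly from the mean current densities. The sketch below is illustrative only; the function names are hypothetical, and the calculation uses the group means quoted in the text rather than the underlying per-cell data.

```python
def current_density(current_pa, capacitance_pf):
    """Current density in pA/pF: current amplitude normalized to capacitance."""
    return current_pa / capacitance_pf


def percent_reduction(reference, value):
    """Percentage by which value is lower than reference."""
    return 100.0 * (reference - value) / reference


# Mean M current densities (pA/pF) quoted in the text.
YOUNG = 10.6
MIDDLE_AGED = 7.5
OLD = 4.8

if __name__ == "__main__":
    for label, density in (("middle-aged", MIDDLE_AGED), ("old", OLD)):
        reduction = percent_reduction(YOUNG, density)
        print(f"{label}: {reduction:.0f}% lower than young")
```

      Rounded to the nearest integer, these give the 29% and 55% reductions stated in the text.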

      B. and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates. I am not sure I understand the request regarding the section on the correlation of KCNQ with AP firing rate. I divided the long paragraph.

      The apparent lack of correlation between KCNQ current and KCNQ2 protein needs to be better explained. This is a central part of the study and this result undercuts the premise of the paper.

      Indeed, total KCNQ2 protein abundance increases while M current decreases. We do not claim in our work that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our current working hypothesis is that the reduction in M current is caused by changes in trafficking, degradation, posttranslational modifications, or cofactors for KCNQ2 or KCNQ3 channels. I have modified the description in the results section and discussion to clarify this concept.

      Additionally, the poor specificity of linopirdine for KCNQ should be pointed out in the limitations.

      I pointed out this limitation. It reads: We want to point out that linopirdine is less potent than XE-991 and that it has been reported to activate TRPV1 channels (Neacsu and Babes, 2010). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.

      Finally, the editor notes that the author response should not contain ambiguities in what was addressed in the revision. In the original summary of consolidated revisions that were requested, one clearly and separately stated point (point 4) was that experiments in slice cultures should be strongly considered to extend the significance of the work to an intact brain preparation. The author response letter seems to imply that this was done, but this is not the case. The author response seems to have combined this point with another separate point (point 3) about using KCNQ drugs, and imply that all concerns were addressed. Authors should be clear about what revisions were in fact addressed.

      As corresponding author, and the person directly responsible for the document provided in reply to the reviewers, I apologize for my mistake. After reviewing this comment, I realized I did not respond to the Major points in the Recommendations for the authors section from Reviewer 3. I missed that entire section. My previous responses addressed the Public review of Reviewer 3. When doing so, I did not separate the sentences, omitting the request to perform the experiment in slices.


      The following is the authors’ response to the original reviews.

      Reviewer #1

      Summary:

      The authors study age-related changes in the excitability and firing properties of sympathetic neurons, which they ascribe to age-related changes in the expression of KCNQ (Kv7, "M-type") K+ currents in rodent sympathetic neurons, whose regulation by GPCRs has been most thoroughly studied for over 40 years. The authors suggest the ingestion of rapamycin may partially reverse the age-related decrease in M-channel expression. With the rapamycin part included, it is unclear how this work will impact the field of age-related neuronal dysfunction, as the mechanistic information is not strong.

      Strengths:

      The strengths include the rigor of the current-clamp and voltage-clamp experiments, the lovely, crisp presentation of the data, and the expert statistics. The separation of neurons into tonic, phasic, and adapting classes is also interesting, and informative. The writing is also elegant, and crisp. The above is especially true of the manuscript up until the part dealing with the effects of rapamycin, which becomes less compelling.

      We appreciate the thoughtful comments and constructive feedback to improve the impact of the manuscript.

      Weaknesses:

      Where the manuscript becomes less compelling is in the rapamycin section, which does not provide much in the way of mechanistic insights. As such, the effect is more of an epi-phenomenon of unclear insight, and the authors cannot ascribe a signaling mechanism to it that is supported by data. Thus, this latter part rather undermines the overall impact and central advance of the manuscript. The problem is exacerbated by the controversial and anecdotal nature of the entire mTor/aging field, some of whose findings have very unfortunately had to be recently retracted.

      I would strongly recommend to the authors that they end the manuscript with their analysis of the role of M current/KCNQ channels in the numerous age-related changes in sympathetic neuron function that they elegantly report, and save the rapamycin, and possible mTor action, for a separate line of inquiry that the authors could develop in a more thorough and scholarly way.

      We agree with the reviewer in that we cannot ascribe a signaling mechanism to the reversibility observed with rapamycin. Therefore, we are following the recommendation of the reviewer and have removed the rapamycin section.

      We want to emphasize that, in the aging field, any advancement in the knowledge of how drugs such as rapamycin reverse age-associated phenotypes is of crucial importance. These drugs, commonly referred to as aging interventions, include rapamycin, calorie restriction, elamipretide, and metformin. We could have used any of these interventions. And yet, the cellular and molecular mechanisms for each one of these anti-aging drugs are unknown.

      We want to note that, although the nature of the mTOR field is controversial, the effect of rapamycin in extending lifespan and improving health is not. At least these authors have not been able to find retracted papers on that subject or notices from the NIA alerting on this issue. We kindly request the reviewer to provide the references related to rapamycin that were retracted so we can evaluate how that affects the rigor of the premise for our future work.

      As authors, we also find it important to note that we are confident of our observations regarding the effect of rapamycin, and that we are not removing this section because we are retracting our claims. We will use these data to continue our research of the mechanism behind the effect of aging on sympathetic motor neurons.

      Reviewer #2:

      Summary:

      This research shows compelling and detailed evidence showing that aging influences intrinsic membrane properties of peripheral sympathetic motor neurons such that they become more excitable. Furthermore, the authors present convincing evidence that the oral administration of the anti-aging drug Rapamycin partially reversed hyperexcitability in aged neurons. This study also investigates the molecular mechanisms underlying age-associated hyperexcitability in mouse sympathetic motor neurons. In that regard, the authors found an age-associated reduction of an outward current having properties similar to KCNQ2/Q3 potassium current. They suggested a reduction of KCNQ2/Q3 current density in aged neurons as a potential mechanism behind their overactivity.

      Strengths:

      Detailed and rigorous analysis of electrical responses of peripheral sympathetic motor neurons using electrophysiology (perforated patch and whole-cell recordings). Most of the conclusions of this paper are well supported by the data.

      We thank the reviewer for valuing our effort to present a detailed and rigorous analysis.

      Weaknesses:

      (1) The identity of the age-associated reduced current as KCNQ2/Q3 is not corroborated by pharmacology (blocking the current with the specific blocker XE-991).

      We have performed experiments using blockers of KCNQ channels. See responses below.

      (2) The manuscript does not include a direct test of the reduction of KCNQ current as the mechanism behind age-induced hyperexcitability.

      Thank you for raising this point. We have performed experiments blocking KCNQ channels with linopirdine in young neurons and found that the pharmacological reduction of KCNQ current was enough to depolarize the cell and, in some cases, elicit the firing of action potentials. We present the results in a new figure. We also added the description in the Results section.

      Reviewer #3:

      This is a descriptive study of membrane excitability and Na+ and K+ current amplitudes of sympathetic motor neurons in culture. The main findings of the study are that neurons isolated from aged animals show increased membrane excitability manifested as increased firing rates in response to electrical stimulation and changes in related membrane properties including depolarized resting membrane potential, increased rheobase, and spontaneous firing. By contrast, neuron cultures from young mice show little to no spontaneous firing and relatively low firing rates in response to current injection. These changes in excitability correlate with significant reductions in the magnitude of KCNQ currents in aged neurons compared to young neurons. Treating cultures with the immunosuppressive drug, rapamycin, which has known antiaging effects in model animals appears to reverse the firing rates in aged neurons and enhance KCNQ current. The authors conclude that aging promotes hyperexcitability of sympathetic motor neurons.

      The electrophysiological cataloging of the neuronal properties is generally well done, and the experiments are performed using perforated patch recordings which preserve the internal constituents of neurons, providing confidence that the effects seen are not due to washout of regulators from the cells.

      The main weakness is that this study is a descriptive tabulation of changes in the electrophysiology of neurons in culture, and the effects shown are correlative rather than establishing causality. It is difficult to know from the data presented whether the changes in KCNQ channels are in fact directly responsible for the observed changes in membrane excitability.

      We appreciate the constructive criticism. In an attempt to assess whether changes in KCNQ are in fact directly responsible for the changes in membrane excitability, we have performed experiments blocking KCNQ channels with linopirdine in young neurons and found that the pharmacological reduction of KCNQ current was enough to depolarize the cell and, in some cases, elicit the firing of action potentials. Conversely, we activated KCNQ channels in old neurons with retigabine and found that the pharmacological activation was enough to hyperpolarize the membrane potential and stop the firing of action potentials. This effect was reversible. These two experiments provide solid evidence for our statement that age-associated reduction of KCNQ activity is responsible for the hyperexcited state in sympathetic motor neurons. We present the results in a new figure (Figure 8). We also added the description in the Results section.

      Furthermore, a notable omission seems to be the analysis of Ca2+ currents which have been widely linked to alterations in membrane properties in aging.

      We thank the reviewer for the comment. We did omit to include data on our studies of calcium currents. We agree that the study of the effect of calcium currents is relevant as it can influence the afterhyperpolarization. Furthermore, we believe that potential effects on calcium currents need to be studied in relation to other physiological processes that depend on calcium, including excitation-transcription coupling, calcium handling, and neurotransmitter release. Adding this information to this manuscript would only contribute to the tabulation of effects that we observe in sympathetic motor neurons with aging. As our main goal was to determine the ion channels responsible for the hyperexcited state, voltage-gated calcium channels or other calcium sources could have reflected a more indirect mechanism as compared to changes in sodium or potassium currents. We will continue our investigation on calcium currents and report our observations in the future, but for now, we have decided to leave it out of this work.

      As well, additional experiments in slice cultures would provide greater significance on the potential relevance of the findings for intact preparations. Finally, experiments using KCNQ blockers and activators could provide greater relevance that the observed changes in KCNQ are indeed connected to changes in membrane excitability.

      We are happy to report that we have performed these experiments and that the results strengthen the conclusion that changes in KCNQ are connected to changes in membrane excitability.

      Recommendations for the authors:

      We recommend the following essential revisions summarized from the reviews:

      (1) Is the change in KCNQ current responsible for the altered membrane excitability? What happens to membrane excitability when KCNQ is partially blocked (see reviewer 2 comment below)? Conversely, what happens to the excitability of aged neurons if KCNQ is activated (e.g., with retigabine)? (see reviewer 3 comment below). Results of these important experiments are needed to support the argument that KCNQ underlies the alterations in firing and membrane excitability.

      We have responded to this point. Thank you for the suggested experiments. In summary, the new experiments show that blocking KCNQ channels in young neurons leads to depolarization and, in some cases, the firing of action potentials. Conversely, the activation of KCNQ channels in aged neurons leads to hyperpolarization and a cessation of firing. We have added a new figure and reported the results in the Results section.

      (2) Rapamycin experiments are underdeveloped and weak. These should be further developed by examining the effects of KCNQ blockers to see if their effects on membrane excitability are reversed. Also, see comment 2 from reviewer 1.

      We have followed the recommendation by reviewer 1 and removed the section on rapamycin.

      (3) The study should examine voltage-gated calcium currents to determine potential changes in these currents with aging. See reviewer 3 comments.

      We thank the reviewer for the comment. We performed preliminary experiments and found that aging impacts calcium currents. However, we omitted to include the data. In our opinion, the changes in calcium currents are outside the scope of this work, as the changes could be related to physiological processes that go beyond the control of firing. Effects on calcium currents need to be studied in relation to other physiological processes that depend on calcium, including excitation-transcription coupling, calcium handling, and neurotransmitter release. The study of the relationship between changes in calcium currents and those physiological processes would require multiple experiments and detailed analysis. We will continue our investigation on calcium currents and report our observations in the future, but for now, we have decided to leave it out of this work.

      We have also made the suggested edits to the Figures and Legends.

      (2) In Fig.4 panel H, Y-axis must be # AP at 100 pA.

      We corrected the axis in Figure 4H.

      (3) In Legend Fig. 5, the number of cells for each subpopulation (n) needs to be corrected. In plots F-I, n= 9, 7, and 3 seem to be the number of adapting cells for 12-, 64- and 115w-old, respectively, instead of the number of single, phasic, and old cells for 12-week-old mice. A similar correction seems to be needed for 64-week-old and 115-week-old.

      We corrected the n number in Figure 5.

      (4) In Figure 6 panel C, it would be helpful for a reader to align the voltage protocol depicted with the current shown.

      We have aligned the voltage protocol to the current traces.

      (5) In the legend of Figure 7, the description of panel A ends with "Magnitude of voltage step to elicit each trace is shown in black", however in panel A there is no voltage depiction. In the description of panel D, "N = X animals, n=x cells" must be corrected.

      We have modified the legend to clarify. It now reads: “Text at the right of each current trace corresponds to the voltage used to elicit that current.”

      New Figure 8

      Author response image 1.

      Pharmacological inhibition and activation of KCNQ channels mimic the age-dependent phenotype. A. Membrane potential recordings from two young neurons treated with 25 μM linopirdine during the time illustrated by the light gray box. No holding current was applied. B. Left: Summary of the resting membrane potential measured before (light orange) and after (dark orange) the application of linopirdine. Right: Summary of the depolarization produced by linopirdine calculated by subtracting the post-drug voltage from the pre-drug voltage (V). Data points are from N = 2 animals, n = 8 cells, 14-week-old mice. C. Membrane potential recordings from two aged neurons treated with 10 μM retigabine during the time illustrated by the light gray box. No holding current was applied. D. Left: Summary of the resting membrane potential measured before (light purple) and after (dark purple) the application of retigabine. Right: Summary of the hyperpolarization produced by retigabine calculated by subtracting the post-drug voltage from the pre-drug voltage (V). Data points are from N = 2 animals, n = 7 cells, 120-week-old mice. P-values are shown at the top of the graphs.

    1. Author Response

      The following is the authors’ response to the current reviews.

      Joint Public Review

      This study is concerned with the general question as to how pools of synaptic vesicles are organized in presynaptic terminals to support different types of transmitter release, such as fast synchronous and asynchronous release. To address this issue, the authors employed the classical method of loading synaptic vesicle membranes with FM-styryl dyes and assessing dye destaining during repetitive synapse stimulation by live imaging as a readout of the mobilization of vesicles for fusion. Among other findings, the authors provide evidence indicating that there are multiple reserve vesicle pools, that quickly and slowly mobilized reserves do not mix, and that vesicle fusion does not follow a mono-exponential time course, leading to the notion that two separate reserve pools of vesicles - slowly vs. rapidly mobilizing - feed two distinct releasable pools - reluctantly vs. rapidly releasing. These findings are valuable to the field of synapse biology, where the organization of synaptic vesicle pools that support synaptic transmission in different temporal and stimulation regimes has been a focus of intense experimentation and discussion for more than two decades.

      On the other hand, the present study has limitations, so that the authors’ key conclusions remain incompletely supported by the data, and alternative interpretations of the data remain possible. The approach of using bulk FM-styryl dye destaining as a readout of precise vesicle arrangements and pools in a population of functionally very diverse synapses bears problems. In essence, the approach is ’blind’ to many additional processes and confounding factors that operate in the background, from other forms of release to inter-synaptic vesicle exchange. Further, averaging signals over many - functionally very diverse - synapses makes it difficult to distinguish the dynamics of separate vesicle pools within single synapses from a scenario where different kinetics of release originate from different types of synapses with different release probabilities.

      We thank the editors and reviewers for their time and patience, and are happy that they found our results valuable.

      We do not have a clear understanding of what the alternative interpretations might be - beyond those already addressed - but would like to. At present, we believe that the evidence for parallel processing of slowly and quickly mobilized reserve vesicles is solid and hope that people who are open to the possibility will evaluate the reasoning described within our report. The hypothesis that reserves are kept separate because they feed distinct subdivisions of the readily releasable pool remains to be tested.

      Beyond that, we have used FM-dye de-staining as a bulk measurement of sub-synaptic events in the sense that we have made no attempt to measure mobilization of isolated individual vesicles. We do not see how this necessarily leaves viable alternative interpretations, but this is difficult to evaluate without knowing what the alternatives might be. On the other hand, the FM-dye technique has had good resolution at the level of distinguishing between individual synapses since at least Murthy et al. (2001). For our part, we are confident that our analysis in Figure 3 combined with the results in Figures 4-11 shows that the multiple reserve pools co-occur in many individual presynaptic terminals. We did not use electron microscopy to confirm that all of the punctae analyzed in Figure 3 were indeed single synapses, but the reviewers did not recommend this, and we believe there is already enough published about the spatial distribution of synapses in cell culture to be confident that many of the punctae that are smaller than 1.5 µm were individuals.

      Overall, we have attempted to address all of the individual concerns raised by reviewers, and our understanding is that these concerns and our responses will be available on the eLife website. The reviewers were not convinced on every point, but these are cases where the nature of the concern was not clear to us. We hope that people who share these concerns will check out our responses and contact us with any further questions or alternative interpretations.

      (1) The authors sincerely addressed many of the previous concerns, mainly by clarification. The data are consistent with the authors’ hypothesis. The pool concept is somewhat similar to that of Richards et al (2000) and Rey et al (2015). The authors further propose that two reserve pools feed vesicles to two readily-releasable pools independently.

      To clarify further: The possibility that distinct reserve pools feed distinct readily releasable pools is predicted by our working model, and is something that we would like to test in the future, but is not a conclusion of the present study. Instead, in the present study, we tested the prediction that quickly and slowly mobilized reserve vesicles are processed in parallel without making assumptions about the underlying mechanism.

      Unfortunately, the heterogeneity among individual synapses remains a concern as shown in (some of) the raw data (Fig. 3 and supplements).

      We emphasize that we have not attempted to minimize the extensive heterogeneity among synapses, but actually highlight this. In fact, we chose the image in Figure 3 for an example in part because of the lower left region replicated in Figure 3 supplement 2 demonstrating extensive heterogeneity along what appears to be a single axon. We are not the first to notice the heterogeneity (see Waters and Smith, 2002), but we do provide a new possible explanation which, if correct, might be important for understanding biological computation (see our Discussion). At the same time, we believe that our evidence for multiple reserve pools within individual synapses with heterogenous properties is compelling. We see no contradiction, and indeed, our conclusion that the ratio of slowly to quickly mobilized varies extensively between synapses can only be correct if individual synapses contain multiple types. We hope that people who are interested in our conclusions will evaluate the evidence and reasoning presented in our report.

      Bulk imaging of FM de-staining does not really measure the fraction of non-stained vesicles, which changes dynamically during stimulation, so that the situation calls for an independent readout of stained and non-stained vesicles. Moreover, direct correspondence between two specific stimulation frequencies (with long stimulation) and vesicle pools is not straightforward. These issues make the experimentally measured pools not well-defined.

      We think that the reviewer is suggesting an alternative scenario where decreases in the fractional rate of FM-dye de-staining seen during 1 Hz stimulation might be caused by a large (4-fold) increase in the total size of the reserve pool that dilutes the stained vesicles by mixing. This scenario is consistent with the results in Figures 2 and 4-7, and initially seems plausible because previous studies have shown that many vesicles are not mobilized, and therefore are not stained, during our standard loading protocol of 100 s at 20 Hz (Harata et al., 2001). However, liberation of this "deep reserve" as an explanation for the decrease in fractional destaining is not compatible with the results in Figures 10-11 that rule out mixing. For example, liberation of the deep reserve would cause fractional destaining to appear equally depressed during subsequent 20 Hz stimulation, and Figure 10 shows that this is not the case. The scenario cannot be rescued by postulating that the subsequent 20 Hz stimulation caused the deep reserve to quickly recapture the liberated vesicles because Figure 11D-E shows that fractional de-staining continues to be depressed at the very beginning of a second 1 Hz train that follows the 20 Hz stimulation.

      (2) The authors’ latest round of responses did not alleviate most of my major previous concerns. The additional data now shown in Fig 3 rely on conceptually the same type of bulk measurements and thus suffer from the same limitations as outlined in the earlier review.

      We believe that the new evidence in Figure 3 for multiple reserve pools at individual synapses is strong when evaluated in combination with the results in Figures 4-11. We do not, at present, see how the fact that FM-dye destaining is used as a bulk measurement at the sub-synaptic level could undercut our logic.

      Moreover, the image of neuronal cultures shown in Fig. 3 might be problematic. It shows very bright staining with large round lumps, which may be indicative of unhealthy cultures.

      Unhealthy cultures are not a concern because we used strict quantitative criteria to assess health that are better than we have seen elsewhere (details below). We think the reviewer might be reacting to the way we rendered the image; i.e., as “overexposed”. We did this to highlight the dimmest punctae, which is a key element of the analysis. The same image rendered with less contrast is now displayed in Author response image 1 (3rd panel from left).

      Author response image 1.

      Image to left is a reproduction of the example image in Figure 3, which was the average of 120 time lapse raw data images; scale bar is 20 µm. The second image is a replicate except all 69 punctae that were included in the study are occluded by 1.5 µm × 1.5 µm yellow squares. The third image is another replicate except with a different brightness setting. The rightmost image is one of the raw data images with brightness matched to the third image.

      More details (relevance to in vivo is in point 4):

      (1) Identifying unhealthy cultures is straightforward with our technique because synapses in unhealthy cultures destain spontaneously. Our criterion for accepting experiments for further analysis was less than 1.5 % spontaneous rundown/minute. This is a better way to judge health than we have seen elsewhere because it eliminates subjective decisions, and would be equally applicable for microscopes and imaging software of any quality. For our part, we used a 25X objective with a low numerical aperture and low intensity illumination that allowed us to completely avoid photobleaching. The images will look worse to some compared to when acquired with a higher quality microscope, but the absence of photobleaching is an important benefit because it allowed us to avoid complicated corrections.

      (2) Stained areas larger than 1.5 µm across - such as the ones noted by the reviewer - were expressly excluded from our study because they could have been clusters of multiple synapses. The size criteria are detailed in the Legend of Figure 3. Punctae and larger areas that were excluded are the ones that are not occluded by yellow squares in the 2nd image from the left, above; at least two of the largest were likely clusters of synapses that were out of focus. Nevertheless, despite being excluded, it is unlikely that the stained areas larger than 1.5 µm in the image in Figure 3 were characteristic of unhealthy cultures because these areas did not de-stain spontaneously, but instead de-stained in response to 1 and 20 Hz electrical stimulation much like the small punctae that were included in the analysis.

      (3) Electron microscopy results have shown that individual synapses vary >10-fold in size, so a large range of brightness is expected (Murthy et al., 2001). The large range would either make the brighter punctae and clusters appear to be overexposed in a printed image, or render the dimmer punctae invisible. We have opted to present an image with overall brightness adjusted so that the dimmest punctae are visible. This is appropriate because one of the concerns was that analyzing the dimmest punctae would reveal underlying populations where the rate of fractional destaining was constant. In the end, no evidence for underlying populations emerged, which supports the conclusion that the decreases in fractional destaining occur at individual synapses. Note that adjusting brightness for example images was unavoidable; we used the camera in a range that was far below saturation and, because of this, images presented without adjusting brightness would appear to be completely black.

      (4) Primary cell cultures are non-physiological by definition, so the concept of health is intrinsically arbitrary, and relevance to synapses in brains is questioned routinely. However, the new findings in the present report are that: (1) individual hippocampal synapses contain multiple reserve pools; (2) the reserves remain separate but are not distinguishable by the timing of mobilization when the frequency of stimulation is high; and (3) the reserves are nevertheless processed in parallel even when the frequency of stimulation is high. Of these, finding (1) has been reported previously for other synapse types, but findings (2) and (3) were both unexpected, and finding (3) was not compatible with current concepts. Nevertheless, all three findings were predicted by a model that was developed to explain orthogonal results from studies of intact synapses in ex vivo slices that did not fit with current concepts either, as referenced in the Introduction. Because of this, we think that the parallel processing of quickly and slowly mobilized reserve vesicles likely occurs in individual Schaffer collateral synapses in vivo, and is not a cell culture artifact; the alternative would be too much of an unlikely coincidence.
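
      The acceptance criterion in point (1) above (less than 1.5 % spontaneous rundown per minute) lends itself to a simple automated check. The following is a minimal sketch only: the response does not state how rundown was computed, so the straight-line fit to the pre-stimulus baseline, the normalization to the fitted starting fluorescence, and the function names are all assumptions made for illustration.

```python
import numpy as np

def rundown_percent_per_min(t_min, fluorescence):
    """Percent of starting fluorescence lost per minute (assumed definition).

    A straight line is fitted to the unstimulated baseline; the slope is
    expressed relative to the fitted value at the first time point.
    """
    t = np.asarray(t_min, dtype=float)
    f = np.asarray(fluorescence, dtype=float)
    slope, intercept = np.polyfit(t, f, 1)   # least-squares line through baseline
    f_start = intercept + slope * t[0]       # fitted fluorescence at first sample
    return -100.0 * slope / f_start

def accept_experiment(t_min, fluorescence, max_rundown=1.5):
    """True when spontaneous rundown is below the threshold (percent per minute)."""
    return rundown_percent_per_min(t_min, fluorescence) < max_rundown

# Hypothetical baseline: 5 min sampled every 30 s, losing 1 % of the start per minute.
t = np.arange(0, 5, 0.5)
baseline = 100.0 - 1.0 * t
print(accept_experiment(t, baseline))
```

      Because the check depends only on the relative slope, it would apply unchanged to data from any microscope, which is the property the authors emphasize.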

      References

      Harata N, Pyle JL, Aravanis AM, Mozhayeva M, Kavalali ET & Tsien RW (2001). Limited numbers of recycling vesicles in small CNS nerve terminals: implications for neural signaling and vesicular cycling. Trends in Neurosciences 24, 637–43.

      Murthy VN, Schikorski T, Stevens CF & Zhu Y (2001). Inactivity produces increases in neurotransmitter release and synapse size. Neuron 32, 673–82.

      Waters J & Smith SJ (2002). Vesicle pool partitioning influences presynaptic diversity and weighting in rat hippocampal synapses. Journal of Physiology 541, 811–23.


      The following is the authors’ response to the original reviews.

      Reviewer 1

      Mahfooz et al. investigated the time course of synaptic vesicle fusion of cultured mouse hippocampal synapses using FM-styryl dyes. The major finding is that the FM destaining time course deviates from a mono-exponential function during 1 Hz, but not 20 Hz stimulation. The deviation from a mono-exponential function was also seen during a second stimulus train applied after recovery periods of several minutes, or after depletion of the readily-releasable vesicle pool. Furthermore, this "decreased fractional destaining" was unlikely due to long-term synaptic depression, or incomplete dye clearance. Fractional destaining was enhanced when the dye was loaded with 1 Hz compared with 20 Hz stimulation, suggesting that vesicles recycled during 1 Hz stimulation are predominantly sorted into a rapidly mobilized pool. Finally, they show that 20 Hz stimulation does not affect the decrease in fractional destaining induced and recorded during 1 Hz stimulation. Based on these observations, they put forward a model in which slowly and quickly resupplied synaptic vesicles are mobilized in parallel.

      The demonstration that FM destaining time courses deviate from single exponentials during 1 Hz stimulation (Figs 2-3) is a starting point used to rule out simple models where vesicles intermix freely and to introduce a mathematical technique for quantifying the extent of the deviations that is essential for the analysis of later experiments, where curve fitting could not be used. We then:

      1) Show that the deviation from simple models is not caused by depletion of the readily releasable pool, as noted by the reviewer;

      2) rule out a number of explanations for the deviation that do not involve reserve pools at all, again as noted;

      3) provide affirmative evidence for the presence of multiple reserve pools by labeling them with distinct colors;

      4) show that the vesicles within the distinct reserve pools do not intermix even when activity is intense enough to drive destaining with single exponential kinetics.

      We believe that the 4th point - documented in Figs 10-11 - is a key element.

      Beyond that, we note that our working model arose from previous studies, as referenced in the Introduction, not from the present results. The model did predict the parallel processing of quickly and slowly mobilized reserves, and the present study was designed to test this prediction. In that sense, the evidence in the current study supports our working model, not the other way around.

      In any case, most readers in the near term will be more interested in the serial versus parallel question, and less in precisely what the present results mean for evaluating our working model. Because of this, we emphasize that evidence for parallel processing of separate reserve pools depends solely on experimental results within the study, and not on modeling. As a consequence, the evidence will continue to be equally strong even if problems with our working model arise later on (lines 382-386).

      We do have additional unpublished evidence for the working model that does not bear directly on the parallel versus serial question. Some of this was removed from an earlier version of the manuscript and some has been newly gathered since the original submission. We will publish the additional evidence at a later point. We decided not to include it in the present manuscript expressly to avoid confusion about the relationship between modeling and the evidence for parallel processing in general.

      The paper addresses an interesting question - the relationship between the resupply and release of synaptic vesicles. The study is based on a lot of data of high quality. Most data are solid. However, some of the major conclusions are not well supported by the data. Moreover, it remains unclear how specific the findings are to the experimental design.

      The following points should be addressed:

      1) Most traces display a decrease in fluorescence intensity before stimulation. Data with a decrease in baseline fluorescence intensity of up to 1.5 % were considered for the analysis (Fig 2-supplement 2). I may have missed it, but were the data corrected for the observed decrease in baseline fluorescence intensity? (In the model shown in Appendix 1 Figure 1, they correct for "rundown"). For instance, are the residuals shown in Fig 2D, E based on corrected data? In case the data would not be corrected for a decrease in baseline fluorescence, would the decay kinetics also deviate from a single exponential after correction?

      We did not correct for rundown - as now noted on lines 96-97 - except in the figure in the Appendix, noted by the reviewer, where the uncorrected and corrected time courses are plotted side by side for easy comparison. However, our study includes an analysis showing that correcting for rundown during 1 Hz stimulation would increase - not decrease - the deviation from a single exponential (2 bars in rightmost panel in Fig 2C, and lines 113-116 of Results), so the absence of a correction does not weaken our conclusions.

      2) The analysis of "fractional destaining" is not clear to me. How many intervals of which length were chosen and why? For instance, the intervals often differ in length, number and do not cover the complete decay (e.g., Fig 2B).

      We calculated fractional destaining from longer intervals at later times because the overall amount of stain was less, meaning signal/noise was less, and scatter was more. We did this because increased scatter at later times could be counteracted by estimating the slope of destaining from longer intervals. An additional benefit is that elongating the later intervals allowed us to plot only 6 bars for 25 min of 1 Hz destaining, which works better visually than 17.

      Increasing the interval length for later times is mathematically sound because the key factor causing distortions related to deviations from linearity is not the length of the interval per se but, instead, the fractional destaining over the interval. The fractional destaining is greater at the start of 1 Hz stimulation, thus requiring shorter intervals.

      It would be possible to choose inappropriately long intervals that would distort estimates of the change in fractional destaining. However, we now include Fig 2-supplement 6 – which includes all 17 1.5 min intervals - to confirm that any distortions after the first interval were minimal. The Appendix predicts a biologically important distortion for the first interval which we are following up, but this would underestimate the true deviation from quickly mixing pools, so would not be problematic for the present conclusions.
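
      The interval-based measurement discussed above can be illustrated with a short sketch. It rests on assumptions not spelled out in this response: fractional destaining is taken here as the slope of a straight-line fit within each interval, divided by the fitted fluorescence at the start of that interval, and the time constants in the synthetic traces are hypothetical. Under that definition, a single exponential gives the same value in every equal-length interval, whereas a mixture of fast and slow components gives progressively smaller values.

```python
import numpy as np

def fractional_destaining(t_min, f, edges_min):
    """Fractional destaining rate (per minute) in each interval.

    Assumed definition: -(slope of a linear fit) / (fitted value at the
    interval start), for samples with edges[i] <= t < edges[i + 1].
    """
    t = np.asarray(t_min, dtype=float)
    f = np.asarray(f, dtype=float)
    rates = []
    for start, stop in zip(edges_min[:-1], edges_min[1:]):
        sel = (t >= start) & (t < stop)
        slope, intercept = np.polyfit(t[sel], f[sel], 1)
        f_start = intercept + slope * start
        rates.append(-slope / f_start)
    return np.array(rates)

# Synthetic traces sampled once per second for 25 min (hypothetical parameters).
t = np.arange(1500) / 60.0
edges = [0, 5, 10, 15, 20, 25]
mono = np.exp(-t / 10.0)                                   # one freely mixing pool
mixed = 0.5 * np.exp(-t / 2.0) + 0.5 * np.exp(-t / 40.0)   # fast + slow components

print(fractional_destaining(t, mono, edges))    # constant across intervals
print(fractional_destaining(t, mixed, edges))   # decreases over time
```

      Passing unequal edges (short early intervals, longer late ones) mirrors the authors' choice of lengthening later intervals to offset the lower signal-to-noise once most of the dye has been released.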

      Sometimes, only the interval right after stimulation onset was considered (e.g., Fig 7, 8).

      Figs 7, 8 in the previous version are now Figs 8, 9.

      This is appropriate because the goal was to estimate the fractional destaining at the very start, before the quickly mobilized fraction has destained.

      How quickly fractional destaining is expected to revert to the lowest value seen after 15 min of 1 Hz stimulation in Fig 2 (and elsewhere) depends very much on assumptions - such as the number of reserve pools, etc. We sought to avoid this kind of additional analysis because we are keen to avoid the impression that our main conclusions depend on the specifics of modeling.

      How sensitive are the changes in fractional destaining to the choice of the intervals?

      Minimally. This can be seen by eye because the magenta lines in Fig 2B fit the data well, but see Fig 2-supplement 6 for a quantitative comparison.

      For instance, would fractional destaining be increased if later intervals would have been chosen for the second 20 Hz stimulus in the experiment shown in Fig 9B?

      Previous Fig 9B is now Fig 10B.

      We cannot be certain, but think it probably would not be different. Neither an increase nor a decrease would be problematic for our conclusions.

      More detail: There is not enough data to evaluate this specifically for Fig 10B because the total amount of stain remaining at later intervals is little, meaning signal/noise is low, which causes extensive experimental scatter. However, synapses were even more extensively destained prior to time course c of Figure 2-supplement 2C, which nevertheless matches time courses a, b, and d.

      I propose fitting all baseline-corrected data with a single and a double-exponential function (as well as single exponential plus line?) and reporting the corresponding time constants (slopes) and amplitudes.

      As noted above, we purposefully do not baseline correct data in a way that would make this possible. However, we do include exponential fits when appropriate, in Fig 2D-E, Fig 2-supplement 1, Fig 2-supplement 7, Fig 2-supplement 8, and Fig 12B.

      Indeed, the absence of any change in the weighting parameter despite substantial changes for both time constants seen after raising the temperature to 35 °C (Fig 2-supplement 8 vs Fig 12B) is notable because it suggests that the contents of the reserve pools are not altered by changing temperature, even though vesicle trafficking is accelerated. Fig 2-supplement 8 is a supplementary figure because the result is outside the scope of the main point, not because the quality is lower than for other figures.

      Beyond that, exponential fits would not be adequate for most of the study because many experiments - including the core experiments in Figs 10-11 - require discontinuous stimulation, such as when we stop stimulating at 1 Hz, rest for minutes, and then start up again at 1 or 20 Hz. And, although widely used, exponentials are non-linear equations after all. Even when they can be used to quantify time courses, the fractional destaining measurement is almost always more informative, in the technical sense, because it avoids complications when estimating the importance of deviations occurring at the two extremes versus deviations in the middle of the time course.

      3) Along the same lines, is the average slow time constant indeed around 40 min? (Are the data shown in Fig 2 S7 based on an average?) If this would be the case, I suggest conducting a control experiment with a recording time > 40 min. Would fitting an exponential or a line to baseline data (without stimulation) also give a similar slow component?

      Fig 2-supplement 7 in the previous version is now Fig 2-supplement 8.

      First, yes, the time course shown in Fig 2-supplement 8 is the mean across preparations. The time courses of the individual preparations were quantified as the median value of the individual ROIs before averaging.
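
      The two-stage summary described above (median across ROIs within each preparation, then mean across preparations) can be sketched as follows. The array layout and the function name are assumptions for illustration; the numbers are made up and are not data from the study.

```python
import numpy as np

def mean_time_course(preparations):
    """Mean across preparations of the per-preparation median ROI time course.

    preparations: list of arrays, each shaped (n_rois, n_timepoints).
    """
    per_prep = [np.median(np.asarray(p, dtype=float), axis=0) for p in preparations]
    return np.mean(per_prep, axis=0)

# Two hypothetical preparations with different numbers of ROIs.
prep_a = [[1.0, 2.0], [3.0, 4.0], [100.0, 200.0]]   # one outlying ROI
prep_b = [[5.0, 6.0], [7.0, 8.0]]
print(mean_time_course([prep_a, prep_b]))
```

      Taking the median within a preparation keeps a single unusually bright or dim ROI from dominating, while the mean across preparations weights each preparation equally regardless of how many ROIs it contributed.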

      Second, no, fitting baseline data would give an approximately 3-fold greater time constant (i.e., 120 min) because fractional destaining decreases by about 3-fold when we stop stimulating after 25 min of 1 Hz stimulation (i.e., Fig 2C, 3B, and many others).
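
      The 3-fold figure follows from the inverse relation between fractional destaining and the time constant of an exponential. As a worked sketch, assuming the slow component is approximately a single exponential with the roughly 40 min time constant mentioned in the reviewer's question:

```latex
\[
\begin{aligned}
F(t) &= F_0\, e^{-t/\tau}
  \;\Longrightarrow\;
  -\frac{1}{F}\frac{dF}{dt} = \frac{1}{\tau}, \\
\tau_{\mathrm{baseline}} &\approx 3\,\tau_{\mathrm{slow}}
  \approx 3 \times 40\ \mathrm{min} = 120\ \mathrm{min}.
\end{aligned}
\]
```

      A 3-fold drop in fractional destaining when stimulation stops therefore corresponds to a time constant about 3-fold longer than during stimulation.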

      The key point is that fractional destaining decreases greatly over long trains of 1 Hz stimulation.

      For Fig 2, we saw a 2.7+/-0.1-fold decrease before accounting for baseline destaining (lines 106-110), which increased to a 4.4-fold decrease when we did account for baseline destaining (lines 113-116). Overall, the 2.7-fold value is simultaneously a safe minimum boundary, and much greater than the value of 1.0 expected from models where vesicles mix freely.
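The contrast between a freely mixing pool and the observed decrease can be illustrated with a minimal sketch. It assumes that fractional destaining means the fraction of the remaining fluorescence lost per unit time; the function name, the time constants, and the equal weighting of the two components are hypothetical values chosen for illustration, not fitted parameters from the study.

```python
import numpy as np

def fractional_destaining(t, f):
    """Fraction of the remaining fluorescence lost per unit time in each interval."""
    t = np.asarray(t, dtype=float)
    f = np.asarray(f, dtype=float)
    return (f[:-1] - f[1:]) / (f[:-1] * np.diff(t))

# Hypothetical time base: 25 min sampled once per minute.
t = np.arange(0.0, 26.0, 1.0)

# Freely mixing single pool: one exponential (illustrative 10 min time constant).
single = np.exp(-t / 10.0)

# Two non-mixing components (illustrative 3 min and 40 min time constants, equal weights).
double = 0.5 * np.exp(-t / 3.0) + 0.5 * np.exp(-t / 40.0)

fd_single = fractional_destaining(t, single)
fd_double = fractional_destaining(t, double)

# Fold decrease between the first and last interval of the train.
ratio_single = fd_single[0] / fd_single[-1]
ratio_double = fd_double[0] / fd_double[-1]

print(f"single pool fold decrease: {ratio_single:.2f}")
print(f"two components fold decrease: {ratio_double:.2f}")
```

With a single exponential the fold decrease stays at 1.0 regardless of the time constant, which is the reference value mentioned above; any mixture of components with different time constants pushes the ratio above 1.0.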

      Note that future studies will show that even the 4.4-fold value is probably an underestimate because 1 Hz stimulation misses a fast component at the very beginning of the time courses, as predicted in the Appendix.

      4) How specific are the findings to 1 Hz (and 20 Hz) stimulation? From which frequency onward can a decrease in fractional destaining be no longer observed?

      Our logic depends only on the premise that we are able to find some frequency where fractional destaining no longer decreases. We knew that 20 Hz was a good place to start because of previous electrophysiological experiments - frequency jumps (Fig 1 of Wesseling and Lo, 2002 and Fig 2C of Garcia-Perez and Wesseling, 2008), and trains of action potentials followed by osmotic shocks (Fig 2A of Garcia-Perez et al., 2008) - showing that 20 Hz stimulation is enough to nearly completely exhaust the readily releasable pool. This is noted in lines 202-203, and Box 2.

      Would previous stimulation with frequencies <20 Hz interfere with fractional destaining? These control experiments would help assessing how general/specific the findings are.

      Yes (Figs 4 and 11A at 1 Hz). Also, we have done experiments at 0.1 Hz, which will be published later; some of these were actually removed from an earlier version of the manuscript because the results are primarily relevant to deciding between particular parallel models, and are not relevant to the conclusion of the present study that quickly and slowly mobilized reserves are processed in parallel.

      Similarly, a major conclusion of the paper - the parallel mobilization of two vesicle pools - is largely based on these two stimulation frequencies. Can they exclude that mixing between the two pools occurs at other frequencies?

      We cannot exclude the possibility of breakdown at a higher frequency, but this would not undercut our conclusions. We do not have plans to try this experiment because: (1) a positive result would be open to concerns about non-physiologically heavy stimulation; and (2) a negative result would be difficult to interpret because of the possibility that the axons cannot follow at higher frequencies.

      6) Some information in the methods section is lacking. For instance, which species is the cell culture based on?

      Mice of both sexes were used. This is now specified in the Methods.

      Reviewer 2

      By using optical monitoring of synaptic vesicles with FM1-43 at hippocampal synapses, the authors try to show the evidence for two parallel reserve pools of synaptic vesicles, which feed the vesicles to the readily releasable pool. The major strength of the study is the use of a quantitative model, which can be readily testable by experiments: in the course of the study, the authors propose the best vesicle pool model, which fits the experimental data "averaged over synapses" nicely. On the other hand, the weak point of the study comes from the optical method and the data: bulk imaging of vesicle dynamics monitored at each synapse is noisy and the signals vary considerably among synapses. Therefore, the average signals over many synapses may not reflect the vesicle dynamics of two reserve pools within a synapse, but something else, such as the different kinetics of release from multiple synapses with different release probability. Nevertheless, a new framework of two reserve pools offers a testable hypothesis of vesicle dynamics, and the use of single vesicle tracking and EM may allow one to give a definitive answer in the future studies. Therefore, the study may be of interest to the community of synaptic neurobiology.

      1) The current version includes a new figure (Fig 3) showing that the deviations from single pool models seen in populations are caused by deviations occurring at the level of single synapses. The heterogeneity between synapses actually causes population statistics to underestimate - not overestimate - the mean and median size of the deviations at individuals.

      We think the new evidence in Fig 3 and supplements is conclusive without follow-on EM of the same punctae given the substantial body of already published EM on similar cultures. Essentially, the only way to explain the results without invoking multiple reserve pools in individual synapses would be to say that individual synapses ALWAYS come in clumps containing multiple types and are NEVER separated from neighbors by more than 1.5 microns - even when the clumps are separated from each other by 5 microns. There is already clear evidence against this.

      2) No new model is proposed here, see the first response to the first reviewer.

      3) We are not aware of alternative hypotheses that could account for our results, so cannot evaluate if single vesicle tracking and EM could add meaningful additional support.

      1) The existence of non-stained vesicles complicates the interpretation of the data. Because the release by 20 Hz and 1 Hz stimulation does not entirely reflect the release from fast and slow vesicle pools, the estimation of non-stained vesicles using synaptopHluorin (+bafilomycin) and EPSCs would be helpful to examine the fraction of non-stained / stained vesicles over time (with stimulation, the ratio may change dynamically, which may bring complications).

      Non-stained vesicles are not a complication, but instead a key element of our logic which is included in the diagrams in Boxes 1 and 2 and Figure 9. That is, quickly and slowly mobilized reserves can be distinguished at 1 Hz precisely because 1 Hz is not intense enough to exhaust the readily releasable pool (Box 2). The corollary is that stained vesicles must be replaced by non-stained vesicles, because otherwise 1 Hz stimulation would exhaust the readily releasable pool. And this is why FM-dyes (plus a beta-cyclodextrin during washing) are ideal for the current questions whereas other techniques, such as electrophysiology or synaptopHluorin imaging are obviously indispensable for other questions, but could not replace the FM-dyes in the current study. This is now noted on lines 86-89.

      We are aware that synaptopHluorin + bafilomycin could, in principle, accomplish some of the same goals. However, bafilomycin ended up being toxic when applied for tens of minutes, as it would have to be in our experiments. And, we do not see what critical question is not already answered with strong evidence using FM dyes.

      2) Individual synapses show marked differences in the time course of de-staining, suggesting differences in release probability. The averaging of the whole data may reflect "average" behavior of synapses, but for example, bi-exponential time course may reflect high Pr and low Pr synapses, rather than vesicle recruitment.

      The authors may comment on this issue.

      See newly added Fig 3, and responses above.

      3) Some differences are very small (Fig 10, the same amplitude as bleaching time course), and I am not certain if the observed differences are meaningful, given low signal to noise ratio in each synapse.

      Fig 10 in the previous version is Fig 11 in the current version.

      Even if correct, this would not be problematic because 20 Hz stimulation clearly did not cause fractional destaining to return to the initial value when stimulation was resumed at 1 Hz (compare d and f in Fig 11E). In any case, Figs 2C, 3B, 5B, 7B, and Fig 10-supplement 2A all show that the minimum fractional destaining value during 1 Hz stimulation is about 3-fold greater than during subsequent rest intervals, which is not a small difference. Also, note that Fig 2-supplement 3 shows that photobleaching likely did not play a role.

      Reviewer 3

      Reviewer #3 (Recommendations For The Authors):

      This study attempts to conceptualize the long-standing question of vesicle pool organization in presynaptic terminals. Authors used classical FM dye release experiments to support a hypothesis that rapidly and slowly releasing vesicles are mobilized in parallel without intermixing. This modular model is also supported indirectly by the authors’ recent findings of molecular links that connect a subset of vesicles in linear chains (published elsewhere).

      Our study should be seen as a test of the hypothesis that quickly and slowly mobilized reserves are processed in parallel. The evidence is independent of any modeling, and would continue to be equally strong if our working model turns out to be incorrect (lines 382-386).

      The scope of the original model was limited by a number of caveats. The main concerns included a limited data set measured in bulk from a highly heterogeneous synapse population, and a complex interrelationship between vesicle mobilization and the bulk FM dye de-staining kinetics. The second major limitation was measurements being performed at room temperature, which inhibits or alters a number of critical synaptic processes that are being modeled. This includes the efficiency of exo/endocytosis coupling, vesicle mobility and release site refractory period, which are stimulus- and temperature-dependent, but were not accounted for in the original model.

      The present study contains experiments at body temperature (Fig 12 and Fig 12-supplement 1 in the current version) and analyses of individual synapses (especially Fig 3 in the current version). To our knowledge all results are consistent with everything that is known about the efficiency of exo/endocytosis coupling, vesicle mobility and release site refractory periods.

      The authors made strong efforts to address previous concerns. However, the main conceptual point, i.e. linking the bulk FM dye de-staining kinetics with precise arrangement of vesicle pools, is not well supported and is generally highly problematic because it ignores many additional processes and confounding factors.

      For example, vesicle exchange between neighboring synapses constitutes from 15% to over 50% of total recycling vesicle population, and therefore is a major contributing factor to FM dye loss/redistribution, but is not considered in this study. Additionally, this vesicle exchange process undergoes calcium/activity-dependent changes, contributing to difficulty in interpreting the current experiments comparing FM de-staining at different stimulation frequencies.

      We do not see how exchange of vesicles between synapses could be a problem for our logic, so cannot evaluate this without a more detailed description of the concern. Instead, our results rule out random inter-synaptic exchange between quickly and slowly mobilized reserve pools because this would show up in our assays as mixing, which does not occur. We think there are three remaining possibilities:

      1) vesicles are exchanged primarily between quickly mobilized reserve pools

      2) vesicles are exchanged primarily between slowly mobilized reserve pools

      3) vesicles in quickly mobilized reserve pools are targeted to quickly mobilized reserve pools in other synapses and vesicles in slowly mobilized reserve pools are targeted to slowly mobilized reserve pools in other synapses.

      It would be interesting to know which of these is correct, but this is outside the scope of the current study.

      Moreover, other forms of release, such as asynchronous release, contribute a large fraction of released vesicles, but are not factored in. Asynchronous release varies widely in synapse population from 0.1 to >0.4 of synchronous release, but is entirely ignored. Spontaneous release may also contribute to FM dye loss over extended 25min recordings used.

      Spontaneous release and asynchronous release are not caveats.

      First, spontaneous: We suspect that spontaneous release contributes to the background destaining rate, but this is 3-fold slower than the minimum during 1 Hz stimulation on average (Figs 2C, 3C, 5B etc), so we know that the slowly mobilized reserve is mobilized by low frequency trains of action potentials (lines 410-412). Note that a different outcome - where the rate of destaining decreased to a very low level during long trains of 1 Hz stimulation - would not have been consistent with the idea that slowly mobilized vesicles are only released spontaneously because the remaining fluorescence can always be destained rapidly by increasing the stimulation intensity to 20 Hz (e.g., see examples in Fig 3).

      Second, asynchronous: We know that slowly mobilized reserves must be released synchronously at 35C because the asynchronous component is eliminated at this temperature (Huson et al., 2019), without altering the quantity of slowly mobilized reserves that are mobilized by 1 Hz stimulation (lines 350-360 of Results, and 445-452 of Discussion; we can confirm from our own unpublished experiments that the disappearance of asynchronous release at 35C is a robust phenomenon in these cell cultures). Asynchronous release of slowly mobilized vesicles might occur at room temperature, but this would not argue against the conclusion that slowly mobilized vesicles are processed in parallel with quickly mobilized.

      Specific comments:

      Points 1-4 are already addressed above.

      5) The notion of the chained vesicles is somewhat confusing: how does the "first" vesicle located at the plasma membrane/release site get released if it is attached to the chain? Wouldn’t this "first" vesicle be non-immediately releasable since it must first be liberated? Since all vesicles shown in the Figure 1 have chains attached to them, what vesicle population then give rise to sub-millisecond release?

      This is not a concern relevant to the present study because none of the conclusions rely on the model in any way (see Introduction, and lines 382-386 of the Discussion). Beyond that: We previously published clear evidence that docked vesicles are tethered to non-docked vesicles (Figure 8 of Wesseling et al., 2019). We see no reason to suspect that a tether to an internal vesicle would prevent the docked vesicle from priming for release.

      7) Model: For fitting de-staining during 20 Hz stimulation, authors state that it was necessary to allow >5-fold Facilitation. This seems to be non-physiologically relevant, since previous studies found only very mild facilitation at room temperature (typically below a factor of 1.5-2.0) and the authors themselves state that, at most, a 1.3 fold facilitation was found.

      If the 1.3-fold facilitation estimate comes from us, it must have been in a different context.

      Most estimates of facilitation that are published are heavily convolved with simultaneous depression, and there is additionally a saturation mechanism for readily releasable vesicles with high release probability that is not widely known (Garcia-Perez and Wesseling, 2008). The standard method for eliminating the depression is to lower the probability of release by lowering extracellular [Ca2+], which additionally relieves occlusion by the saturation mechanism. And, lowering [Ca2+] uncovers an enormous amount of facilitation at synapses in hippocampal cell culture. For example, see Figure 2B of Stevens and Wesseling (1999), which shows a 7-fold enhancement during 9 Hz stimulation, and Figure 3 of the same study, which shows a linear relationship with frequency. Taken together these two results suggest 15-fold enhancement during 20 Hz stimulation, which far exceeds the 5-fold value needed at inefficient release sites to make our working model fit the FM-dye destaining results.
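The extrapolation to 20 Hz can be written out as a short calculation. This is only a sketch: the response states a 7-fold enhancement at 9 Hz and a linear relationship with frequency, but not how the line is anchored at 0 Hz, so the function below (a hypothetical helper) tries two assumed intercepts, one where enhancement is 1-fold at 0 Hz and one where the line passes through the origin.

```python
def extrapolate_enhancement(fold_ref, freq_ref, freq_target, baseline=1.0):
    """Linear extrapolation of fold enhancement, given the assumed value at 0 Hz."""
    slope = (fold_ref - baseline) / freq_ref
    return baseline + slope * freq_target

# Reference point taken from the text: 7-fold enhancement at 9 Hz.
through_one = extrapolate_enhancement(7.0, 9.0, 20.0, baseline=1.0)   # assumes 1-fold at 0 Hz
through_zero = extrapolate_enhancement(7.0, 9.0, 20.0, baseline=0.0)  # assumes strict proportionality

print(f"anchored at 1-fold: {through_one:.1f}-fold at 20 Hz")
print(f"anchored at origin: {through_zero:.1f}-fold at 20 Hz")
```

Under either assumption the extrapolated value lands near the 15-fold figure quoted above and well beyond the 5-fold facilitation required by the working model.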

      References

      Garcia-Perez E, Lo DC & Wesseling JF (2008). Kinetic isolation of a slowly recovering component of short-term depression during exhaustive use at excitatory hippocampal synapses. Journal of Neurophysiology 100, 781–95.

      Garcia-Perez E & Wesseling JF (2008). Augmentation controls the fast rebound from depression at excitatory hippocampal synapses. Journal of Neurophysiology 99, 1770–86.

      Huson V, van Boven MA, Stuefer A, Verhage M & Cornelisse LN (2019). Synaptotagmin-1 enables frequency coding by suppressing asynchronous release in a temperature dependent manner. Scientific Reports 9, 11341.

      Stevens CF & Wesseling JF (1999). Augmentation is a potentiation of the exocytotic process. Neuron 22, 139–46.

      Wesseling JF & Lo DC (2002). Limit on the role of activity in controlling the release-ready supply of synaptic vesicles. Journal of Neuroscience 22, 9708–20.

      Wesseling JF, Phan S, Bushong EA, Siksou L, Marty S, Pérez-Otaño I & Ellisman M (2019). Sparse force-bearing bridges between neighboring synaptic vesicles. Brain Structure and Function 224, 3263–3276.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Recommendations

      Recommendation #1: Address potential confounds in the experimental design:

      (1a) Confounding factors between baseline to early learning. While the visual display of the curved line remains constant, there are at least three changes between these two phases: 1) the presence of reward feedback (the focus of the paper); 2) a perturbation introduced to draw a hidden, mirror-symmetric curved line; 3) instructions provided to use reward feedback to trace the line on the screen (intentionally deceitful). As such, it remains unclear which of these factors are driving the changes in both behavior and bold signals between the two phases. The absence of a veridical feedback phase in which participants received reward feedback associated with the shown trajectory seems like a major limitation.

      (1b) Confounding Factors Between Early and Late Learning. While the authors have focused on interpreting changes from early to late due to the explore-exploit trade-off, there are three additional factors possibly at play: 1) increasing fatigue, 2) withdrawal of attention, specifically related to individuals who have either successfully learned the perturbation within the first few trials or those who have simply given up, or 3) increasing awareness of the perturbation (not clear if subjective reports about perturbation awareness were measured.). I understand that fMRI research is resource-intensive; however, it is not clear how to rule out these alternatives with their existing data without additional control groups. [Another reviewer added the following: Why did the authors not acquire data during a control condition? How can we be confident that the neural dynamics observed are not due to the simple passage of time? Or if these effects are due to the task, what drives them? The reward component, the movement execution, increased automaticity?]

      We have opted to address both of these points above within a single reply, as together they suggest potential confounding factors across the three phases of the task. We would agree that, if the results of our pairwise comparisons (e.g., Early > Baseline or Late > Early) were considered in isolation from one another, then these critiques of the study would be problematic. However, when considering the pattern of effects across the three task phases, we believe most of these critiques can be dismissed. Below, we first describe our results in this context, and then discuss how they address the reviewers’ various critiques.

      Recall that from Baseline to Early learning, we observe an expansion of several cortical areas (e.g., core regions in the DMN) along the manifold (red areas in Fig. 4A, see manifold shifts in Fig. 4C) that subsequently exhibit contraction during Early to Late learning (blue areas in Fig. 4B, see manifold shifts in Fig. 4D). We show this overlap in brain areas in Author response image 1 below, panel A. Notably, several of these brain areas appear to contract back to their original, Baseline locations along the manifold during Late learning (compare Fig. 4C and D). This is evidenced by the fact that many of these same regions (e.g., DMN regions, in Author response image 1 panel A below) fail to show a significant difference between the Baseline and Late learning epochs (see Author response image 1 panel B below, which is taken from supplementary Fig 6). That is, the regions that show significant expansion and subsequent contraction (in Author response image 1 panel A below) tend not to overlap with the regions that significantly changed over the time course of the task (in Author response image 1 panel B below).

      Author response image 1.

      Note that this basic observation above is not only true of our regional manifold eccentricity data, but also in the underlying functional connectivity data associated with individual brain regions. To make this second point clearer, we have modified and annotated our Fig. 5 and included it below. Note the reversal in seed-based functional connectivity from Baseline to Early learning (leftmost brain plots) compared to Early to Late learning (rightmost brain plots). That is, it is generally the case that for each seed-region (A-C) the areas that increase in seed-connectivity with the seed region (in red; leftmost plot) are also the areas that decrease in seed-connectivity with the seed region (in blue; rightmost plot), and vice versa. [Also note that these connectivity reversals are conveyed through the eccentricity data: the horizontal red line in the rightmost plots denotes the mean eccentricity of these brain regions during the Baseline phase, helping to highlight the fact that the eccentricity of the Late learning phase reverses back towards this Baseline level].

      Author response image 2.

      Critically, these reversals in brain connectivity noted above directly counter several of the critiques noted by the reviewers. For instance, this reversal pattern of effects argues against the idea that our results during Early Learning can be simply explained due to the (i) presence of reward feedback, (ii) presence of the perturbation or (iii) instructions to use reward feedback to trace the path on the screen. Indeed, all of these factors are also present during Late learning, and yet many of the patterns of brain activity during this time period revert back to the Baseline patterns of connectivity, where these factors are absent. Similarly, this reversal pattern strongly refutes the idea that the effects are simply due to the passage of time, increasing fatigue, or general awareness of the perturbation. Indeed, if any of these factors alone could explain the data, then we would have expected a gradual increase (or decrease) in eccentricity and connectivity from Baseline to Early to Late learning, which we do not observe. We believe these are all important points when interpreting the data, but which we failed to mention in our original manuscript when discussing our findings.

      We have now rectified this in the revised paper, where we now write in our Discussion:

      “Finally, it is important to note that the reversal pattern of effects noted above suggests that our findings during learning cannot be simply attributed to the introduction of reward feedback and/or the perturbation during Early learning, as both of these task-related features are also present during Late learning. In addition, these results cannot be simply explained due to the passage of time or increasing subject fatigue, as this would predict a consistent directional change in eccentricity across the Baseline, Early and Late learning epochs.”

      However, having said the above, we acknowledge that one potential factor that our findings cannot exclude is that they are (at least partially) attributable to changes in subjects’ state of attention throughout the task. Indeed, one can certainly argue that Baseline trials in our study don’t require a great deal of attention (after all, subjects are simply tracing a curved path presented on the screen). Likewise, for subjects that have learned the hidden shape, the Late learning trials are also likely to require limited attentional resources (indeed, many subjects at this point are simply producing the same shape trial after trial). Consequently, the large shift in brain connectivity that we observe from Baseline to Early Learning, and the subsequent reversion back to Baseline-levels of connectivity during Late learning, could actually reflect a heightened allocation of attention as subjects are attempting to learn the (hidden) rewarded shape. However, we do not believe that this would reflect a ‘confound’ of our study per se — indeed, any subject who has participated in a motor learning study would agree that the early learning phase of a task is far more cognitively demanding than Baseline trials and Late learning trials. As such, it is difficult to disentangle this ‘attention’ factor from the learning process itself (and in fact, it is likely central to it).

      Of course, one could have designed a ‘control’ task in which subjects must direct their attention to something other than the learning task itself (e.g., a divided attention paradigm; Taylor & Thoroughman, 2007, 2008) and/or perform a secondary task concurrently (Codol et al., 2018; Holland et al., 2018), but we know that this type of manipulation impairs the learning process itself. Thus, in such a case, it wouldn’t be obvious to the experimenter what they are actually measuring in brain activity during such a task. And, to extend this argument even further, it is true that any sort of brain-based modulation can be argued to reflect some ‘attentional’ process, rather than modulations related to the specific task-based process under consideration (in our case, motor learning). In this regard, we are sympathetic to the views of Richard Andersen and colleagues who have eloquently stated that “The study of how attention interacts with other neural processing systems is a most important endeavor. However, we think that over-generalizing attention to encompass a large variety of different neural processes weakens the concept and undercuts the ability to develop a robust understanding of other cognitive functions.” (Andersen & Cui, 2007, Neuron). In short, it appears that different fields/researchers have alternate views on the usefulness of attention as an explanatory construct (see also articles from Hommel et al., 2019, “No one knows what attention is”, and Wu, 2023, “We know what attention is!”), and we personally don’t have a dog in this fight. We only highlight these issues to draw attention (no pun intended) to the fact that it is not trivial to separate these different neural processes during a motor learning study.

      Nevertheless, we do believe these are important points worth flagging for the reader in our paper, as they might have similar questions. To this end, we have now included in our Discussion section the following text:

      “It is also possible that some of these task-related shifts in connectivity relate to shifts in task-general processes, such as changes in the allocation of attentional resources (Bédard and Song, 2013; Rosenberg et al., 2016) or overall cognitive engagement (Aben et al., 2020), which themselves play critical roles in shaping learning (Codol et al., 2018; Holland et al., 2018; Song, 2019; Taylor and Thoroughman, 2008, 2007; for a review of these topics, see Tsay et al., 2023). Such processes are particularly important during the earlier phases of learning when sensorimotor contingencies need to be established. While these remain questions for future work, our data nevertheless suggest that this shift in connectivity may be enabled through the PMC.”

      Finally, we should note that, at the end of testing, we did not assess participants' awareness of the manipulation (i.e., that they were, in fact, being rewarded based on a mirror image path). In hindsight, this would have been a good idea and provided some value to the current project. Nevertheless, it seems clear, based on several of the learning profiles observed (e.g., subjects who exhibited very rapid learning during the Early Learning phase, more on this below), that many individuals became aware of a shape approximating the rewarded path. Note that we have included new figures (see our responses below) that give a better example of what fast versus slower learning looks like. In addition, we now note in our Methods that we did not probe participants about their subjective awareness re: the perturbation:

      “Note that, at the end of testing, we did not assess participants’ awareness of the manipulation (i.e., that they were, in fact, being rewarded based on a mirror image path of the visible path).”

      Recommendation #2: Provide more behavioral quantification.

      (2a) The authors chose to only plot the average learning score in Figure 1D, without an indication of movement variability. I think this is quite important, to give the reader an impression of how variable the movements were at baseline, during early learning, and over the course of learning. There is evidence that baseline variability influences the 'detectability' of imposed rotations (in the case of adaptation learning), which could be relevant here. Shading the plots by movement variability would also be important to see if there was some refinement of the movement after participants performed at the ceiling (which seems to be the case ~ after trial 150). This is especially worrying given that in Fig 6A there is a clear indication that there is a large difference between subjects' solutions on the task. One subject exhibits almost a one-shot learning curve (reaching a score of 75 after one or two trials), whereas others don't seem to really learn until the near end. What does this between-subject variability mean for the authors' hypothesized neural processes?

      In line with these recommendations, we have now provided much better behavioral quantification of subject-level performance in both the main manuscript and supplementary material. For instance, in a new supplemental Figure 1 (shown below), we now include mean subject (+/- SE) reaction times (RTs), movement times (MTs) and movement path variability (our computation of these measures is now defined in our Methods section).

      As can be seen in the figure, all three of these variables tended to decrease over the course of the study, though we note there was a noticeable uptick in both RTs and MTs from the Baseline to Early learning phase, once subjects started receiving trial-by-trial reward feedback based on their movements. With respect to path variability, it is not obvious that there was a significant refinement of the paths created during late learning (panel D below), though there was certainly a general trend for path variability to decrease over learning.

      Author response image 3.

      Behavioral measures of learning across the task. (A-D) shows average participant reward scores (A), reaction times (B), movement times (C) and path variability (D) over the course of the task. In each plot, the black line denotes the mean across participants and the gray banding denotes +/- 1 SEM. The three equal-length task epochs for subsequent neural analyses are indicated by the gray shaded boxes.

      In addition to these above results, we have also created a new Figure 6 in the main manuscript, which now solely focuses on individual differences in subject learning (see below). Hopefully, this figure clarifies key features of the task and its reward structure, and also depicts (in movement trajectory space) what fast versus slow learning looks like in the task. Specifically, we believe that this figure now clearly delineates for the reader the mapping between movement trajectory and the reward score feedback presented to participants, which appeared to be a source of confusion based on the reviewers’ comments below. As can be clearly observed in this figure, trajectories that approximated the ‘visible path’ (black line) resulted in fairly mediocre scores (see score color legend at right), whereas trajectories that approximated the ‘reward path’ (dashed black line, see trials 191-200 of the fast learner) resulted in fairly high scores. This figure also more clearly delineates how fPCA loadings derived from our functional data analysis were used to derive subject-level learning scores (panel C).

      Author response image 4.

      Individual differences in subject learning performance. (A) Examples of a good learner (bordered in green) and poor learner (bordered in red). (B) Individual subject learning curves for the task. Solid black line denotes the mean across all subjects whereas light gray lines denote individual participants. The green and red traces denote the learning curves for the example good and poor learners denoted in A. (C) Derivation of subject learning scores. We performed functional principal component analysis (fPCA) on subjects’ learning curves in order to identify the dominant patterns of variability during learning. The top component, which encodes overall learning, explained the majority of the observed variance (~75%). The green and red bands denote the effect of positive and negative component scores, respectively, relative to mean performance. Thus, subjects who learned more quickly than average have a higher loading (in green) on this ‘Learning score’ component than subjects who learned more slowly (in red) than average. The plot at right denotes the loading for each participant (open circles) onto this Learning score component.
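
      For readers who want the idea behind the Learning score made concrete, the following is a minimal sketch, not the analysis code used in the study. It assumes that ordinary PCA on a subjects-by-trial-block matrix is an acceptable stand-in for fPCA (which would normally smooth each curve with a basis expansion first). The simulated curves, the function name `learning_scores`, and the sign convention are all hypothetical.

```python
import numpy as np

def learning_scores(curves):
    """PCA on a (subjects x trial-blocks) matrix of reward scores.

    Returns per-subject scores on the top component and the fraction of
    variance that component explains. The sign is flipped, if needed, so
    that higher scores mean above-average performance.
    """
    curves = np.asarray(curves, dtype=float)
    centered = curves - curves.mean(axis=0)          # remove the mean curve
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    component = vt[0]
    scores = u[:, 0] * s[0]
    if component.sum() < 0:                          # resolve the sign ambiguity
        component, scores = -component, -scores
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, explained

# Simulated learners (hypothetical data): saturating curves whose
# learning rate differs from subject to subject.
rng = np.random.default_rng(0)
blocks = np.arange(20)
rates = np.linspace(0.02, 0.5, 36)                   # slow -> fast learners
curves = 40 + 35 * (1 - np.exp(-np.outer(rates, blocks)))
curves += rng.normal(scale=2.0, size=curves.shape)

scores, explained = learning_scores(curves)
```

      In this toy setting, subjects with faster simulated learning rates receive higher scores on the top component, which mirrors the interpretation of the green and red bands in panel C.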

      The reviewers note that there are large individual differences in learning performance across the task. This was clearly our hope when designing the reward structure of this task, as it would allow us to further investigate the neural correlates of these individual differences (indeed, during pilot testing, we sought out a reward structure to the task that would allow for these intersubject differences). The subjects who learn early during the task end up having higher fPCA scores than the subjects who learn more gradually (or learn the task late). From our perspective, these differences are a feature, and not a bug, and they do not negate any of our original interpretations. That is, subjects who learn earlier on average tend to contract their DAN-A network during the early learning phase whereas subjects who learn more slowly on average (or learn late) instead tend to contract their DAN-A network during late learning (Fig. 7).

      (2b) In the methods, the authors stated that they scaled the score such that even a perfectly traced visible path would always result in an imperfect score of 40 points. What happens if a subject scores perfectly on the first try (which seemed to have happened for the green highlighted subject in Fig 6A), but is then permanently confronted with a score of 40 or below? Wouldn't this result in an error-clamp-like (error-based motor adaptation) design for this subject and all other high performers, which would vastly differ from the task demands for the other subjects? How did the authors factor in the wide between-subject variability?

      We think the reviewers may have misinterpreted the reward structure of the task, and we apologize for not being clearer in our descriptions. The reward score that subjects received after each trial was based on how well they traced the mirror image of the visible path. However, all the participant can see on the screen is the visible path. We hope that our inclusion of the new Figure 6 (shown above) makes the reward structure of the task, and its relationship to movement trajectories, much clearer. We should also note that, even for the highest performing subject (denoted in Fig. 6), it still required approximately 20 trials for them to reach asymptotic performance.

      (2c) The study would benefit from a more detailed description of participants' behavioral performance during the task. Specifically, it is crucial to understand how participants' motor skills evolve over time. Information on changes in movement speed, accuracy, and other relevant behavioral metrics would enhance the understanding of the relationship between behavior and brain activity during the learning process. Additionally, please clarify whether the display on the screen was presented continuously throughout the entire trial or only during active movement periods. Differences in display duration could potentially impact the observed differences in brain activity during learning.

      We hope that our inclusion of the new Supplementary Figure 1 (shown above) addresses the reviewers’ recommendation. Generally, we find that RTs, MTs and path variability all decrease over the course of the task. We think this reflects the early learning phase being more attentionally demanding, and requiring more conscious effort, than the later learning phases.

      Also, yes, the visible path was displayed on the screen continuously throughout the trial, and only disappeared at the 4.5 second mark of each trial (when the screen was blanked and the data was saved off for 1.5 seconds prior to commencement of the next trial; 6 seconds total per trial). Thus, there were no differences in display duration across trials and phases of the task. We have now clarified this in the Methods section, where we now write the following:

      “When the cursor reached the target distance, the target changed color from red to green to indicate that the trial was completed. Importantly, other than this color change in the distance marker, the visible curved path remained constant and participants never received any feedback about the position of their cursor.”

      (2d) It is unclear from plots 6A, 6B, and 1D how the scale of the behavioral data matches with the scaling of the scores. Are these the 'real' scores, meaning 100 on the y-axis would be equivalent to 40 in the task? Why then do all subjects reach an asymptote at 75? Or is 75 equivalent to 40 and the axis labels are wrong?

      As indicated above, we clearly did a poor job of describing the reward structure of our task in our original paper, and we now hope that our inclusion of Figure 6 makes things clear. A ‘40’ score on the y-axis would indicate that a subject has perfectly traced the visible path whereas a perfect ‘100’ score would indicate that a subject has perfectly traced the (hidden) mirror image path.

      The fact that several of the subjects reach asymptote around 75 is likely a byproduct of two factors. Firstly, the subjects performed their movements in the absence of any visual error feedback (they could not see the position of a cursor that represented their hand position), which had the effect of increasing motor variability in their actions from trial to trial. Secondly, there appears to be an underestimation among subjects regarding the curvature of the concealed, mirror-image path (i.e., that the rewarded path actually had an equal but opposite curvature to that of the visible path). This is particularly evident in the case of the top-performing subject (illustrated in Figure 6A) who, even during late learning, failed to produce a completely arched movement.

      (2e) Labeling of Contrasts: There is a consistent issue with the labeling of contrasts in the presented figures, causing confusion. While the text refers to the difference as "baseline to early learning," the label used in figures, such as Figure 4, reads "baseline > early." It is essential to clarify whether the presented contrast is indeed "baseline > early" or "early > baseline" to avoid any misinterpretation.

      We thank the reviewers for catching this error. Indeed, the intended label was Early > Baseline, and this has now been corrected throughout.

      Recommendation #3. Clarify which motor learning mechanism(s) are at play.

      (3a) Participants were performing at a relatively low level, achieving around 50-60 points by the end of learning. This outcome may not be that surprising, given that reward-based learning might have a substantial explicit component and may also heavily depend on reasoning processes, beyond reinforcement learning or contextual recall (Holland et al., 2018; Tsay et al., 2023). Even within our own data, where explicit processes are isolated, average performance is low and many individuals fail to learn (Brudner et al., 2016; Tsay et al., 2022). Given this, many participants in the current study may have simply given up. A potential indicator of giving up could be a subset of participants moving straight ahead in a rote manner (a heuristic to gain moderate points). Consequently, alterations in brain networks may not reflect exploration and exploitation strategies but instead indicate levels of engagement and disengagement. Could the authors plot the average trajectory and the average curvature changes throughout learning? Are individuals indeed defaulting to moving straight ahead in learning, corresponding to an average of 50-60 points? If so, the interpretation of brain activity may need to be tempered.

      We can do one better, and actually give you a sense of the learning trajectories for every subject over time. In the figure below, which we now include as Supplementary Figure 2 in our revision, we have plotted, for each subject, a subset of their movement trajectories across learning trials (every 10 trials). As can be seen in the diversity of these trajectories, the average trajectory and average curvature would do a fairly poor job of describing the pattern of learning-related changes across subjects. Moreover, it is not obvious from looking at these plots the extent to which poor learning subjects (i.e., subjects who never converge on the reward path) actually ‘give up’ in the task — rather, many of these subjects still show some modulation (albeit minor) of their movement trajectories in the later trials (see the purple and pink traces). As an aside, we are also not entirely convinced that straight ahead movements, which we don’t find many of in our dataset, can be taken as direct evidence that the subject has given up.

      Author response image 5

      Variability in learning across subjects. Plots show representative trajectory data from each subject (n=36) over the course of the 200 learning trials. Coloured traces show individual trials over time (each trace is separated by ten trials, e.g., trial 1, 10, 20, 30, etc.) to give a sense of the trajectory changes throughout the task (20 trials in total are shown for each subject).

      We should also note that we are not entirely opposed to the idea of describing aspects of our findings in terms of subject engagement versus disengagement over time, as such processes are related at some level to exploration (i.e., cognitive engagement in finding the best solution) and exploitation (i.e., cognitively disengaging and automating one’s behavior). As noted in our reply to Recommendation #1 above, we now give some consideration of these explanations in our Discussion section, where we now write:

      “It is also possible that these task-related shifts in connectivity relate to shifts in task-general processes, such as changes in the allocation of attentional resources (Bédard and Song, 2013; Rosenberg et al., 2016) or overall cognitive engagement (Aben et al., 2020), which themselves play critical roles in shaping learning (Codol et al., 2018; Holland et al., 2018; Song, 2019; Taylor and Thoroughman, 2008, 2007; for a review of these topics, see Tsay et al., 2023). Such processes are particularly important during the earlier phases of learning when sensorimotor contingencies need to be established. While these remain questions for future work, our data nevertheless suggest that this shift in connectivity may be enabled through the PMC.”

      (3b) The authors are mixing two commonly used paradigms, reward-based learning, and motor adaptation, but provide no discussion of the different learning processes at play here. Which processes were they attempting to probe? Making this explicit would help the reader understand which brain regions should be implicated based on previous literature. As it stands, the task is hard to interpret. Relatedly, there is a wealth of literature on explicit vs implicit learning mechanisms in adaptation tasks now. Given that the authors are specifically looking at brain structures in the cerebral cortex that are commonly associated with explicit and strategic learning rather than implicit adaptation, how do the authors relate their findings to this literature? Are the learning processes probed in the task more explicit, more implicit, or is there a change in strategy usage over time? Did the authors acquire data on strategies used by the participants to solve the task? How does the baseline variability come into play here?

      As noted in our paper, our task was directly inspired by the reward-based motor learning tasks developed by Dam et al., 2013 (PLoS ONE) and Wu et al., 2014 (Nature Neuroscience). What drew us to these tasks is that they allowed us to study the neural bases of reward-based learning mechanisms in the absence of subjects also being able to exploit error-based mechanisms to achieve learning. Indeed, when first describing the task in the Results section of our paper we wrote the following:

      “Importantly, because subjects received no visual feedback about their actual finger trajectory and could not see their own hand, they could only use the score feedback — and thus only reward-based learning mechanisms — to modify their movements from one trial to the next (Dam et al., 2013; Wu et al., 2014).”

      If the reviewers are referring to ‘motor adaptation’ in the context in which that terminology is commonly used — i.e., the use of sensory prediction errors to support error-based learning — then we would argue that motor adaptation is not a feature of the current study. It is true that in our study subjects learn to ‘adapt’ their movements across trials, but this shaping of the movement trajectories must be supported through reinforcement learning mechanisms (and, of course, supplemented by the use of cognitive strategies as discussed in the nice review by Tsay et al., 2023). We apologize for not being clearer in our paper about this key distinction and we have now included new text in the introduction to our Results to directly address this:

      “Importantly, because subjects received no visual feedback about their actual finger trajectory and could not see their own hand, they could only use the score feedback — and thus only reward-based learning mechanisms — to modify their movements from one trial to the next (Dam et al., 2013; Wu et al., 2014). That is, subjects could not use error-based learning mechanisms to achieve learning in our study, as this form of learning requires sensory errors that convey both the change in direction and magnitude needed to correct the movement.”

      With this issue aside, we are well aware of the established framework for thinking about sensorimotor adaptation as being composed of a combination of explicit and implicit components (indeed, this has been a central feature of several of our other recent neuroimaging studies that have explored visuomotor rotation learning, e.g., Gale et al., 2022 PNAS, Areshenkoff et al., 2022 eLife, Standage et al., 2023 Cerebral Cortex). However, there has been comparatively little work done on these parallel components within the domain of reinforcement learning tasks (though see Codol et al., 2018; Holland et al., 2018; van Mastrigt et al., 2023; see also the Tsay et al., 2023 review), and as far as we can tell, nothing has been done to date in the reward-based motor learning area using fMRI. By design, we avoided using descriptors of ‘explicit’ or ‘implicit’ in our study because our experimental paradigm did not allow a separate measurement of those two components of learning during the task. Nevertheless, it seems clear to us from examining the subjects’ learning curves (see Supplementary Figure 2 above) that individuals who learn very quickly are using strategic processes (such as action exploration to identify the best path) to enhance their learning. As we noted in an above response, we did not query subjects after the fact about their strategy use, which admittedly was a missed opportunity on our part.

      Author response image 6.

      With respect to the comment on baseline variability and its relationship to performance, this is an interesting idea and one that was explored in the Wu et al., 2014 Nature Neuroscience paper. Prompted by the reviewers, we have now explored this idea in the current data set by testing for a relationship between movement path variability during baseline trials (all 70 baseline trials, see Supplementary Figure 1D above for reference) and subjects’ fPCA score on our learning task. However, when we performed this analysis, we did not observe a significant positive relationship between baseline variability and subject performance. Rather, we actually found a trend towards a negative relationship (though this was non-significant; r=-0.2916, p=0.0844). Admittedly, we are not sure what conclusions can be drawn from this analysis, and in any case, we believe it to be tangential to our main results. We provide the results (at right) for the reviewers if they are interested. This may be an interesting avenue for exploration in future work.
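
      The reported statistics can be sanity-checked with a short, self-contained calculation. The sketch below converts a Pearson correlation into a t statistic and a two-tailed p-value by numerically integrating the Student's t density. It assumes that all n = 36 subjects entered the correlation, which the text does not state explicitly, and the function names are hypothetical. Under that assumption the result should land close to the reported p = 0.0844.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def pearson_p_value(r, n, steps=2000):
    """Return (|t|, two-tailed p) for a Pearson correlation r from n pairs."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule (steps must be even) for the mass between 0 and |t|.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    return t, 1.0 - 2.0 * central

# n = 36 is an assumption taken from the trajectory figure caption.
t_stat, p = pearson_p_value(r=-0.2916, n=36)
```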

      Recommendation #4: Provide stronger justification for brain imaging methods.

      (4a) Observing how brain activity varies across these different networks is remarkable, especially how sensorimotor regions separate and then contract with other, more cognitive areas. However, does the signal-to-noise ratio in each area/network influence manifold eccentricity and limit the possible changes in eccentricity during learning? Specifically, if a region has a low signal-to-noise ratio, it might exhibit minimal changes during learning (a phenomenon perhaps relevant to null manifold changes in the striatum due to low signal-to-noise); conversely, regions with higher signal-to-noise (e.g., motor cortex in this sensorimotor task) might exhibit changes more easily detected. As such, it is unclear how to interpret manifold changes without considering an area/network's signal-to-noise ratio.

      We appreciate where these concerns are coming from. First, we should note that the timeseries data used in our analysis were z-transformed (mean zero, 1 std) to allow normalization of the signal both over time and across regions (and thus mitigate the possibility that the changes observed could simply reflect mean overall signal changes across different regions). Nevertheless, differences in signal intensity across brain regions — particularly between cortex and striatum — are well-known, though it is not obvious how these differences may manifest in terms of a task-based modulation of MR signals.
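
      As a brief illustration of what the z-transform does, the sketch below standardizes two simulated regional timeseries that differ in raw mean and variance. The values and the function name `zscore_timeseries` are hypothetical and chosen only for illustration.

```python
import numpy as np

def zscore_timeseries(ts):
    """Z-score each region's timeseries (rows = timepoints, columns = regions)."""
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean(axis=0)) / ts.std(axis=0)

rng = np.random.default_rng(1)
# Two hypothetical regions with different raw means and variances (a.u.).
raw = np.column_stack([
    rng.normal(625.0, 5.0, size=200),   # higher mean, larger fluctuations
    rng.normal(450.0, 2.5, size=200),   # lower mean, smaller fluctuations
])
z = zscore_timeseries(raw)              # every column now has mean 0, std 1
```

      After this step, differences in overall signal level or scale between regions no longer appear in the data, although differences in the underlying signal quality can of course remain.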

      To examine this issue in the current data set, we extracted, for each subject and time epoch (Baseline, Early and Late learning), the raw scanner data (in MR arbitrary units, a.u.) for the cortical and striatal regions and computed the (1) mean signal intensity, (2) standard deviation of the signal (Std) and (3) temporal signal-to-noise ratio (tSNR; calculated as mean/Std). Note that in the fMRI connectivity literature tSNR is often the preferred SNR measure as it normalizes the mean signal based on the signal’s variability over time, thus providing a general measure of overall ‘signal quality’. The results of this analysis, averaged across subjects and regions, are shown below.
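
      The tSNR definition above (mean/Std over time) can be written in a few lines. This is a minimal sketch on made-up numbers, with a hypothetical function name. It shows that a region with the higher raw mean can still have the lower tSNR if its signal fluctuates proportionally more.

```python
import numpy as np

def tsnr(ts):
    """Temporal SNR per region: mean over time divided by std over time."""
    ts = np.asarray(ts, dtype=float)
    return ts.mean(axis=0) / ts.std(axis=0)

t = np.arange(100)
wiggle = np.where(t % 2 == 0, 1.0, -1.0)   # alternating +1/-1: mean 0, std 1
raw = np.column_stack([
    600 + 6 * wiggle,                      # region A: mean 600, std 6
    400 + 2 * wiggle,                      # region B: mean 400, std 2
])
values = tsnr(raw)                         # region A: 100, region B: 200
```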

      Author response image 7.

      Note that, as expected, the overall signal intensity (left plot) of cortex is higher than in the striatum, reflecting the closer proximity of cortex to the receiver coils in the MR head coil. In fact, the signal intensity in cortex is approximately 38% higher than that in the striatum (i.e., (~625 - 450)/450). The signal variation in cortex is also greater than in the striatum (middle plot), but in this case approximately 100% greater (i.e., (~5 - 2.5)/2.5). The result of this is that the tSNR (mean/Std) for our data set and the ROI parcellations we used is actually greater in the striatum than in cortex (right plot). Thus, all else being equal, there seems to have been sufficient tSNR in the striatum for us to have detected motor-learning related effects. As such, we suspect the null effects for the striatum in our study actually stem from two sources.

      The first likely source is the relatively lower number of striatal regions (12) as compared to cortical regions (998) used in our analysis, coupled with our use of PCA on these data (which, by design, identifies the largest sources of variation in connectivity). In future studies, this unbalance could be rectified by using finer parcellations of the striatum (even down to the voxel level) while keeping the same parcellation of cortex (i.e., equate the number of ‘regions’ in each of striatum and cortex). The second likely source is our use of a striatal atlas (the Harvard-Oxford atlas) that divides brain regions based on their neuroanatomy rather than their function. In future work, we plan on addressing this latter concern by using finer, more functionally relevant parcellations of striatum (such as in Tian et al., 2020, Nature Neuroscience). Note that we sought to capture these interrelated possible explanations in our Discussion section, where we wrote the following:

      “While we identified several changes in the cortical manifold that are associated with reward-based motor learning, it is noteworthy that we did not observe any significant changes in manifold eccentricity within the striatum. While clearly the evidence indicates that this region plays a key role in reward-guided behavior (Averbeck and O’Doherty, 2022; O’Doherty et al., 2017), there are several possible reasons why our manifold approach did not identify this collection of brain areas. First, the relatively small size of the striatum may mean that our analysis approach was too coarse to identify changes in the connectivity of this region. Though we used a 3T scanner and employed a widely-used parcellation scheme that divided the striatum into its constituent anatomical regions (e.g., hippocampus, caudate, etc.), both of these approaches may have obscured important differences in connectivity that exist within each of these regions. For example, areas such as the hippocampus and caudate are not homogeneous areas but themselves exhibit gradients of connectivity (e.g., head versus tail) that can only be revealed at the voxel level (Tian et al., 2020; Vos de Wael et al., 2021). Second, while our dimension reduction approach, by design, aims to identify gradients of functional connectivity that account for the largest amounts of variance, the limited number of striatal regions (as compared to cortex) necessitates that their contribution to the total whole-brain variance is relatively small. Consistent with this perspective, we found that the low-dimensional manifold architecture in cortex did not strongly depend on whether or not striatal regions were included in the analysis (see Supplementary Fig. 6). As such, selective changes in the patterns of functional connectivity at the level of the striatum may be obscured using our cortex x striatum dimension reduction approach.
      Future work can help address some of these limitations by using both finer parcellations of striatal cortex (perhaps even down to the voxel level) (Tian et al., 2020) and by focusing specifically on changes in the interactions between the striatum and cortex during learning. The latter can be accomplished by selectively performing dimension reduction on the slice of the functional connectivity matrix that corresponds to functional coupling between striatum and cortex.”
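
      The idea of performing dimension reduction on only the striatum-by-cortex slice of the connectivity matrix can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: it assumes the 998 cortical regions are ordered before the 12 striatal regions, it uses a plain correlation matrix of simulated data, and it substitutes ordinary PCA (via SVD) for whatever manifold-learning step would actually be applied. The function name is hypothetical.

```python
import numpy as np

n_cortex, n_striatum = 998, 12

def striatum_cortex_components(fc, n_components=3):
    """PCA on the striatum-by-cortex block of a region-by-region matrix.

    Assumes cortical regions come first, followed by striatal regions.
    """
    block = fc[n_cortex:, :n_cortex]                 # (12, 998) coupling block
    block = block - block.mean(axis=0)               # center across striatal regions
    u, s, vt = np.linalg.svd(block, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # striatal-region embedding
    return scores, vt[:n_components]

rng = np.random.default_rng(2)
ts = rng.normal(size=(300, n_cortex + n_striatum))   # simulated timeseries
fc = np.corrcoef(ts, rowvar=False)                   # (1010, 1010) connectivity
scores, loadings = striatum_cortex_components(fc)
```

      Because only the 12 x 998 block enters the decomposition, the components describe variation in striatal coupling with cortex and are not dominated by the much larger cortex-by-cortex portion of the matrix.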

      (4b) Could the authors clarify how activity in the dorsal attention network (DAN) changes throughout learning, and how these changes also relate to individual differences in learning performance? Specifically, on average, the DAN seems to expand early and contract late, relative to the baseline. This is interpreted to signify that the DAN exhibits lesser connectivity followed by greater connectivity with other brain regions. However, in terms of how these changes relate to behavior, participants who go against the average trend (DAN exhibits more contraction early in learning, and expansion from early to late) seem to exhibit better learning performance. This finding is quite puzzling. Does this mean that the average trend of expansion and contraction is not facilitative, but rather detrimental, to learning? [Another reviewer added: The authors do not state any explicit hypotheses, but only establish that DMN coordinates activity among several regions. What predictions can we derive from this? What are the authors looking for in the data? The work seems more descriptive than hypothesis-driven. This is fine but should be clarified in the introduction.]

      These are good questions, and we are glad the reviewers appreciated the subtlety here. The reviewers are indeed correct that the relationship of the DAN-A network to behavioral performance appears to go against the grain of the group-level results that we found for the entire DAN network (which we note is composed of both the DAN-A and DAN-B networks). That is, subjects who exhibited greater contraction from Baseline to Early learning and likewise, greater expansion from Early to Late learning, tended to perform better in the task (according to our fPCA scores). However, on this point it is worth noting that it was mainly the DAN-B network which exhibited group-level expansion from Baseline to Early Learning whereas the DAN-A network exhibited negligible expansion. This can be seen in Author response image 8 below, which shows the pattern of expansion and contraction (as in Fig. 4), but instead broken down into the 17-network parcellation. The red asterisk denotes the expansion from Baseline to Early learning for the DAN-B network, which is much greater than that observed for the DAN-A network (which is basically around the zero difference line).

      Author response image 8.

      Thus, it appears that the DAN-A and DAN-B networks are modulated to a different extent during the task, which likely contributes to the perceived discrepancy between the group-level effects (reported using the 7-network parcellation) and the individual differences effects (reported using the finer 17-network parcellation). Based on the reviewers’ comments, this seems like an important distinction to clarify in the manuscript, and we have now described this nuance in our Results section where we now write:

      “...Using this permutation testing approach, we found that it was only the change in eccentricity of the DAN-A network that correlated with Learning score (see Fig. 7C), such that the more the DAN-A network decreased in eccentricity from Baseline to Early learning (i.e., contracted along the manifold), the better subjects performed at the task (see Fig. 7C, scatterplot at right). Consistent with the notion that changes in the eccentricity of the DAN-A network are linked to learning performance, we also found the inverse pattern of effects during Late learning, whereby the more that this same network increased in eccentricity from Early to Late learning (i.e., expanded along the manifold), the better subjects performed at the task (Fig. 7D). We should note that this pattern of performance effects for the DAN-A — i.e., greater contraction during Early learning and greater expansion during Late learning being associated with better learning — appears at odds with the group-level effects described in Fig. 4A and B, where we generally find the opposite pattern for the entire DAN network (composed of the DAN-A and DAN-B subnetworks). However, this potential discrepancy can be explained when examining the changes in eccentricity using the 17-network parcellation (see Supplementary Figure 8). At this higher resolution level we find that these group-level effects for the entire DAN network are being largely driven by eccentricity changes in the DAN-B network (areas in anterior superior parietal cortex and premotor cortex), and not by mean changes in the DAN-A network. By contrast, our present results suggest that it is the contraction and expansion of areas of the DAN-A network (and not DAN-B network) that are selectively associated with differences in subject learning performance.”

      Finally, re: the reviewers’ comments that we do not state any explicit hypotheses etc., we acknowledge that, beyond our general hypothesis stated at the outset about the DMN being involved in reward-based motor learning, our study is quite descriptive and exploratory in nature. So little work has been done in this research area (i.e., using manifold learning approaches to study motor learning with fMRI) that it would be disingenuous to have any stronger hypotheses than those stated in our Introduction. Thus, to make the exploratory nature of our study clear to the reader, we have added the following text (in red) to our Introduction:

      “Here we applied this manifold approach to explore how brain activity across widely distributed cortical and striatal systems is coordinated during reward-based motor learning. We were particularly interested in characterizing how connectivity between regions within the DMN and the rest of the brain changes as participants shift from learning the relationship between motor commands and reward feedback, during early learning, to subsequently using this information, during late learning. We were also interested in exploring whether learning-dependent changes in manifold structure relate to variation in subject motor performance.”

      We hope these changes now make the intention of our study clear.

      (4c) The paper examines a type of motor adaptation task with a reward-based learning component. This, to me, strongly implicates the cerebellum, given that it has a long-established crucial role in adaptation and has recently been implicated in reward-based learning (see work by Wagner & Galea). Why is there no mention of the cerebellum and why it was left out of this study? Especially given that the authors state in the abstract they examine cortical and subcortical structures. It's evident from the methods that the authors did not acquire data from the cerebellum or had too small a FOV to fully cover it (34 slices at 4 mm thickness = 136 mm, which is likely a bit short to fully cover the cerebellum in many participants). What was the rationale behind this methodological choice? It would be good to clarify this for the reader. Related to this, the authors need to rephrase their statements on 'whole-brain' connectivity matrices or analyses - it is not whole-brain when it excludes the cerebellum.

      As we noted above, we do not believe this task to be a motor adaptation task, in the sense that subjects are not able to use sensory prediction errors (and thus error-based learning mechanisms) to improve their performance. Rather, by denying subjects this sensory error feedback they are only able to use reinforcement learning processes, along with cognitive strategies (nicely covered in Tsay et al., 2023), to improve performance. Nevertheless, we recognize that the cerebellum has been increasingly implicated in facets of reward-based learning, particularly within the rodent domain (e.g., Wagner et al., 2017; Heffley et al., 2018; Kostadinov et al., 2019, etc.). In our study, we did indeed collect data from the cerebellum but did not include it in our original analyses, as we wanted (1) the current paper to build on prior work in the human and macaque reward-learning domain (which focuses solely on striatum and cortex, and which rarely discusses cerebellum, see Averbeck & O’Doherty, 2022 & Klein-Flugge et al., 2022 for recent reviews), and (2) the cerebellum to be a more targeted focus of future work (specifically we plan on focusing on striatal-cerebellar interactions during learning, which are hypothesized based on the neuroanatomical tract tracing work of Bostan and Strick, etc.). We hope the reviewers respect our decisions in this regard.

      Nevertheless, we acknowledge that our statements about ‘whole-brain’ connectivity, and our vagueness about what we mean by ‘subcortex,’ may be confusing for the reader. We have now removed and/or corrected such references throughout the paper (however, note that in some cases it is difficult to avoid reference to “whole-brain”, e.g., “whole-brain correlation map” or “whole-brain false discovery rate correction”, which is standard terminology in the field).

      In addition, we are now explicit in our Methods section that the cerebellum was not included in our analyses.

      “Each volume comprised 34 contiguous (no gap) oblique slices acquired at a ~30° caudal tilt with respect to the plane of the anterior and posterior commissure (AC-PC), providing whole-brain coverage of the cerebrum and cerebellum. Note that for the current study, we did not examine changes in cerebellar activity during learning.”

      (4d) The authors centered the matrices before further analyses to remove variance associated with the subject. Why not run a PCA on the connectivity matrices and remove the PC that is associated with subject variance? What is the advantage of first centering the connectivity matrices? Is this standard practice in the field?

      Centering in some form has become reasonably common in the functional connectivity literature, as there is considerable evidence that task-related (or cognitive) changes in whole-brain connectivity are dwarfed by static, subject-level differences (e.g., Gratton et al., 2018, Neuron). If covariance matrices were ordinary scalar values, then isolating task-related changes could be accomplished simply by subtracting a baseline scan or mean score; but because the space of covariance matrices is non-Euclidean, the actual computations involved in this subtraction are more complex (see our Methods). Fundamentally (and conceptually), however, our procedure is simply ordinary mean-centering, adapted to this non-Euclidean space. Despite the added complexity, there is considerable evidence that such computations, adapted directly to the geometry of the space of covariance matrices, outperform simpler methods that treat covariance matrices as arrays of real numbers (e.g., naive subtraction; see Dodero et al. & Ng et al., references below). Moreover, our previous work has found that this procedure works quite well to isolate changes associated with different task conditions (Areshenkoff et al., 2021, Neuroimage; Areshenkoff et al., 2022, eLife).

      Although PCA can be adapted to work well with covariance-matrix-valued data, it would at best be a less direct solution than simply subtracting subjects' mean connectivity. This is because the top components from applying PCA would be dominated both by subject-specific effects (not of interest here) and by the large-scale connectivity structure typically observed in component-based analyses of whole-brain connectivity (i.e., the principal gradient), whereas changes associated with task condition (the thing of interest here) would be buried among the less reliable components. By contrast, our procedure directly isolates these task changes.

      References cited above:

      Dodero, L., Minh, H. Q., San Biagio, M., Murino, V., & Sona, D. (2015, April). Kernel-based classification for brain connectivity graphs on the Riemannian manifold of positive definite matrices. In 2015 IEEE 12th international symposium on biomedical imaging (ISBI) (pp. 42-45). IEEE.

      Ng, B., Dressler, M., Varoquaux, G., Poline, J. B., Greicius, M., & Thirion, B. (2014). Transport on Riemannian manifold for functional connectivity-based classification. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2014: 17th International Conference, Boston, MA, USA, September 14-18, 2014, Proceedings, Part II 17 (pp. 405-412). Springer International Publishing.

      (4e) Seems like a missed opportunity that the authors just use a single, PCA-derived measure to quantify learning, where multiple measures could have been of interest, especially given that the introduction established some interesting learning-related concepts related to exploration and exploitation, which could be conceptualized as movement variability and movement accuracy. It is unclear why the authors designed a task that was this novel and interesting, drawing on several psychological concepts, but then chose to ignore these concepts in the analysis.

      We were disappointed to hear that the reviewers did not appreciate our functional PCA-derived measure to quantify subject learning. This is a novel data-driven analysis approach that we have previously used with success in recent work (e.g., Areshenkoff et al., 2022, eLife) and, from our perspective, we thought it was quite elegant that we were able to describe the entire trajectory of learning across all participants along a single axis that explained the majority (~75%) of the variance in the patterns of behavioral learning data. Moreover, the creation of a single behavioral measure per participant (what we call a ‘Learning score’, see Fig. 6C) helped simplify our brain-behavior correlation analyses considerably, as it provided a single measure that accounts for the natural auto-correlation in subjects’ learning curves (i.e., that subjects who learn quickly also tend to be better overall learners by the end of the learning phase). It also avoids the difficulty (and sometimes arbitrariness) of having to select specific trial bins for behavioral analysis (e.g., choosing the first 5, 10, 20 or 25 trials as a measure of ‘early learning’, and so on). Of course, one of the major alternatives to our approach would have involved fitting an exponential to each subject’s learning curve and taking measures like learning rate, but in our experience these types of models don’t always fit well, or yield robust/reliable parameters at the individual subject level. To strengthen the motivation for our approach, we have now included the following text in our Results:

      “To quantify this variation in subject performance in a manner that accounted for the auto-correlation in learning performance over time (i.e., subjects who learned more quickly tend to exhibit better performance by the end of learning), we opted for a pure data-driven approach and performed functional principal component analysis (fPCA; Shang, 2014) on subjects’ learning curves. This approach allowed us to isolate the dominant patterns of variability in subjects’ learning curves over time (see Methods for further details; see also Areshenkoff et al., 2022).”
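
      As a rough illustration of how a single score per subject can be read off a set of learning curves, the sketch below runs an ordinary PCA on discretized curves (subjects by trial bins). This is a simplified stand-in: functional PCA as reviewed by Shang (2014) typically also represents the curves with smooth basis functions, and the function name `learning_scores` and the array layout are assumptions made here for illustration.

```python
import numpy as np

def learning_scores(curves):
    """curves: (n_subjects, n_bins) array of performance over time.

    Returns each subject's score on the first principal component and the
    fraction of between-subject variance that component explains.
    """
    X = curves - curves.mean(axis=0)           # remove the group-average curve
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    pc1 = Vt[0]
    if pc1.sum() < 0:                           # resolve the arbitrary SVD sign
        pc1 = -pc1
    scores = X @ pc1                            # one number per subject
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, explained
```

      Because the whole curve contributes to the score, no choice of ‘early’ or ‘late’ trial bins is needed, which is the practical advantage described in the response.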

      In any case, the reviewers may be pleased to hear that in current work in the lab we are using more model-based approaches to attempt to derive sets of parameters (per participant) that relate to some of the variables of interest described by the reviewers, but that we relate to much more dynamical (shorter-term) changes in brain activity.

      (4f) Overall Changes in Activity: The manuscript should delve into the potential influence of overall changes in brain activity on the results. The choice of using Euclidean distance as a metric for quantifying changes in connectivity is sensitive to scaling in overall activity. Therefore, it is crucial to discuss whether activity in task-relevant areas increases from baseline to early learning and decreases from early to late learning, or if other patterns emerge. A comprehensive analysis of overall activity changes will provide a more complete understanding of the findings.

      These are good questions and we are happy to explore this in the data. However, as mentioned in our response to query 4a above, it is important to note that the timeseries data for each brain region was z-scored prior to analysis, with the aim of removing any mean changes in activity levels (note that this is a standard preprocessing step when performing functional connectivity analysis, given that mean signal changes are not the focus of interest in functional connectivity analyses).
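
      The point that z-scoring removes mean (and scale) differences without affecting correlation-based connectivity can be checked with a small sketch. The helper names `zscore` and `pearson` and the toy numbers are illustrative assumptions, not the authors' preprocessing code.

```python
from statistics import mean, pstdev

def zscore(x):
    """Rescale a time series to zero mean and unit (population) variance."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]

def pearson(x, y):
    """Pearson correlation as the mean product of z-scored values."""
    zx, zy = zscore(x), zscore(y)
    return mean([a * b for a, b in zip(zx, zy)])

region_a = [1.0, 2.0, 4.0, 3.0, 5.0]
region_b = [2.0, 1.0, 3.0, 5.0, 4.0]
# Same signal as region_a, but with a higher overall activity level and gain
region_a_shifted = [10.0 + 3.0 * v for v in region_a]

r_original = pearson(region_a, region_b)
r_shifted = pearson(region_a_shifted, region_b)
```

      In this toy case `r_original` and `r_shifted` agree up to floating-point error, so a uniform change in a region's activity level or gain does not by itself alter its correlation with other regions.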

      To further emphasize these points, we have taken our z-scored timeseries data and calculated the mean signal for each region within each task epoch (Baseline, Early and Late learning, see panel A in the figure below). The point of showing this data (where each z-score map looks near identical across the top, middle and bottom plots) is to demonstrate just how minuscule the mean signal changes are in the z-scored timeseries data. This point can also be observed when plotting the mean z-score signal across regions for each epoch (see panel B in the figure below). Here we find that Baseline and Early learning have a near identical mean activation level across regions (albeit with slightly different variability across subjects), whereas there is a slight increase during Late learning, though it should be noted that our y-axis, which is scaled in thousandths, greatly magnifies this effect.

      To more directly address the reviewers’ comments, using the z-score signal per region we have also performed the same statistical pairwise comparisons (Early > Baseline and Late>Early) as we performed in the main manuscript Fig. 4 (see panel C in Author response image 9 below). In this plot, areas in red denote an increase in activity from Baseline to Early learning (top plot) and from Early to Late learning (bottom plot), whereas areas in blue denote a decrease for those same comparisons. The important thing to emphasize here is that the spatial maps resulting from this analysis are generally quite different from the maps of eccentricity that we report in Fig. 4 in our paper. For instance, in the figure below, we see significant changes in the activity of visual cortex between epochs but this is not found in our eccentricity results (compare with Fig. 4). Likewise, in our eccentricity results (Fig. 4), we find significant changes in the manifold positioning of areas in medial prefrontal cortex (MPFC), but this is not observed in the activation levels of these regions (panel C below). Again, we are hesitant to make too much of these results, as the activation differences denoted as significant in the figure below are likely to be an effect on the order of thousandths of a z-score (e.g., 0.002 > 0.001), but this hopefully assuages reviewers’ concerns that our manifold results are solely attributable to changes in overall activity levels.

      We are hesitant to include the results below in our paper as we feel that they don’t add much to the interpretation (as the purpose of z-scoring was to remove large activation differences). However, if the reviewers strongly believe otherwise, we would consider including them in the supplement.

      Author response image 9.

      Examination of overall changes in activity across regions. (A) Mean z-score maps across subjects for the Baseline (top), Early Learning (middle) and Late learning (bottom) epochs. (B) Mean z-score across brain regions for each epoch. Error bars represent +/- 1 SEM. (C) Pairwise contrasts of the z-score signal between task epochs. Positive (red) and negative (blue) values show significant increases and decreases in z-score signal, respectively, following FDR correction for region-wise paired t-tests (at q<0.05).
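
      The caption above refers to FDR correction at q < 0.05 across region-wise tests. One common way to do this is the Benjamini-Hochberg step-up procedure, sketched below; the response does not state which FDR procedure was used, so treating it as Benjamini-Hochberg is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return a list of booleans: True where the test is significant at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:
            k_max = rank
    # Every test ranked at or below k_max is declared significant
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

      In a region-wise analysis, `pvals` would hold one paired t-test p-value per brain region, and the returned mask would select the regions to color in the contrast maps.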

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      In the article titled "Polyphosphate discriminates protein conformational ensembles more efficiently than DNA promoting diverse assembly and maturation behaviors," Goyal and colleagues investigate the role that negatively charged biopolymers, i.e., polyphosphate (polyP) and DNA, play in phase separation of cytidine repressor (CytR) and fructose repressor (FruR). The authors find that both negative polymers drive the formation of metastable protein/polymer condensates. However, polyP-driven condensates form more gel- or solid-like structures over time while DNA-driven condensates tend to dissipate over time. The authors link this disparate condensate behavior to polyP-induced structures within the enzymes. Specifically, they observe the formation of polyproline II-like structures within two tested enzyme variants in the presence of polyP. Together their results provide a unique insight into the physical and structural mechanism by which two unique negatively charged polymers can induce distinct phase transitions with the same protein. This study will be a welcomed addition to the condensate field and provide new molecular insights into how binding partner-induced structural changes within a given protein can affect the mesoscale behavior of condensates. The concerns outlined below are meant to strengthen the manuscript.

      Recommendation:

      We value the reviewer’s positive comments and appreciate the time taken to provide detailed feedback, which has certainly helped improve our manuscript.

      Major Concerns:

      (1) The biggest concern in this manuscript lies with experiments comparing polyP45, which has a net negative charge of -47, and double-stranded DNA of 45 base pairs (as stated in the methods), which will have a net negative charge of -90. Given the dependence of phase separation and phase transitions on not only net charge but charge density, this is an important factor to consider when comparing the effect of these molecules. It is unclear how or if the authors considered these factors in the design of their experiments. Because of the factor of 2 difference in net charge over the same number of polymer chain components, i.e. a chain of 45 Pi vs. a chain of 45 double-stranded base pairs, it is unclear if the results from polyP vs. DNA are directly comparable. One solution would be to repeat all DNA experiments using single-stranded DNA so that the net charge is similar to polyP over the same chain length. Another possibility would be to repeat DNA experiments using a double-stranded DNA of 23 base pairs. This would allow for a nearly equal net charge (-46 vs. -47 for polyP), but the charge density would still be 2X polyP. As it stands now, the perceived differences in DNA vs. polyP behavior may be an artifact arising from the difference in net charge and charge density between DNA and polyP.

      To address the reviewer’s concerns regarding charge density differences between polyP and DNA, we conducted an experiment using a higher DNA concentration (11.24 µM) to obtain charge equivalence between the two experiments (i.e., the same total concentration of charges). As shown in Figure S5, even at the higher DNA concentration, the condensates undergo progressive dissolution over time. This observation indicates that the differential maturation of condensates, arising from distinct initial protein ensembles, is governed by the intrinsic properties of polyP. Charge density (i.e., the number of charges per unit volume of the polymer), on the other hand, is an intrinsic feature of the polymer, which is naturally different between DNA and polyP. In fact, the primary result of our work is our observation that polyP can discern the starting ensembles more efficiently, likely through actively engaging and interacting with the ensemble, while DNA appears to be a passive player. The differences are not an artifact, as they arise from fundamental features of two natural anionic polymers found within cells. In other words, the outcomes could be very different if the concentration of one polymer dominates over the other (see the response below).

      (2) One outstanding question the authors do not address relates to how mixtures of CytR or FruR, DNA, and polyP behave. In the bacterial cytoplasm, these molecules are all in the same compartment (admittedly that compartment is not well mixed due to unique condensate-driven organization). Would the authors expect to see similar effects of polyP and DNA if they were in the same solution? Perhaps the authors could run a set of experiments where they vary the ratios of DNA and polyP to probe how increased levels of "stress", i.e. increased levels of polyP vs. DNA, alter the formation and behavior of enzymatic condensates.

      Following this comment, we investigated the phase separation behavior of CytR WT in the presence of different charge ratios of polyP-DNA mixtures. As seen in Author response image 1, panel A below, the outcomes are highly sensitive to the starting concentrations: at higher charge concentration of polyP (left panel), the OD and ThT fluorescence intensity are high at early time points, then both decrease and increase again. Fluorescence microscopy images (panel B) reveal similar trends, but the more fascinating outcome is the FRAP recovery profiles, which recover extremely fast and fully at the zero time point (panel C) despite aggregation-like tendencies observed in ThT fluorescence assays. However, at longer time points (20 and 40 mins) the FRAP recovery is significantly weaker, but it recovers to ~65% at 1 hour (panel C). At high relative polyP concentrations with respect to DNA, droplets are formed first and then transition into aggregates (liquid-to-solid transition; middle image in panel A). At relatively high DNA concentrations it appears that droplets and aggregates co-exist, as both OD and ThT fluorescence are moderately high. Given these complex behaviors, we have not included these results in the current manuscript, as we still do not fully understand the origins of these differences. In fact, we are planning to extend this study by exploring the combinations in detail to understand the relative roles played by the two polymers in ternary mixtures.

      Author response image 1.

      (3) In Figure 1H, the recovery trace shows the fractional recovery of DM to near WT levels. It is clear from the images that recovery of the bleached region occurs, but the overall fluorescence intensity of DM is much lower than WT, even when accounting for the difference in starting condensate sizes in the Pre-Bleach images. Shouldn't this qualitative difference in total fluorescence be reflected in the quantitative trace?

      In Figure 2H, as the reviewer rightly points out, there is a clear difference in the absolute fluorescence intensity between WT and DM condensates. We would like to clarify that the recovery traces shown in Figure 2I were normalized to the pre-bleach intensity of each individual condensate to reflect fractional recovery. This normalization is intended to highlight the relative mobility of the protein within each condensate, but it does not capture the difference in total fluorescence intensity between WT and DM.
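
      To make the effect of this normalization concrete, the sketch below divides each recovery trace by its own pre-bleach intensity. The function name and the intensity values are invented for illustration and are not taken from the manuscript; background subtraction and correction for acquisition photobleaching, which are often applied as well, are omitted.

```python
def fractional_recovery(trace, pre_bleach):
    """Express a post-bleach intensity trace as a fraction of the pre-bleach value."""
    return [value / pre_bleach for value in trace]

# Two hypothetical condensates: one bright, one ten times dimmer
bright = fractional_recovery([200.0, 500.0, 800.0], pre_bleach=1000.0)
dim = fractional_recovery([20.0, 50.0, 80.0], pre_bleach=100.0)
```

      Both traces come out as fractions of their own starting value, so two condensates with very different absolute brightness can show the same fractional recovery curve, which is why the difference in total fluorescence does not appear in the quantitative trace.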

      (4) A description of the molten-globular variant Y19A FruR should be included in the main text where the variant is introduced. There is currently no additional description of the molten-globular variant in the Supplement as suggested by the manuscript.

      Figure 6A depicts the three-dimensional structure of FruR WT, with tyrosine residues Y19 and Y28, shown in red, forming stacking interactions. In the Y19A mutant, the loss of these interactions results in little change in secondary structure (as shown in Figure 6E) but disrupts the protein’s tertiary structure, resulting in a molten globular state. The FruR work is now published in JPCB and can be found at https://doi.org/10.1021/acs.jpcb.4c03895, and is also appropriately cited in the revised version (reference 53).

      (5) Throughout the manuscript, the authors discuss polyP and DNA being able (or unable) to "distinguish" between different variants of CytR and FruR. This is confusing and suggests that DNA or polyP can choose to bind one form over another. The authors should re-work the language in this section to better reflect their direct observations for the behavior of protein in CD experiments and condensate behavior in imaging and turbidity experiments.

      We have now modified the text where necessary. The experiments were not done in the presence of both polyP and DNA, but in isolation (protein + polyP or protein + DNA). Hence, our aim is to convey that polyP is the polymer that leads to variable outcomes because of its ability to ‘interact’ differently with the different starting ensembles.

      Minor Concerns:

      (1) For all Figures, please include the number of measurements, i.e., N = ...

      We have updated all figure legends to include the number of measurements, indicated as N = ..., as suggested.

      (2) For all Figures, please place panel labels, i.e., A, B, C, etc., in the same respective location for each panel. As currently mapped out, it is difficult to easily determine which data are associated with each panel because the IDs are in various locations.

      Due to variations in data presentation and spacing within individual plots, it was challenging to place all labels in exactly the same position without obscuring important details. We have therefore maintained the labels as they were before.

      (3) In the introduction, it would be helpful for the authors to specify exactly what is meant by chaperone. Given the context, it seems that the authors refer to the chaperone activity as one that prevents aggregation. Is this correct?

      We refer to chaperone activity specifically as the ability to prevent aggregation of proteins. We have now clarified this definition in the Introduction section of the revised manuscript.

      (4) The results for experiments shown in Figure 3 need additional setup in the text. Were these measurements taken immediately after mixing WT, DM, or P33A with polyP? If so, why do condensates immediately appear and then dissipate before ThT-detected aggregates begin forming? Or were condensates allowed to form and then transferred to a different buffer, after which measurements were taken? Without a brief description of the experimental setup, interpreting the results is difficult.

      The condensates appear immediately after adding polyP to protein solutions, indicating that the condensate phase is kinetically accessible on mixing polyP with DM or the WT. As illustrated in Figure 3A and 3B, for the WT protein, the condensates undergo a liquid-to-solid transition over time, as this is likely the most thermodynamically stable phase. Effectively, this work conveys that it is important to examine the time dependence of droplets even once they have formed, as they may not be the most stable phase.

      (5) Please include images of P33A over the time course of the experiment in Figure 3B.

      We have included representative images of P33A in the presence of polyP over time in Figure 3B in the revised manuscript.

      (6) In Figures 3D, E, G, and H, please plot each measurement separately with mean and standard deviation to enable the reader to see each data point.

      We have now revised Figures 3D, E, G, and H to show individual data points along with the mean and standard deviation.

      (7) In the top paragraph on page 12, "fast-moving molecules" can be replaced with "dynamic molecules", as this offers a better description of the FRAP data.

      We have incorporated the suggested changes.

      (8) In the "Structural changes within the condensates spans over three hours" results section on page 15, the conclusion reads "In summary, we find that both the WT and the DM 'unfold' on forming condensates with polyP..." The way this is written suggests that WT and DM behave in a similar manner. Given the CD data, however, it seems that by 4 hours, DM forms alpha helices while the WT does not. This suggests that while each unfolds, the conformation at 4 hours is different. The summary should reflect these differences.

      We fully agree with the reviewer on this. The summary is now modified to include the fact that the DM forms alpha helices at 4 hours while the WT does not.

      (9) At the end of the first paragraph of the results section "DNA does not discriminate the conformational ensembles" the authors should refer to Figure 2G, where they show the altered morphology of polyP-P33A condensates.

      We have now included the reference to Figure 2G.

      (10) The authors refer to droplets "solubilizing" throughout the manuscript. It seems that dissolve is a better term to use. Solubilize is better associated with individual biomolecules while dissolve is better associated with condensate behavior.

      We thank the reviewer for pointing this out. We have revised the manuscript to replace “solubilize” with “dissolve”.

      (11) In Figures 5L and 5N, please change the Y-axis scale so that each curve is visible on the plot.

      We have adjusted the Y-axis scale in Figures 5L, 5M, and 5N to ensure that each curve is clearly visible and for easier comparison among the variants.

      (12) The authors should show an image of FruR WT and Y19A with DNA for a direct comparison with experiments in which FruR and polyP were used. The addition of turbidity measurements of samples shown in Figure 6D will offer another direct comparison. As written, there is no way for the author to directly compare the effects of polyP and DNA on FruR phase transitions.

      As suggested, we have now included representative images of FruR WT and Y19A with DNA (Figure 6K and 6L) to enable a direct comparison with the FruR–polyP experiments. Also, we have already shown turbidity measurements in Figure 6B and 6C corresponding to the samples shown in Figure 6D.

      Reviewer 2:

      In this study, Goyal et al demonstrate that the assembly of proteins with polyphosphate into either condensates or aggregates can reveal information on the initial protein ensemble. They show that, unlike DNA, polyphosphate is able to effectively discriminate against initial protein ensembles with different conformational heterogeneity, structure, and compactness. The authors further show that the protein native ensemble is vital on whether polyphosphate induces phase separation or aggregation, whereas DNA induces a similar outcome regardless of the initial protein ensemble. This work provides a way to improve our mechanistic understanding of how conformational transitions of proteins may regulate or drive LLPS condensate and aggregate assemblies within biological systems.

      We thank the reviewer for the favorable comments on the manuscript.

      Major Concerns:

      (1) The authors are using bacterial proteins (CytR and FruR) and solely represent polyphosphates as polyP45 (a polyphosphate with 45 Pi units). However, in bacterial systems, polyphosphates can be significantly longer (in the order of 100s to 1000 Pi units). Additionally, the experiments were run at neutral pH (7.0), and though this is fairly appropriate for the cytoplasm, volutin granules (where polyphosphates often accumulate) are typically considered slightly acidic (pH 5.5-6.5). From a physiological perspective, understanding how pH and the length of polyphosphate influence the ability to induce condensates or aggregates could be of importance.

      We appreciate the reviewer’s insightful comments regarding the physiological relevance of polyphosphate length and pH. In our current study, we used polyP45 as it is easily available commercially, and we conducted our experiments at pH 7 to mimic general cytoplasmic conditions. We agree that polyphosphates in bacterial cells can be significantly longer (hundreds to thousands of Pi units) and that conducting experiments in a slightly more acidic environment would be physiologically relevant. We plan to use longer polyP from Regene Tiss Inc. and acidic pH to explore how polyphosphate-induced phase separation of CytR varies with pH as part of a future study. One could imagine doing all the experiments listed in the manuscript at different pH conditions for the different variants, but this could not be a part of the current work, which has a specific focus on the differences in maturation properties depending on the nature of the starting ensemble. However, the pKa value of the internal hydroxyl groups is ~2.2 (DOI:10.2147/IJN.S389819), indicating that polyP carries near identical charges in the pH range between 4 and 7, and hence we expect little change in the charged status of polyP. On the other hand, the protonation states of charged amino acids within CytR could vary with pH, thus influencing its assembly properties.

      (2) In the study, the longest metastable condensate induced by polyphosphate lasted approximately 3 hours before resolubilizing. It would be nice if the authors were able to generate a longer-lived condensate phase that would enable further mechanistic studies (e.g., NMR).

      We agree that generating longer-lived condensates would be highly valuable for mechanistic studies. However, the formation and stability of condensates is an intrinsic property of the protein, and optimizing different conditions for a longer-lived condensate phase is beyond the scope of the current study. It is possible that the condensates are long-lived with longer polyP, but it is not clear if this would indeed be the case. We would also like to state here that while it is common to report on the liquid-to-solid transition in condensates, the intrinsic metastability of droplets (when there is no aggregation) is rarely reported. One possibility is to mutationally introduce cysteine residues and induce the formation of disulphide bridges (as done in a recent work, doi: 10.1021/jacs.4c09557) that make the condensate highly stable kinetically; however, this would also complicate the interpretation, as the mechanism of condensate formation might be very different. We have therefore reported our results as an observation arising from differences in the nature of the poly-anionic polymers.

      (3) The authors showed that CytR DM (fully folded), CytR WT (minor state folded), and CytR P33A (highly disordered) with polyphosphates lead to longer-lived condensates that resolubilize, shorter-lived condensates that aggregate, and immediate aggregation, respectively. Whereas FruR (folded) and FruR Y19A (molten globular) with polyphosphate induce spontaneous aggregation and short-lived condensates, respectively. I would expect FruR to be more similar to CytR DM and FruR Y19A more similar to CytR WT in terms of structure and conformational dynamics and plasticity, yet they have opposing results. This raises a bit of concern. Meaning, that though polyphosphate discriminates between the different ensembles, is it actually possible to obtain information on the initial ensemble composition?

      In the current study, we show that CytR WT (less structured) and FruR Y19A (molten globule) form short-lived condensates that aggregate. We agree with the reviewer that while CytR DM (fully folded) forms condensates that dissolve over time, FruR WT (fully folded) variant forms aggregates immediately upon polyP addition. The observations show that polyP can discriminate between different protein conformations, in contrast to DNA, which does not show such selectivity. However, we acknowledge that while polyP-induced behavior reflects aspects of protein ensemble properties, it does not provide direct insight into the nature of the initial conformational ensemble.

      (4) In the case of FruR with polyphosphate, no CD for the secondary structure analysis was provided as it was for CytR. It would be useful to see if the polyphosphate-induced structural changes observed for CytR hold true for FruR as well.

      We thank the reviewer for the suggestion. In response, we have performed far-UV CD experiments on FruR variants in the presence of polyP. Similar to the CytR WT, FruR WT shows unfolding upon polyP addition. A similar outcome is noted for the Y19A variant though there is significant residual helix content in the condensate unlike the WT. The CD spectra of FruR variants have been added to Figure 6.

      Minor Concerns/Suggestions:

      Under conclusion, third paragraph, first sentence. This sentence reads, "Our observations thus establish that polyP efficiently discriminates the conformational features of proteins than DNA, contributing to the diverse outcomes."

      We thank the reviewer for pointing this out. The sentence has been revised for clarity. It now reads “Our observations establish that polyP is more sensitive to the conformational features of proteins than DNA, thereby contributing to the diverse outcomes.”

      One experimental suggestion. Seeing that protein dynamics and plasticity seem to play a role. For either CytR WT or DM, it would be interesting to see the influence of temperature. Altering the temperature is a good way to perturb the population distribution of conformation sub-states and to alter kinetics. It may be that at a lower temperature (maybe 5C) for the WT you reduce conformational dynamics and you obtain results more similar to that of the DM. Alternatively, heating the DM would be another option. Obviously, there are additional challenges that may arise with changing the temperature, but if it were to work I think it could add some value.

      We thank the reviewer for the thoughtful suggestion. Due to limitations in our current experimental setup (the confocal setup does not have a temperature controller, one of the ‘challenges’ the reviewer notes), we will not be able to perform temperature-controlled assays. However, the ‘structure’ of the CytR variants does not vary much between 280 – 298 K, and this is one of the reasons for choosing three variants without altering any other thermodynamic property. If temperature were varied, the dynamics of polyP would also change, and hence the true molecular origins of any differences we might observe would be confounded by the dynamic effects on polyP as well. In this work, we have eliminated any dynamic differences in polyP by performing the experiments at a fixed temperature.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public Review):

      Summary:

      This manuscript explores the impact of serotonin on olfactory coding in the antennal lobe of locusts and odor-evoked behavior. The authors use serotonin injections paired with an odor-evoked palp-opening response assay and bath application of serotonin with intracellular recordings of odor-evoked responses from projection neurons (PNs).

      Strengths:

      The authors make several interesting observations, including that serotonin enhances behavioral responses to appetitive odors in starved and fed animals, induces spontaneous bursting in PNs, directly impacts PN excitability, and uniformly enhances PN responses to odors.

      Weaknesses:

      The one remaining issue to be resolved is the theoretical discrepancy between the physiology and the behavior. The authors provide a computational model that could explain this discrepancy and provide the caveat that while the physiological data were collected from the antennal lobe, there could be other olfactory processing stages involved. Indeed, other processing stages could be the sites for the computational functions proposed by the model. There is an additional caveat, which is that the physiological data were collected 5-10 minutes after serotonin application whereas the behavioral data were collected 3 hours after serotonin application. It is difficult to link physiological processes induced 5 minutes into serotonin application to behavioral consequences 3 hours subsequent to serotonin application. The discrepancy between physiology and behavior could easily reflect the timing of action of serotonin (i.e. differences between immediate and longer-term impact).

      For our behavioral experiments, we waited 3 hours after serotonin injection to allow serotonin to penetrate through the layers of air sacs and the sheath, and for the locusts to calm down and recover their baseline POR activity levels. For the physiology experiments, we noticed that the quality of the patch decreased over time after serotonin introduction. Hence, it was difficult to hold cells for that long. However, the point raised by the reviewer is well-taken. We have performed additional experiments to show that the changes in POR levels to different odorants are rapid and can be observed within 15 minutes of injecting serotonin (Author response image 2) and that the physiological changes in PNs (bursting spontaneous activity, maintenance of temporal firing patterns, and increased odor-evoked responses) persist when the cells are held for a longer duration (i.e. 3 hours, akin to our behavioral experiments). It is worth noting that 3-hour in-vivo intracellular recordings are not easily achievable and come with many experimental constraints. So far, we have managed to record from two PNs that were held for this long, and we add them to this rebuttal to support our conclusions (Author response image 1).

      Author response image 1.

      Spontaneous and odor-evoked responses in individual PNs remain consistent for three hours after serotonin introduction into the recording chamber/bath. (A) Representative intracellular recording showing membrane potential fluctuations in a projection neuron (PN) in the antennal lobe. Spontaneous and odor-evoked responses to four odorants (pink color bars, 4 s duration) are shown before (control) and after serotonin application (5HT). Voltage traces 30 minutes (30min), 1 hour (1h), 2 hours (2h), and 3 hours (3h) after 5HT application are shown to illustrate the persisting effect of serotonin during spontaneous and odor-evoked activity periods. (B) Rasterized spiking activities in two recorded PNs are shown. Spontaneous and odor-evoked responses are shown in all 5 consecutive trials. Note that the odor-evoked response patterns are maintained, but the spontaneous activity patterns are altered after serotonin introduction.

      Author response image 2.

      Palp-opening response (POR) patterns to different odorants remain consistent following serotonin introduction. The probability of PORs is shown as a bar plot for four different odorants: hexanol (green), benzaldehyde (blue), linalool (red), and ammonium (purple). PORs before serotonin injection (solid bars) are compared against response levels after serotonin injection (striped bars). As can be noted, PORs to the four odorants remain consistent when tested 15 minutes and 3 hours after serotonin (5HT) injection.

      Overall, the study demonstrates the impact of serotonin on odor-evoked responses of PNs and odor-guided behavior in locusts. Serotonin appears to have non-linear effects including changing the firing patterns of PNs from monotonic to bursting and altering behavioral responses in an odor-specific manner, rather than uniformly across all stimuli presented.

      We thank the reviewer for again providing very useful feedback for improving our manuscript.

      Reviewer #2 (Public Review):

      Summary:

      The authors investigate the influence of serotonin on feeding behavior and electrophysiological responses in the antennal lobe of locusts. They find that serotonin injection changes behavior in an odor-specific way. In physiology experiments, they can show that projection neurons in the antennal lobe generally increase their baseline firing and odor responses upon serotonin injection. Using a modeling approach the authors propose a framework on how a general increase in antennal lobe output can lead to odor-specific changes in behavior.

      Strengths:

      This study shows that serotonin affects feeding behavior and odor processing in the antennal lobe of locusts, as serotonin injection increases activity levels of projection neurons. This study provides another piece of evidence that serotonin is a general neuromodulator within the early olfactory processing system across insects and even phyla.

      Weaknesses:

      I still have several concerns regarding the generalizability of the model and interpretation of results. The authors cannot provide evidence that serotonin modulation of projection neurons impacts behavior.

      This is true and likely to be true for any study linking neural responses to behavior. There are multiple circuits and pathways that would be impacted by a neuromodulator like serotonin. What we showed with our physiology is how spontaneous and odor-evoked responses in the very first neural network that receives olfactory sensory neuron input are altered by serotonin. Given the specificity of the changes in behavioral outcomes (i.e. odor-specific increase and decrease in an appetitive behavior) and non-specificity in the changes at the level of individual PNs (general increase in odor-evoked spiking activity), we presented a relatively simple computational model to address the apparent mismatch between neural and behavioral responses (Author response image 4).

      The authors show that odor identity is maintained after 5-HT injection, however, the authors do not show if PN responses to different odors were differently affected after serotonin exposure.

      The PN responses to different odorants changed in a qualitatively similar fashion (Author response image 3).

      Author response image 3.

      PN activity before and after 5HT application is compared for different cell-odor combinations. As can be noted, the changes are qualitatively similar in all cases. After 5HT application, the baseline activity became more bursty, but the odor-evoked response patterns were robustly maintained for all odorants.

      Regarding the model, the authors show that the model works for odors with non-overlapping PN activation. However, only one appetitive, one neutral, and one aversive odor has been tested and modeled here. Can the fixed-weight model also hold for other appetitive and aversive odors that might share more overlap between active PNs? How could the model generate BZA attraction in 5-HT exposed animals (as seen in behavior data in Figure 1) if the same PNs just get activated more?

      Author response image 4.

      Testing the generality of the proposed computational model. To test the generality of the proposed model, we used a published dataset [Chandak and Raman, 2023]: Neural dataset – 89 PN responses to a panel of twenty-two odorants; Behavioral dataset – probability of POR responses to the same twenty-two odorants. We built the model using just the three odorants overlapping between the two datasets: hexanol, benzaldehyde and linalool. The true POR probabilities and the POR probabilities predicted by the model are shown for all twenty-two odorants as a scatter plot. As can be noted, there is a high correlation (0.79) between the true and the predicted values.

      The authors should still not exclude the possibility that serotonin injections could affect behavior via modulation of other cell types than projection neurons. This should still be discussed, serotonin might rather shut down baseline activation of local inhibitory neurons - and thus lead to the interesting bursting phenotypes, which can also be seen in the baseline response, due to local PN-to-LN feedback.

      As we agreed, there could be other cells that are impacted by serotonin release. Our goal in this study was to characterize how spontaneous and odor-evoked responses in the very first neural network that receives olfactory sensory neuron input are altered by serotonin. Within this circuit, there are local inhibitory neurons (LNs), as correctly indicated by this reviewer. Surprisingly, our preliminary data indicate that LNs are not shut down but also have an enhanced odor-evoked neural response (Author response image 5). Further data would be needed to verify this observation and determine the mechanisms that mediate the changes in PN excitability. Irrespective, since PN activity should incorporate the effects of changes in the local neuron responses and is the sole output from the antennal lobe that drives all downstream odor-evoked activity, we focused on PNs in this study.

      Author response image 5.

      Representative traces showing intracellular recordings from a local neuron in the antennal lobe. Five consecutive trials are shown. Note that LNs in the locust antennal lobe are non-spiking. The LN activity before, during, and after the presentation of benzaldehyde and hexanol (colored bar; 4 s) is shown. The left and right panels show LN activity before and after the application of 5HT. As can be noted, 5HT did not shut down odor-evoked activity in this local neuron.

      The authors did not fully tone down their claims regarding causality between serotonin and starved state behavioral responses. There is no proof that serotonin injection mimics starved behavioral responses.

      Specific minor issues:

      It is still unclear how naturalistic the chosen odor concentrations are. This is especially important as behavioral responses to different concentrations of odors are differently modulated after serotonin injection (Figure 2: Linalool and Ammonium). The new method part does not indicate the concentrations of odors used for electrophysiology.

      All odorants were diluted to 0.01-10% concentration by volume in either mineral oil or distilled water. This information is included in the Methods section. For most odorants used in the study, the lower concentrations evoked only a very weak neural response, and the higher concentrations evoked more robust responses. The POR responses for these odorants at the various concentrations chosen are included in Figure 2. Note that the responses to linalool and ammonium remained weak across all concentrations, compared to hexanol and benzaldehyde.

      Did all tested PNs respond to all odorants?

      No, only a subset of them responds to each odorant. These responses have been well characterized in earlier publications [included refs].

      The authors do not show if PN responses to different odors were differently affected after serotonin exposure. They describe that ON responses were robust, but OFF responses were less consistent after 5-HT injection. Was this true across all odors tested? Example traces are shown, but the odor is not indicated in Figure 4A. Figure 4D shows that many odor-PN combinations did not change their peak spiking activity - was this true across odorants? In Figure 5 - are PNs ordered by odor-type exposure?

      Also, Figure 6A only shows example trajectories for odorants - how does the average look? Regarding the data used for the model - can the new dataset from the 82 odor-PN pairs reproduce the activation pattern of the previously collected dataset of 89 pairs?

      What is shown in Figure 6A is the trial-averaged response trajectory combining activities of all 82 odor-PN pairs. The 82 odor-PN pairs were collected intracellularly, examining the responses to four odorants before and after 5HT application. The second dataset, involving 89 PN responses to 22 odorants, was collected extracellularly. The two datasets are qualitatively similar in that each odorant activates a unique subset of those neurons.

      The authors toned down their claims that serotonin injection can mimic the starved state behavioral response. However, some sentences still indicate this finding and should also be toned down:

      last sentence of introduction - "In sum, our results provide a more systems-level view of how a specific neuromodulator (serotonin) alters neural circuits to produce flexible behavioral outcomes."

      We believe our computational model showed this, namely how uniform changes in the neural responses could lead to variable and odor-specific changes in behavioral PORs.

      discussion: "Finally, fed locusts injected with serotonin generated similar appetitive responses to food-related odorants as starved locusts indicating the role of serotonin in hunger state-dependent modulation of odor-evoked responses." This claim is not supported.

      Figure 7 shows that the fed locusts had lower POR to hex and bza. The POR responses significantly increased after the 5HT application. However, we have rephrased this sentence to limit our claims to this result. "Finally, fed locusts injected with serotonin generated similar appetitive palp-opening responses to food-related odorants as observed in starved locusts."

      last results: "However, consistent with results from the hungry locusts, the introduction of serotonin increased the appetitive POR responses to HEX and BZA. Intriguingly, the appetitive responses of fed locusts treated with 5HT were comparable or slightly higher than the responses of hungry locusts to the same set of odorants."

      Again this sentence simply describes the result shown in Figure 7.

      In Figure 7 - BZA response seems unchanged in hungry and fed animals and only 5-HT injection enhances the response. There is only one example where 5-HT application and starvation induce the same change in behavior - N=1 is not enough to conclude that serotonin influences food-driven behaviors.

      The reviewer is ignoring the lack of changes to PORs to linalool and ammonium. Taken together, serotonin increased PORs to only two of the four odorants in starved locusts. The responses after 5HT modulation to these four odorants were similar in fed locusts treated with 5HT and starved locusts.

      Also, this seems to be wrongly interpreted in Figure 7: "It is worth noting that responses to LOOL and AMN, non-food related odorants with weaker PORs, remained unchanged in fed locusts treated with 5HT." The authors indicate a significant reduction in POR after 5-HT injection on LOOL response in Figure 7.

      Revised.

      "It is worth noting that responses to LOOL and AMN, non-food related odorants with weaker PORs, were reduced in fed locusts treated with 5HT."

      Also, the newly added sentence at the end of the discussion does not make sense: "However, since 5HT increased behavioral responses in both fed and hungry locusts, the precise role of 5HT modulation and whether it underlies hunger-state dependent modulation of appetitive behavior still remains to be determined."

      The authors did not test 5-HT injection in starved animals.

      The results shown in Figure 1 compare the POR responses of starved locusts before and after 5HT introduction.

      We again thank the reviewer for useful feedback to further improve our manuscript.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This manuscript explores the impact of serotonin on olfactory coding in the antennal lobe of locusts and odor-evoked behavior. The authors use serotonin injections paired with an odor-evoked palp-opening response assay and bath application of serotonin with intracellular recordings of odor-evoked responses from projection neurons (PNs).

      Strengths:

      The authors make several interesting observations, including that serotonin enhances behavioral responses to appetitive odors in starved and fed animals, induces spontaneous bursting in PNs, and uniformly enhances PN responses to odors. Overall, I had no technical concerns.

      Weaknesses:

      While there are several interesting observations, the conclusions that serotonin enhanced sensitivity specifically and that serotonin had feeding-state-specific effects, were not supported by the evidence provided. Furthermore, there were other instances in which much more clarification was needed for me to follow the assumptions being made and inadequate statistical testing was reported.

      Major concerns.

      • To enhance olfactory sensitivity, the expected results would be that serotonin causes locusts to perceive each odor as being at a relatively higher concentration. The authors recapitulate a classic olfactory behavioral phenomenon where higher odor concentrations evoke weaker responses which is indicative of the odors becoming aversive. If serotonin enhanced the sensitivity to odors, then the dose-response curve should have shifted to the left, resulting in a more pronounced aversion to high odor concentrations. However, the authors show an increase in response magnitude across all odor concentrations. I don't think the authors can claim that serotonin enhances the behavioral sensitivity to odors because the locusts no longer show concentration-dependent aversion. Instead, I think the authors can claim that serotonin induces increased olfactory arousal.

      The reviewer makes a valid point. Bath application of serotonin increased POR behavioral responses across all odor concentrations, and concentration-dependent aversion was also not observed. Furthermore, the monotonic relationship between projection neuron responses and the intensity of current injection is altered when serotonin is exogenously introduced (see Author response image 1; see below for more explanation). Hence, our data suggest that serotonin alters the dose-response relationship between neural/behavioral responses and odor intensity. As recommended, we have revised our claim to state that serotonin induces an increase in olfactory arousal. The new physiology data have been added as Supplementary Figure 3 to the revised manuscript.

      • The authors report that 5-HT causes PNs to change from tonic to bursting and conclude that this stems from a change in excitability. However, excitability tests (such as I/V plots) were not included, so it's difficult to disambiguate excitability changes from changes in synaptic input from other network components.

      To confirm that the PN excitability did indeed change after serotonin application, we performed a new set of current-clamp recordings. In these experiments, we monitored the spiking activities in individual PNs as we injected different levels of current (200 – 1000 pA). Note that locust LNs that provide recurrent inhibition arborize and integrate inputs from a large number of sensory neurons and projection neurons. Therefore, activating a single PN should not activate the local neurons, and hence should not recruit the antennal lobe network.

      We found that the total spiking activity monotonically increased with the magnitude of the current injection in all four PNs recorded (Author response image 1). However, after serotonin injection, we found that the spiking activity remained relatively stable and did not systematically vary with the magnitude of the current injection. While the changes in odor-evoked responses may incorporate both excitability changes in individual PNs and recurrent feedback inhibition through GABAergic LNs, these results from our current injection experiments unambiguously indicate that there are changes in excitability at the level of individual PNs. We have added this result to the revised manuscript.

      Author response image 1.

      Current-injection induced spiking activity in individual PNs is altered after serotonin application. (A) Representative intracellular recordings showing membrane potential fluctuations as a function of time for one projection neuron (PN) in the locust antennal lobe. A two-second window when a positive 200-1000 pA current was applied is shown. Firing patterns before (left) and after (right) serotonin application are shown for comparison. Note that the spiking activity changes after the 5HT application. The black bar represents the 20 mV scale. (B) Dose-response curves showing the average number of action potentials (across 5 trials) during the 2-second current pulse before (green) and after (purple) serotonin for each recorded PN. Note that the current intensity was systematically increased from 200 pA to 1000 pA. (C) The mean number of spikes across the four recorded cells during current injection is shown. The color progression represents the intensity of applied current, ranging from 200 pA (leftmost bar) to 1000 pA (rightmost bar). The dose-response trends before (green) and after (purple) 5HT application are shown for comparison. The error bars represent SEM across the four cells.
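
      The summary statistic in panel (C), a mean across cells with SEM error bars, can be sketched as follows. This is a minimal illustration only: the helper name `mean_and_sem` is hypothetical, the spike counts are made-up placeholders rather than data from the recordings, and the SEM is assumed to be the sample standard deviation divided by the square root of the number of cells.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, SEM) for a list of per-cell spike counts.

    SEM is computed as the sample standard deviation (n - 1 in the
    denominator) divided by sqrt(n). This is an assumed convention.
    """
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two values")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Placeholder spike counts for four cells at one current level.
# Each value stands for a per-cell average over 5 trials (illustrative only).
example_counts = [10.0, 12.0, 14.0, 16.0]
mean_spikes, sem_spikes = mean_and_sem(example_counts)
```

      Repeating this calculation at each current step, before and after 5HT, would give the bar heights and error bars of the kind described in the caption.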

      • There is another explanation for the theoretical discrepancy between physiology and behavior, which is that odor coding is further processed in higher brain regions (i.e., other than the antennal lobe) not studied in the physiological component of this study. This should at least be discussed.

      This is a valid argument. For our model of neural mapping onto behavior to work, we only need the odorant that evokes or suppresses PORs to activate a distinct set of neurons. Having said that, our extracellular recording results (Fig. 6E) indicate that hexanol (high POR) and linalool (low POR) do activate highly non-overlapping sets of PNs in the antennal lobe. Hence, our results suggest that the segregation of neural activity based on behavioral relevance already begins in the antennal lobe. We have added this clarification to the discussion section.

      • The authors cannot claim that serotonin underlies a hunger state-dependent modulation, only that serotonin impacts responses to appetitive odors. Serotonin enhanced PORs for starved and fed locusts, so the conclusion would be that serotonin enhances responses regardless of the hunger state. If the authors had antagonized 5-HT receptors and shown that feeding no longer impacts POR, then they could make the claim that serotonin underlies this effect. As it stands, these appear to be two independent phenomena.

      This is also a valid point. We have clarified this in the revised manuscript.

      Reviewer #2 (Public Review):

      Summary:

      The authors investigate the influence of serotonin on feeding behavior and electrophysiological responses in the antennal lobe of locusts. They find that serotonin injection changes behavior in an odor-specific way. In physiology experiments, they can show that antennal lobe neurons generally increase their baseline firing and odor responses upon serotonin injection. Using a modeling approach the authors propose a framework on how a general increase in antennal lobe output can lead to odor-specific changes in behavior. The authors finally suggest that serotonin injection can mimic a change in a hunger state.

      Strengths:

      This study shows that serotonin affects feeding behavior and odor processing in the antennal lobe of locusts, as serotonin injection increases activity levels of antennal lobe neurons. This study provides another piece of evidence that serotonin is a general neuromodulator within the early olfactory processing system across insects and even phyla.

      Weaknesses:

      I have several concerns regarding missing control experiments, unclear data analysis, and interpretation of results.

      A detailed description of the behavioral experiments is lacking. Did the authors also provide a mineral oil control and did they analyze the baseline POR response? Is there an increase in baseline response after serotonin exposure already at the behavioral output level? It is generally unclear how naturalistic the chosen odor concentrations are. This is especially important as behavioral responses to different concentrations of odors are differently modulated after serotonin injection (Figure 2: Linalool and Ammonium).

      POR protocol: Sixth instar locusts (Schistocerca americana) of either sex were starved for 24-48 hours before the experiment or taken straight from the colony and fed blades of grass for the satiated condition. Locusts were immobilized by placing them in a plastic tube and securing their body with black electrical tape (see Author response image 2). Locusts were given 20 - 30 minutes to acclimatize after placement in the immobilization tube. As can be noted, the head of the locust, along with the antennae and maxillary palps, protruded out of this immobilization tube so that they could be moved freely by the locust. Note that the maxillary palps are sensory organs close to the mouth parts that are used to grab food and help with the feeding process.

      It is worth noting that our earlier studies had shown that the presentation of ‘appetitive odorants’ triggers locusts to open their maxillary palps even when no food is presented (Saha et al., 2017; Nizampatnam et al., 2018; Nizampatnam et al., 2022; Chandak and Raman, 2023). Furthermore, our earlier results indicate that the probability of palp opening varies across different odorants (Chandak and Raman, 2023). We chose four odorants that had a diverse range of palp-opening: supra-median (hexanol), median (benzaldehyde), and sub-median (linalool). Therefore, each locust in our experiments was presented with one concentration of four odorants (hexanol, benzaldehyde, linalool, and ammonium) in a pseudorandomized order. The odorants were chosen based on our physiology results such that they evoked different levels of spiking activities.

      The odor pulse was 4 s in duration and the inter-pulse interval was set to 60 s. The experiments were recorded using a web camera (Microsoft) placed right in front of the locusts. The camera was fully automated with a custom MATLAB script to start recording 2 seconds before the odor pulse and end recording at odor termination. An LED was used to track the stimulus onset/offset. The POR responses were manually scored offline. Responses to each odorant were scored as 0 or 1 depending on whether the palps remained closed or opened. A positive POR was defined as a movement of the maxillary palps during the odor presentation time window as shown on the locust schematic (Main Paper Figure 1).
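
      Because each response is scored as a binary outcome, the POR probability for an odorant is simply the fraction of presentations that produced a palp opening. The sketch below illustrates this under stated assumptions: the helper name `por_probability` is hypothetical, and the scores are made-up placeholders, not data from the experiments.

```python
from collections import defaultdict


def por_probability(scores):
    """Return the fraction of positive PORs for each odorant.

    `scores` is a list of (odorant, score) pairs, where score is 1 if the
    palps opened during the odor window and 0 if they stayed closed.
    """
    positives = defaultdict(int)
    counts = defaultdict(int)
    for odorant, score in scores:
        if score not in (0, 1):
            raise ValueError("POR scores must be 0 or 1")
        positives[odorant] += score
        counts[odorant] += 1
    return {odorant: positives[odorant] / counts[odorant] for odorant in counts}


# Placeholder scores for four locusts per odorant (illustrative only).
example_scores = [
    ("hexanol", 1), ("hexanol", 1), ("hexanol", 0), ("hexanol", 1),
    ("linalool", 0), ("linalool", 0), ("linalool", 1), ("linalool", 0),
]
probabilities = por_probability(example_scores)
```

      Computing these fractions separately for the before- and after-serotonin sessions gives per-odorant values that can be compared directly, as in the bar plots of POR probability.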

      Author response image 2.

      Pictures showing the behavior experiment setup and representative palp-opening responses in a locust.

      As the reviewer inquired, we performed a new series of POR experiments, where we explored POR responses to mineral oil and hexanol, before and after serotonin injection. For this study, we used 10 locusts that were starved 24-48 hours before the experiment. Note that hexanol was diluted at 1% (v/v) concentration in mineral oil. Our results reveal that locust PORs to hexanol (~50% PORs) were significantly higher than those triggered by mineral oil (~10% PORs). Injection of serotonin increased the POR response rate to hexanol but did not alter the PORs evoked by mineral oil (Author response image 3).

      Author response image 3.

      Serotonin does not alter the palp-opening responses evoked by paraffin oil. The PORs before and after serotonin (5HT) injection are summarized and shown as a bar plot for hexanol and paraffin oil. Striped bars signify the data collected after 5HT injection. Significant differences are identified in the plot (one-tailed paired-sample t-test; *p<0.05).

      Regarding recordings of potential PNs - the authors do not provide evidence that they did record from projection neurons and not other types of antennal lobe neurons. Thus, these claims should be phrased more carefully.

      In the locust antennal lobe, only the cholinergic projection neurons fire full-blown sodium spikes. The GABAergic local neurons only fire calcium ‘spikelets’ (Laurent, TINS, 1996; Stopfer et al., 2003; see Author response image 4 for an example). Hence, we are confident that we are only recording from PNs. Furthermore, because the LN signals are too small, they are also not detected in the extracellular recordings from the locust antennal lobe. Hence, we are confident in our claims and conclusions.

      Author response image 4.

      PN vs LN physiological differences. Left: A representative raw voltage trace recorded from a local neuron before, during, and after a 4-second odor pulse is shown. Note that the local neurons in the locust antennal lobe do not fire full-blown sodium spikes but only fire small calcium spikelets. Right: A raw voltage trace recorded from a representative projection neuron is shown for comparison. Sodium spikes are clearly visible during spontaneous and odor-evoked periods. The gray bar represents the 4-second odor pulse. The vertical black bar represents 40 mV.

      The presented model suggests labeled lines in the antennal lobe output of locusts. Could the presented model also explain a shift in behavior from aversion to attraction - such as seen in locusts when they switch from a solitarious to a gregarious state? The authors might want to discuss other possible scenarios, such as that odor evaluation and decision-making take place in higher brain regions, or that other neuromodulators might affect behavioral output. Serotonin injections could affect behavior via modulation of other cell types than antennal lobe neurons. This should also be discussed - the same is true for potential PNs - serotonin might not directly affect this cell type, but might rather shut down local inhibitory neurons.

      There are multiple questions here. First, regarding solitary vs. gregarious states, we are currently repeating these experiments on solitary locusts. Our preliminary results (not included in the manuscript) indicate that the solitary animals have increased olfactory arousal and respond with a higher POR but are less selective and respond similarly to multiple odorants. We are examining the physiology to determine whether the model for mapping neural responses onto behavior could also explain observations in solitary animals.

      Second, this reviewer makes the point raised by Reviewer 1. We agree that odor evaluation and decision-making might take place in higher brain regions. All we could conclude based on our data is that a segregation of neural activity based on behavioral relevance might provide the simplest approach to map a non-specific increase in stimulus-evoked neural responses onto odor-specific changes in behavioral outcome. Furthermore, our results indicate that hexanol and linalool, two odorants that had an increase and decrease in PORs after serotonin injection, had only minimal neural response overlap in the antennal lobe. These results suggest that the formatting of neural activity to support varying behavioral outcomes might already begin in the antennal lobe. We have added this to our discussion.

      Third, regarding serotonin impacting PNs, we performed a new set of current-clamp experiments to examine this issue (Author response image 1). Our results clearly show that projection neuron activity in response to current injections (that should not incorporate feedback inhibition through local neurons) was altered after serotonin injection. Therefore, the observed changes in the odor-evoked neural ensemble activity should incorporate modulation at both individual PN level and at the network level. We have added this to our discussion as well.

      Finally, the authors claim that serotonin injection can mimic the starved state behavioral response. However, this is only shown for one of the four odors that are tested for behavior (HEX), thus the data does not support this claim.

      We note that Hex is the only appetitive odorant in the panel. But, as reviewer 1 has also brought up a similar point, we have toned down our claims and will investigate this carefully in a future study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      • Was the POR of the locusts towards linalool and ammonium higher than towards a blank odor cartridge? I ask because the locusts appear to be less likely to respond to these odors and so I am concerned that this assay is not relevant to the ecological context of these odors. In other words, perhaps serotonin did not enhance the responses to these odors in this assay, because this is not a context in which locusts would normally respond to these odors.

      The POR response to linalool and ammonium is lower and comparable to that of paraffin oil. Serotonin does not increase POR responses to paraffin oil but does increase response to hexanol (an appetitive odorant). We have clarified this using new data (Author response image 5).

      • It seems to me that Figure 5C is the crux for understanding the potential impact of 5-HT on odor coding, but it is somewhat confusing and underutilized. Is the implication that 5-HT decorrelates spontaneous activity such that when an odor stimulus arrives, the odor-evoked activity deviates to a greater degree? The authors make claims about this figure that require the reader to guess as to the aspect of the figure to which they are referring.

      The reviewer makes an astute observation. Yes, the spontaneous activity in the antennal lobe network before serotonin introduction is not correlated with the ensemble spontaneous activity after serotonin bath application. Remarkably, the odor-evoked responses were highly similar, both in the reduced PCA space and when assayed using high-dimensional ensemble neural activity vectors. Whether the changes in network spontaneous activity have a function in odor detection and recognition is not fully understood and cannot be convincingly answered using our data. But this is something that we had pondered.

      • The modeling component summarized in Figure 6 needs clarification and more detail. Perhaps example traces associated with positive weighting within neural ensemble 1 relative to neural ensemble 2? I struggled to understand conceptually how the model resolved the theoretical discrepancy between physiology and behavior.

      As recommended, here is a plot showing the responses of four PNs that had positive weights to hexanol and linalool. As can be expected, each PN in this group had higher responses to hexanol and no response to linalool. Further, the four PNs that received negative weights had response only to linalool.

      Author response image 5.

      Odor-evoked responses of four PNs that received positive weights in the model (top panel), and four PNs that were assigned negative weights in the model (bottom).

      • Was there a significant difference between the PORs of hungry vs. fed locusts? The authors state that they differ and provide statistics for the comparisons to locusts injected with 5-HT, but then don't provide any statistical analyses of hungry vs. fed animals.

      The POR responses to HEX (an appetitive odorant) were significantly different between the starved and satiated locusts.

      Author response image 6.

      A bar plot summarizing PORs to all four odors for satiated locusts (highlighted with stripes) before (dark shade) and after (lighter shade) 5HT injection. To allow comparison, PORs of starved locusts before 5HT injection are plotted as well (without stripes). The significance was determined using a one-tailed paired-sample t-test (*p<0.05).

      • Were any of the effects of 5-HT on odor-evoked PN responses significant? No statistics are provided.

      We examined the distribution of odor-evoked responses in PNs before and after 5HT introduction. We found that the overall distribution was not significantly different between the two (one-tailed paired-sample t-test; p = 0.93).

      Author response image 7.

      Comparison of the distribution of odor-evoked PN responses before (green) and after (purple) 5HT introduction. One-tailed paired sample t-test was used to compare the two distributions.

      • The authors interchangeably use "serotonin", "5HT" and "5-HT" throughout the manuscript, but this should be consistent.

      This has been fixed in the revised manuscript.

      • On page 2 the authors provide an ecological relevance for linalool as being an additive in pesticides, however, linalool is a common floral volatile chemical. Is the implication that locusts have learned to associate linalool with pesticides?

      Linalool is a terpenoid alcohol that has a floral odor but has also been used as a pesticide and insect repellent [Beier et al., 2014]. As shown in Author response image 2, it evoked the least POR responses amongst a diverse panel of 22 odorants that were tested. We have clarified how we chose odorants based on the prior dataset in the Methods section.

      • In Figure 1, there should be a legend in the figure itself indicating that the black box indicates the absence of POR and the white box indicates presence, rather than just having it in the legend text.

      Done.

      • In Figure 2, the raw data from each animal can be moved to the supplements. The way it is presented is overwhelming and the order of comparisons is difficult to follow.

      Done.

      • For the induction of bursting in PNs by the application of 5-HT, were there any other metrics observed such as period, duration of bursts, or peak burst frequency? The authors rely on ISI, but there are other bursting metrics that could also be included to understand the nature of this observation. In particular, whether the bursts are likely due to changes in intrinsic biophysical properties of the PNs or polysynaptic effects.

      We could use other metrics as the reviewer suggests. Our main point is that the spontaneous activity of individual PNs changed. We have added new current-injection experiments to show that the PN output to square pulses of current becomes different after serotonin application (Author response image 1).

      • Were 4-vinyl anisole, 1-nonanol, and octanoic acid selected as additional odors because they had particular ecological relevance, or was it for the diversity of chemical structure?

      These odorants were selected based on both chemical structure and ecological relevance. The logic behind this was to have a very diverse odor panel that consisted of food odorant – Hexanol, aggregation pheromone – 4-vinyl anisole, sex pheromone – benzaldehyde, acid – octanoic acid, base – ammonium, and alcohol – 1-nonanol. Additionally, we selected these odors based on previous neural and behavioral data on these odorants (Chandak and Raman, 2023; Traner and Raman, 2023; Nizampatnam et al., 2022 & 2018; Saha et al., 2017 & 2013).

      Reviewer #2 (Recommendations For The Authors):

      The electrophysiology dataset combines all performed experiments across all tested different PN-odor pairs. How many odors have been tested in a single PN and how many PNs have been tested for a single odor? This information is not present in the current manuscript. Can the authors exclude that there are odor-specific modulations?

      In total, our dataset includes recordings from 19 PNs. Seven PNs were tested on a panel of seven odorants (4-vinyl anisole, 1-nonanol, octanoic acid, Hex, Bza, Lool, and Amn), and the remaining twelve were tested with the four main odorants used in the study (Hex, Bza, Lool, and Amn). This information has been added to the Methods section.

      How did the authors choose the concentrations of serotonin injections and bath applications - is this a naturalistic amount?

      The serotonin concentration for ephys experiments was chosen based on trial-and-error experiments: 0.01 mM was the highest concentration that did not cause cell death. For the behavioral experiments, we increased the concentration (0.1 M) because anatomical structures in the locust's head, such as air sacs and the sheath, as well as hemolymph, cause some degree of dilution that we cannot control.

      Behavior experiments were performed 3 hours after injection - ephys experiments 5-10 minutes following bath application. Can the authors exclude that serotonin affects neural processing differently on these different timescales?

      We cannot exclude this possibility. We did ephys experiments 5-10 minutes after bath application because it would be extremely hard to hold cells for 3 hours.

      A longer delay was required for our behavioral experiments as the locusts tended to be a bit more agitated, with larger spontaneous movements of palps, and exhibited unprompted vomiting. A 3-hour period allowed the locust to regain its baseline level of movement after 5HT introduction. [This information has been added to the Methods section of the revised manuscript.]

      Concerning the analysis of electrophysiological data. The authors should correct for changes in the baseline before performing PCA analysis. And how much of the variance is explained by PC1 and PC2?

      We did not correct for baseline changes or subtract baseline as we wanted to show that the odor-evoked neural responses still robustly encoded information about the identity of the odorant.

      The authors should perform dye injections after recordings to visualize the cell type they recorded from. Serotonin might affect also other cell types in the antennal lobe.

      As mentioned above, in the locust antennal lobe only PNs fire full-blown sodium spikes, and LNs only fire calcium spikelets (Author response image 4). Since the LN signals are small, they will be buried under the noise floor when using extracellular recording electrodes for monitoring responses in the antennal lobe. Hence, we are pretty certain what type of cells we are recording from.

      There were several typos in the manuscript, please check again.

      We have fixed many of the grammatical errors and typos in the revised version.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Most studies in sensory neuroscience investigate how individual sensory stimuli are represented in the brain (e.g., the motion or color of a single object). This study starts tackling the more difficult question of how the brain represents multiple stimuli simultaneously and how these representations help to segregate objects from cluttered scenes with overlapping objects.

      Strengths

      The authors first document the ability of humans to segregate two motion patterns based on differences in speed. Then they show that a monkey's performance is largely similar; thus establishing the monkey as a good model to study the underlying neural representations.

      Careful quantification of the neural responses in the middle temporal area during the simultaneous presentation of fast and slow speeds leads to the surprising finding that, at low average speeds, many neurons respond as if the slowest speed is not present, while they show averaged responses at high speeds. This unexpected complexity of the integration of multiple stimuli is key to the model developed in this paper.

      One experiment in which attention is drawn away from the receptive field supports the claim that this is not due to the involuntary capture of attention by fast speeds.

      A classifier using the neuronal response and trained to distinguish single-speed from bi-speed stimuli shows a similar overall performance and dependence on the mean speed as the monkey. This supports the claim that these neurons may indeed underlie the animal's decision process.

      The authors expand the well-established divisive normalization model to capture the responses to bi-speed stimuli. The incremental modeling (eq 9 and 10) clarifies which aspects of the tuning curves are captured by the parameters.

      We thank the Reviewer for the thorough summary of the findings and supportive comments.

      Weaknesses

      While the comparison of the overall pattern of behavioral performance between monkeys and humans is important, some of the detailed comparisons are not well supported by the data. For instance, whether the monkey used the apparent coherence simply wasn't tested and a difference between 4 human subjects and a single monkey subject cannot be tested statistically in a meaningful manner. I recommend removing these observations from the manuscript and leaving it at "The difference between the monkey and human results may be due to species differences or individual variability" (and potentially add that there are differences in the task as well; the monkey received feedback on the correctness of their choice, while the humans did not.)

      Thanks for the suggestion. We agree and have modified the text accordingly. We now state on page 8, lines 189-191, "The difference between the monkey and human results may be due to species differences or individual variability. The differences in behavioral tasks may also play a role – the monkey received feedback on the correctness of the choice, whereas human subjects did not."

      A control experiment aims to show that the "fastest speed takes all" behavior is general by presenting two stimuli that move at fast/slow speeds in orthogonal directions. The claim that these responses also show the "fastest speed takes all" is not well supported by the data. In fact, for directions in which the slow speed leads to the largest response on its own, the population response to the bi-speed stimulus is the average of the response to the components (This is fine. One model can explain all direction tuning curve, which also explain averaging at the slower speed stronger directions). Only for the directions where the fast speed stimulus is the preferred direction is there a bias towards the faster speed (Figure 7A). The quantification of this effect in Figure 7B seems to suggest otherwise, but I suspect that this is driven by the larger amplitude of Rf in Figure 8, and the constraint that ws and wf are constant across directions. The interpretation of this experiment needs to be reconsidered.

      The Reviewer raised a good question. Our model with fixed weights for faster and slower components across stimulus directions provided a parsimonious explanation for the whole tuning curve, regardless of whether the faster component elicited a stronger response than the slower component. Because the model can be well constrained by the measured direction-tuning curves, we did not constrain ws and wf to sum to one, which makes the model more general. The linear weighted summation (LWS) model fits the neuronal responses to the bi-speed stimuli very well, accounting for an average of 91.8% (std = 7.2%) of the response variance across neurons. As suggested by the Reviewer, we now use the normalization model to fit the data with fixed weights across all motion directions. The normalization model also provides a good fit, accounting for an average of 90.5% (std = 7.1%) of the response variance across neurons.

      Note that in the new Figure 8A, at the left side of the tuning curve (i.e., at negative vector average (VA) directions), where the slower component moves in a more preferred direction of the neurons than the faster component, the bi-speed response (red curve) is slightly lower than the average of the component responses (gray curve), indicating a bias toward the weaker faster component. Therefore, the faster-speed bias does not occur only when the faster component moves in the more preferred direction. This can also be seen in the direction-tuning curves of an example neuron that we added to the figure (new Fig. 8B). The peak responses to the slower and faster components were about the same, but the neuron still showed a faster-speed bias. At negative VA directions, the red curve is lower than the response average (gray curve) and is biased toward the weaker (faster) component.

      The faster-speed bias also occurs when the peak response to the slower component is stronger than that to the faster component. As a demonstration, Author response image 1 shows an example MT neuron that has a slow preferred speed (PS = 1.9 deg/s) and was stimulated by two speeds of 1.2 and 4.8 deg/s. The peak response to the faster component (blue) was weaker than that to the slower component (green). However, this neuron showed a strong bias toward the faster component. A normalization model fit with fixed weights for the faster and slower components (black curve) described the neuronal response to both speeds (red) well. This neuron was not included in the neuron population shown in Figure 8 because it was not tested with stimulus speeds of 2.5 and 10 deg/s.

      Author response image 1.

      An example MT neuron was tested with stimulus speeds of 1.2 and 4.8 deg/s. The preferred speed of this neuron was 1.9 deg/s. Fixed weights of 0.59 for the faster component and 0.12 for the slower component described the responses to the bi-speed stimuli well using a normalization model. The neuron showed a faster-speed bias although its peak response to the slower component was higher than that to the faster component.

      We modified the text to clarify these points:

      Page 19, lines 405 – 410, “The bi-speed response was biased toward the faster component regardless of whether the response to the faster component was stronger (in positive VA directions) or weaker (in negative VA directions) than that to slower component (Fig. 8A). The result from an example neuron further demonstrated that, even when the peak firing rates of the faster and slower component responses were similar, the response elicited by the bi-speed stimuli was still biased toward the faster component (Fig. 8B). ”

      Page 19, lines 421 – 427, “Because the model can be well constrained by the measured direction-tuning curves, it is not necessary to require ws and wf to sum to one, which is more general. An implicit assumption of the model is that, at a given pair of stimulus speeds, the response weights for the slower and faster components are fixed across motion directions. The model fitted MT responses very well, accounting for an average of 91.8% of the response variance (std = 7.2%, N = 21) (see Methods). The success of the model supports the assumption that the response weights are fixed across motion directions.”
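
      To make the fixed-weight fit concrete, the following is a minimal sketch rather than the analysis code used in the study. It assumes the linear weighted summation takes the form R = ws*Rs + wf*Rf with no constant term, and that "response variance accounted for" means 100*(1 - SS_residual/SS_total); the manuscript's Methods define the actual measure. The tuning curves, preferred directions, and helper names (`tuning`, `fit_lws`, `percent_variance_explained`) are hypothetical, and the weights 0.12 and 0.59 are borrowed from the example neuron above purely as illustrative values.

```python
import math

def tuning(directions_deg, preferred_deg, peak, kappa=2.0):
    """Hypothetical von Mises-shaped direction-tuning curve (spikes/s)."""
    return [peak * math.exp(kappa * (math.cos(math.radians(d - preferred_deg)) - 1.0))
            for d in directions_deg]

def fit_lws(rs, rf, r):
    """Least-squares fit of r ~ ws*rs + wf*rf with ONE (ws, wf) pair for all directions.

    The weights are not constrained to sum to one.
    """
    # Normal equations for a two-parameter linear fit without intercept.
    a = sum(x * x for x in rs)
    b = sum(x * y for x, y in zip(rs, rf))
    d = sum(y * y for y in rf)
    e = sum(x * z for x, z in zip(rs, r))
    f = sum(y * z for y, z in zip(rf, r))
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("component tuning curves are collinear; weights are not identifiable")
    ws = (e * d - b * f) / det
    wf = (a * f - b * e) / det
    return ws, wf

def percent_variance_explained(r, pred):
    """Assumed goodness-of-fit measure: 100 * (1 - SS_residual / SS_total)."""
    mean_r = sum(r) / len(r)
    ss_res = sum((z - p) ** 2 for z, p in zip(r, pred))
    ss_tot = sum((z - mean_r) ** 2 for z in r)
    return 100.0 * (1.0 - ss_res / ss_tot)

# Synthetic example: two components moving in directions 90 degrees apart.
directions = list(range(0, 360, 30))
rs = tuning(directions, preferred_deg=135, peak=40.0)  # slower component alone
rf = tuning(directions, preferred_deg=225, peak=30.0)  # faster component alone
r_bi = [0.12 * s + 0.59 * f for s, f in zip(rs, rf)]   # noiseless bi-speed response

ws, wf = fit_lws(rs, rf, r_bi)
pred = [ws * s + wf * f for s, f in zip(rs, rf)]
pve = percent_variance_explained(r_bi, pred)
print(f"ws = {ws:.2f}, wf = {wf:.2f}, variance explained = {pve:.1f}%")
```

      Because a single weight pair is fitted to the whole direction-tuning curve, a good fit in both negative and positive VA directions is what supports the assumption that the weights are fixed across motion directions.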

      Reviewer #2 (Public Review):

      Summary:

      This is a paper about the segmentation of visual stimuli based on speed cues. The experimental stimuli are random dot fields in which each dot moves at one of two velocities. By varying the difference between the two speeds, as well as the mean of the two speeds, the authors estimate the capacity of observers (human and non-human primates) to segment overlapping motion stimuli. Consistent with previous work, perceptual segmentation ability depends on the mean of the two speeds. Recordings from area MT in monkeys show that the neuronal population to compound stimuli often shows a bias towards the faster-speed stimuli. This bias can be accounted for with a computational model that modulates single-neuron firing rates by the speed preferences of the population. The authors also test the capacity of a linear classifier to produce the psychophysical results from the MT data.

      Strengths:

      Overall, this is a thorough treatment of the question of visual segmentation with speed cues. Previous work has mostly focused on other kinds of cues (direction, disparity, color), so the neurophysiological results are novel. The connection between MT activity and perceptual segmentation is potentially interesting, particularly as it relates to existing hypotheses about population coding.

      We thank the Reviewer for the summary and comments.

      Weaknesses:

      Page 10: The relationship between (R-Rs) and (Rf-Rs) is described as "remarkably linear". I don't actually find this surprising, as the same term (Rs) appears on both the x- and y-axes. The R^2 values are a bit misleading for this reason.

      The Reviewer is correct that subtracting a common term Rs from R and Rf would introduce correlation between (R-Rs) and (Rf-Rs). To address this concern, we conducted an additional analysis. We showed that, at most speed pairs, the R^2 values between (R-Rs) and (Rf-Rs) based on the data are significantly higher than the R^2 values between (R’-Rs) and (Rf-Rs), in which R’ was a random combination of Rs and Rf. Since the same Rs was commonly subtracted in calculating R^2 (data) and R^2 (simulation), the difference between R^2 (data) and R^2 (simulation) suggests that the response pattern of R contributes to the additional correlation.

      We now acknowledge this confounding factor and describe the new analysis results on page 14, lines 309 – 326. Please also see the response to Reviewer 3 about a similar concern.
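
      The confound itself can be illustrated with a short simulation. This is a sketch of the Reviewer's point only, not a reproduction of the control analysis described above: it assumes R, Rs, and Rf are independent Gaussian variables with equal variance, in which case the shared term Rs alone produces a correlation of about 0.5 (R^2 of about 0.25) between (R-Rs) and (Rf-Rs). The sample size, seed, and the `pearson` helper are illustrative choices.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000
# Three mutually independent response variables with equal variance.
r = [rng.gauss(0.0, 1.0) for _ in range(n)]
rs = [rng.gauss(0.0, 1.0) for _ in range(n)]
rf = [rng.gauss(0.0, 1.0) for _ in range(n)]

corr_raw = pearson(r, rf)  # near 0: R and Rf are unrelated by construction
corr_shared = pearson([a - s for a, s in zip(r, rs)],
                      [f - s for f, s in zip(rf, rs)])  # near 0.5: induced by Rs

print(f"corr(R, Rf)       = {corr_raw:.3f}")
print(f"corr(R-Rs, Rf-Rs) = {corr_shared:.3f}  (R^2 = {corr_shared ** 2:.3f})")
```

      Any R^2 measured on the data therefore needs to be compared against a baseline that shares the same subtraction, which is the logic of comparing R^2 (data) with R^2 (simulation).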

      Figure 9: I'm confused about the linear classifier section of the paper. The idea makes sense - the goal is to relate the neuronal recordings to the psychophysical data. However the results generally provide a poor quantitative match to the psychophysical data. There is mention of a "different paper" (page 26) involving a separate decoding study, as well as a preprint by Huang et al. (2023) that has better decoding results. But the Huang et al. preprint appears to be identical to the current manuscript, in that neither has a Figure 12, 13, or 14. The text also says (page 26) that the current paper is not really a decoding study, but the linear classifier (Figure 9F) is a decoder, as noted on page 10. It sounds like something got mixed up in the production of two or more papers from the same dataset.

      We apologize for the confusion regarding the reference of Huang et al. (2023, bioRxiv). We referred to an earlier version of this bioRxiv manuscript (version 1), which included decoding analysis. In the bibliography, we provided two URLs for this pre-print. While the second link was correct, the first URL automatically links to the latest version (version 2), which did not have the abovementioned decoding analysis.

      The analysis in Figure 9 is to apply a classifier to discriminate two-speed from single-speed stimuli, which is a decoding analysis as the Reviewer pointed out. We revised the result section about the classifier to make it clear what the classifier can and cannot explain (pages 22-23, lines 516-534). We also included a sentence at the end of this section that leads to additional decoding analysis to extract motion speed(s) from MT population responses (page 23, lines 541-543), “To directly evaluate whether the population neural responses elicited by the bi-speed stimulus carry information about two speeds, it is important to conduct a decoding analysis to extract speed(s) from MT population responses.”

      In any case, I think that some kind of decoding analysis would really strengthen the current paper by linking the physiology to the psychophysics, but given the limitations of the linear classifier, a more sophisticated approach might be necessary -- see for example Zemel, Dayan, and Pouget, 1998. The authors might also want to check out closely related work by Treue et al. (Nature Neuroscience 2000) and Watamaniuk and Duchon (1992).

      We thank the Reviewer for the suggestion and agree that it is useful to incorporate additional decoding analysis that can better link physiology results to psychophysics. The decoding analysis we conducted was motivated by the framework proposed by Zemel, Dayan, and Pouget (1998), and also similar to the idea briefly mentioned in the Discussion of Treue et al. (2000). We have added the decoding analysis to this paper on pages 25-32.  

      What do we learn from the normalization model? Its formulation is mostly a restatement of the results - that the faster and slower speeds differentially affect the combined response. This hypothesis is stated quantitatively in equation 8, which seems to provide a perfectly adequate account of the data. The normalization model in equation 10 is effectively the same hypothesis, with the mean population response interposed - it's not clear how much the actual tuning curve in Figure 10A even matters, since the main effect of the model is to flatten it out by averaging the functions in Figure 10B. Although the fit to the data is reasonable, the model uses 4 parameters to fit 5 data points and is likely underconstrained; the parameters other than alpha should at least be reported, as it would seem that sigma is actually the most important one. And I think it would help to examine how robust the statistical results are to different assumptions about the normalization pool.

      In the linear weighted summation (LWS) model (Eq. 8), the weights Ws and Wf are free parameters. We think the value of the normalization model (Eq. 9) is that it provides an explanation of what determines the response weights. We agree with the Reviewer that using the normalization model (Eq. 9) with 4 parameters to fit 5 data points of the tuning curves to bi-speed stimuli of individual neurons is under-constrained. We, therefore, removed the section using the normalization model to fit overlapping stimuli moving in the same direction at different speeds.

      A better way to constrain the normalization model is to use the full direction-tuning curves of MT neurons in response to two stimulus components moving in different directions at different speeds, as shown in Figure 8. We now use the normalization model (Eq. 9) to fit this data set (also suggested by Reviewer 1), in addition to the LWS model. We now report the median values of the model parameters of the normalization model, including the exponent n, sigma, alpha, and the constant c. We also compared the normalization model fit with the linear summation (LWS) model. We discuss the limitations of our data set and what needs to be done in future studies. The revisions are on page 20, lines 434-467 in the Results, and pages 34-35, lines 818-829 in Discussion.

      Reviewer #3 (Public Review):

      Summary:

      This study concerns how macaque visual cortical area MT represents stimuli composed of more than one speed of motion.

      Strengths:

      The study is valuable because little is known about how the visual pathway segments and preserves information about multiple stimuli. The study presents compelling evidence that (on average) MT neurons represent the average of the two speeds, with a bias that accentuates the faster of the two speeds. An additional strength of the study is the inclusion of perceptual reports from both humans and one monkey participant performing a task in which they judged whether the stimuli involved one vs two different speeds. Ultimately, this study raises intriguing questions about how exactly the response patterns in visual cortical area MT might preserve information about each speed, since such information could potentially be lost in an average response as described here, depending on assumptions about how MT activity is evaluated by other visual areas.

      Weaknesses:

      My main concern is that the authors are missing an opportunity to make clear that the divisive normalization, while commonly used to describe neural response patterns in visual areas (and which fits the data here), fails on the theoretical front as an explanation for how information about multiple stimuli can be preserved. Thus, there is a bit of a disconnect between the goal of the paper - how does MT represent multiple stimuli? - and the results: mostly averaging responses which, while consistent with divisive normalization, would seem to correspond to the perception of a single intermediate speed. This is in contrast to the psychophysical results which show that subjects can at least distinguish one from two speeds. The paper would be strengthened by grappling with this conundrum in a head-on manner.

      We thank the Reviewer for the constructive comments. We agree with the Reviewer that it is important to connect the encoding of multiple speeds with the perception. The Reviewer also raised an important question regarding whether multiple speeds can be extracted from population neural responses, given the encoding rules characterized in this study.

      It is a hard problem to extract multiple stimulus values from the population neural response. Inspired by the theoretical framework proposed by Zemel et al. (1998), we conducted a detailed decoding study to extract motion speed(s) from MT population responses. We used the decoded speed(s) to perform a discrimination task similar to our psychophysics task and compared the decoder's performance with perception. We found that, at X4 speed difference, we could decode two speeds based on MT response, and the decoder's performance was similar to that of perception. However, at X2 speed difference, except at the slowest speeds of 1.25 and 2.5 deg/s, the decoder cannot extract two speeds and cannot differentiate between a bi-speed stimulus and a single log-mean speed stimulus. We have added the decoding analysis to this paper on pages 25-32. We also discuss the implications and limitations of these results (pages 35-36, lines 852-884).
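
      For reference, the single-speed comparison stimulus can be computed as below. This is a minimal sketch that assumes "log-mean speed" means the mean taken on a logarithmic speed axis, which equals the geometric mean of the two component speeds; the function name is hypothetical.

```python
import math

def log_mean_speed(slow_deg_per_s, fast_deg_per_s):
    """Mean of two speeds on a log axis (equivalently, their geometric mean)."""
    return math.exp((math.log(slow_deg_per_s) + math.log(fast_deg_per_s)) / 2.0)

# X2 speed difference at the slowest pair mentioned above (1.25 and 2.5 deg/s).
x2_mean = log_mean_speed(1.25, 2.5)
# X4 speed difference, using the 2.5 and 10 deg/s pair.
x4_mean = log_mean_speed(2.5, 10.0)

print(f"X2 pair log-mean: {x2_mean:.2f} deg/s")
print(f"X4 pair log-mean: {x4_mean:.2f} deg/s")
```

      Under this assumption, each component of an X2 pair lies a factor of about 1.41 from the log-mean speed, and each component of an X4 pair lies a factor of 2 from it.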

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Classifier:

      One question I have is how the classifier's performance scales with the number of neurons used in the analysis. Here that number is set to the number that was recorded, but it is a free parameter in this analysis. Why does the arbitrary choice of 100 neurons match the animals' performance?

      We apologize for the lack of clarity on this point. The classifier-based decoding used the neural responses of the 100 recorded MT neurons in our data set, so the number of neurons was not a free parameter. We needed to reconstruct the population neural response from the responses of the recorded neurons and their preferred speeds (red and black dots in Figure 9A-E).

      We spline-fitted the reconstructed population neural responses (red and black curves in Figure 9A-E). One way to change the number of neurons used for decoding is to resample N points along the spline-fitted population responses, using N as a free parameter. However, we think it is better to decode from the responses of the recorded neurons than from interpolated responses. We now clarify on page 22, lines 520-522, that the classification (decoding) was based on the responses of the 100 recorded neurons in our dataset.

      Normalization Model:

      Although the model is phenomenological, a schematic circuit diagram could help the reader understand how this could work (I think this is worthwhile even though the data cannot distinguish among different implementations of divisive normalization).

      Thanks for this suggestion. We agree that a circuit diagram would help the readers understand how the model works. However, as the Reviewer pointed out, our data cannot distinguish between different implementations of the model. For example, divisive normalization can occur on the inputs to MT neurons or on MT neurons themselves. The circuit mechanism of weighting the component responses is not clear either. A schematic circuit diagram then mainly serves to recapitulate the normalization model in Equation 9. We, therefore, choose not to add a schematic circuit diagram at this time. We are interested in developing a circuit model to account for how visual neurons represent multiple stimuli in future studies.

      Another suggestion is that the time courses could be used to constrain the model; the fact that it takes a while after the onset of the slow-speed response for averaging to reveal itself suggests the presence of inertia/hysteresis in the circuit.

      We agree that the time course of MT responses could be used to constrain the model. This is also why we think it is important to document the time course in this paper. We now state in the Results, page 17, lines 354-357:

      “At slow speeds, the very early faster-speed bias suggests a likely role of feedforward inputs to MT on the faster-speed bias. The slightly delayed reduction (normalization) in the bi-speed response relative to the stronger component response also helps constrain the circuit model for divisive normalization.”

      Two-Direction Experiment:

      Applying the normalization model to this dataset could help determine its generality.

      This is a good point. We now apply the normalization model (Eq. 9) to fit this data set with the full direction tuning curves in response to two stimuli moving in different directions at different speeds. Please also see the response to Reviewer 2 about the normalization model fit.

      The results of the normalization model fit are now described on page 20 and Figure 8A, B, D.

      Reviewer #2 (Recommendations For The Authors):

      In terms of impact, I would say that the presentation is geared largely toward people who go to VSS. To broaden the appeal, the authors might consider a more general formulation of the four hypotheses stated at the bottom of page 3. These are prominent ideas in systems neuroscience - population encoding, Bayesian inference, etc.

      We thank the Reviewer for the suggestion. We have revised the Introduction accordingly on pages 3-4, lines 43-69. Please also see the response to Reviewer 3 about the Introduction.

      Figure 5: It might be helpful to show the predictions for different hypotheses. If the response to the transparent stimulus is equal to that of the faster stimulus, you will have a line with slope 1. If it is equal to the response to the slow stimulus, all points will lie on the x-axis. In between you get lines with slopes less than 1.

      In Figures 5F1 and 5F2, we show dotted lines indicating faster-all (i.e., faster-component-take-all), response averaging, and slower-all (i.e., slower-component-take-all) on the X-axis. We show those labels in between Figs. 5F1 and F2.

      Figure 6: The analysis is not motivated by any particular question, and the results are presented without any quantitation. This section could be better motivated or else removed.

      We now better motivate the section about the response time course on page 16, lines 336 – 339: “The temporal dynamics of the response bias toward the faster component may provide a useful constraint on the neural model that accounts for this phenomenon. We therefore examined the timecourse of MT response to the bi-speed stimuli. We asked whether the faster-speed bias occurred early in the neuronal response or developed gradually.”

      On page 17, lines 354-357, we also state that “At slow speeds, the very early faster-speed bias suggests a likely role of feedforward inputs to MT on the faster-speed bias. The slightly delayed reduction (normalization) in the bi-speed response relative to the stronger component response also helps constrain the circuit model for divisive normalization.”

      Equation (9): There appears to be an "S" missing in the denominator.

      We double-checked and did not see a missing "S" in Equation 9, on page 20.  

      Reviewer #3 (Recommendations For The Authors):

      This is an impressive study, with the chief strengths being the computational/theoretical motivation and analyses and the inclusion of psychophysics together with primate neurophysiology. The manuscript is well-written and the figures are clear and convincing (with a couple of suggestions detailed below).

      We thank the Reviewer for the comments.

      Specific suggestions:

      (1) Intro para 3

      "It is conceivable that the responses of MT neurons elicited by two motion speeds may follow one of the following rules: (1) averaging the responses elicited by the individual speed components; (2) bias toward the speed component that elicits a stronger response, i.e. "soft-max operation" (Riesenhuber and Poggio, 1999); (3) bias toward the slower speed component, which may better represent the more probable slower speeds in nature scenes (Weiss et al., 2002); (4) bias toward the faster speed component, which may benefit the segmentation of a faster-moving stimulus from a slower background."

      This would be a good place to point out which of these options is likely to preserve vs. lose information and how.

      It seems to me that only #2 is clearly information-preserving, assuming that there are neurons with a variety of different speed preferences such that different neurons will exhibit different "winners". #1 would predict subjects would perceive only an intermediate speed, whereas #3 would predict perceiving only/primarily the slower speed and #4 would predict only/primarily perceiving the faster speed.

      The difference between "only" and "primarily" would depend on whether the biases are complete or only partial. I acknowledge that the behavioral task in the study is not a "report all perceived speeds" task, but rather a 1 vs 2 speeds task, so the behavioral assay is not a direct assessment of the question I'm raising here, but I think it should still be possible to write about the perceptual implications of these different possibilities for encoding in an informative way.

      Thanks for the suggestions. We have revised this paragraph in the Introduction on pages 3 – 4, lines 43 – 69.

      (2) Analysis clarifications

      The section "Relationship between the responses to bi-speed stimuli and constituent stimulus components" could use some clarification/rearrangement/polish. I had to read it several times. Possibly, rearrangement, simplification/explanation of nomenclature, and building up from a simpler to a more complex case would help. If I understand correctly, the outcome of the analysis is to obtain a weight value for every combination of slow and fast speeds used. The R's in equation 5 are measured responses, observed on the single stimulus and combined stimulus trials. It was not clear to me if the R's reflect average responses or individual trial responses; this should be clarified. Ws = 1- wf so in essence only 1 weight is computed for each combination. Then, in the subsequent sections of the manuscript, the authors explore whether the weight computed for each stimulus combination is the same or does it vary across conditions. If I have this right, then walking through these steps will aid the reader.

      The Reviewer is correct. We now walk through these steps and better state the rationale for this approach. The R's in Equation 5 are trial-averaged responses, not trial-by-trial responses.

      We have clarified these points on page 13.

      To take a particular example, the sentence "Using this approach to estimate the response weights for individual neurons can be inaccurate because, at each speed pair, the weights are determined only by three data points" struck me as a rather backdoor way to get at the question. Is the estimate noisy? Or does the weighting vary systematically across speeds? I think the authors are arguing the latter; if so, it would be valuable to say so.

      We wanted to estimate the weighting for each speed pair and determine whether the weights change with the stimulus speeds. Indeed, we found that the weights change systematically across speed pairs. The issue was not that the estimate was noisy (see below, in the response to the second paragraph of point 3).

      We have clarified this point in the text, on page 13, lines 273 – 280: “Our goal was to estimate the weights for each speed pair and determine whether the weights change with the stimulus speeds. In our main data set, the two speed components moved in the same direction. To determine the weights w<sub>s</sub> and w<sub>f</sub> for each neuron at each speed pair, we have three data points R, R<sub>s</sub>, and R<sub>f</sub>, which are trial-averaged responses. Since it is not possible to solve for both variables, w<sub>s</sub> and w<sub>f</sub>, from a single equation (Eq. 5) with three data values, we introduced an additional constraint: w<sub>s</sub> + w<sub>f</sub> = 1. While this constraint may not yield the exact weights that would be obtained with a fully determined system, it nevertheless allows us to characterize how the relative weights vary with stimulus speed.”

      (3) Figure 5

      Related to the previous point, Figures 5A-E are subject to a possible confound. When plotting x vs y values, it is critical that the x and y not depend trivially on the same value. Here, the plots are R-Rs and Rf-Rs. Rs, therefore, is contained in both the x and y values. Assume, for the sake of argument, that R and Rf are constants, whereas Rs is drawn from a distribution of random noise. When Rs, by chance, has an extreme negative value, R-Rs and Rf-Rs will be large positive values. The solution to this artificial confound is to split the trials that generate Rs into two halves and subtract one half from R and the other half from Rf. Then, the same noisy draw will not be contributing to both x and y. The above is what is needed if the authors feel strongly about including this analysis.

      The Reviewer is correct that subtracting a common term (Rs) would introduce a correlation between (R-Rs) and (Rf-Rs) (Reviewer 2 also raised this point). R's in Equations 5, 6, 7 (and Figure 5A-E) are trial-averaged responses. So, we cannot address the issue by dividing R’s into two halves. Our results showed that the regression slope (W<sub>f</sub>) changed from near 1 to about 0.5 as the stimulus speeds increased, and the correlation coefficient between (R – Rs) and (R<sub>f</sub> – Rs) was high at slow stimulus speeds. To determine whether these results can be explained by the confounding factor of subtracting a common term Rs, rather than by the pattern of R in representing two speeds, we did an additional analysis. We acknowledged the issue and described the new analysis on page 13, lines 303 – 326:

      “Our results showed that the bi-speed response showed a strong bias toward the faster component when the speeds were slow and changed progressively from a scheme of ‘faster-component-take-all’ to ‘response-averaging’ as the speeds of the two stimulus components increased (Fig. 5F1). We found similar results when the speed separation between the stimulus components was small (×2), although the bias toward the faster component at low stimulus speeds was not as strong as at the ×4 speed separation (Fig. 5A2-F2 and Table 1).

      In the regression between (𝑅 – 𝑅<sub>s</sub>) and (𝑅<sub>f</sub> – 𝑅<sub>s</sub>), 𝑅<sub>s</sub> was a common term and therefore could artificially introduce correlations. We wanted to determine whether our estimates of the regression slope (𝑤<sub>f</sub>) and the coefficient of determination (𝑅<sup>2</sup>) can be explained by this confounding factor. At each speed pair and for each neuron from the data sample of the 100 neurons shown in Figure 5, we simulated the response to the bi-speed stimuli (𝑅<sub>e</sub>) as a randomly weighted sum of 𝑅<sub>f</sub> and 𝑅<sub>s</sub> of the same neuron.

      𝑅<sub>e</sub> = 𝑎𝑅<sub>f</sub> + (1 − 𝑎)𝑅<sub>s</sub>,

      in which 𝑎 was a randomly generated weight (between 0 and 1) for 𝑅<sub>f</sub>, and the weights for 𝑅<sub>f</sub> and 𝑅<sub>s</sub> summed to one. We then calculated the regression slope and the correlation coefficient between the simulated 𝑅<sub>e</sub> - 𝑅<sub>s</sub> and 𝑅<sub>f</sub> - 𝑅<sub>s</sub> across the 100 neurons. We repeated the process 1000 times and obtained the mean and 95% confidence interval (CI) of the regression slope and the 𝑅<sup>2</sup>. The mean slope based on the simulated responses was 0.5 across all speed pairs. The estimated slope (𝑤<sub>f</sub>) based on the data was significantly greater than the simulated slope at slow speeds of 1.25/5, 2.5/10 (Fig. 5F1), and 1.25/2.5, 2.5/5, and 5/10 degrees/s (Fig. 5F2) (bootstrap test, see p values in Table 1). The estimated 𝑅<sup>2</sup> based on the data was also significantly higher than the simulated 𝑅<sup>2</sup> for most of the speed pairs (Table 1). These results suggest that the faster-speed bias at the slow stimulus speeds and the consistent response weights across the neuron population at each speed pair are not analysis artifacts.”
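
      The control simulation described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name, the uniform draw of the random weight, the percentile-based 95% interval, and the synthetic component responses are all assumptions made for the example.

```python
import numpy as np


def simulate_common_term_control(r_f, r_s, n_repeats=1000, seed=0):
    """Null slope and R^2 when the bi-speed response is a random mix of R_f and R_s.

    r_f, r_s: trial-averaged responses of each neuron to the faster and
    slower component (1-D arrays of equal length, one entry per neuron).
    """
    rng = np.random.default_rng(seed)
    x = r_f - r_s  # predictor; shares the common term R_s with y
    slopes = np.empty(n_repeats)
    r2 = np.empty(n_repeats)
    for i in range(n_repeats):
        a = rng.uniform(0.0, 1.0, size=x.shape)  # random weight per neuron
        r_e = a * r_f + (1.0 - a) * r_s          # simulated bi-speed response
        y = r_e - r_s
        slopes[i] = np.polyfit(x, y, 1)[0]       # regression slope
        r2[i] = np.corrcoef(x, y)[0, 1] ** 2     # coefficient of determination
    return {
        "slope_mean": slopes.mean(),
        "slope_ci": np.percentile(slopes, [2.5, 97.5]),
        "r2_mean": r2.mean(),
        "r2_ci": np.percentile(r2, [2.5, 97.5]),
    }


# Hypothetical component responses (spikes/s) for 100 neurons.
rng = np.random.default_rng(1)
r_s = rng.uniform(5.0, 50.0, size=100)
r_f = rng.uniform(5.0, 50.0, size=100)
null = simulate_common_term_control(r_f, r_s)
print(null["slope_mean"], null["slope_ci"])
```

      With weights drawn uniformly between 0 and 1, the expected simulated slope is 0.5, which agrees with the mean simulated slope reported in the quoted passage. An observed slope above the upper bound of the simulated interval would then be hard to attribute to the shared 𝑅<sub>s</sub> term alone.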

      However, I don't see why the analysis is needed at all. Can't Figure 5F be computed on its own? Rather than computing weights from the slopes in 5A-E, just compute the weights from each combination of stimulus conditions for each neuron, subject to the constraint ws=1-wf. I think this would be simpler to follow, not subject to the noise confound described in the previous point, and likely would make writing about the analysis easier.

      We initially tried the suggested approach to determine the weights of the individual neurons. Given R = 𝑤<sub>s</sub> · 𝑅<sub>s</sub> + 𝑤<sub>f</sub> · 𝑅<sub>f</sub> and the constraint that 𝑤<sub>s</sub> and 𝑤<sub>f</sub> sum to 1, the weights for each speed combination and each neuron are calculated as 𝑤<sub>s</sub> = (𝑅<sub>f</sub> - 𝑅) / (𝑅<sub>f</sub> - 𝑅<sub>s</sub>) and 𝑤<sub>f</sub> = (𝑅 - 𝑅<sub>s</sub>) / (𝑅<sub>f</sub> - 𝑅<sub>s</sub>). 𝑅, 𝑅<sub>f</sub>, and 𝑅<sub>s</sub> are the responses to the same motion direction. Using this approach to estimate response weights for individual neurons can be unreliable, particularly when 𝑅<sub>f</sub> and 𝑅<sub>s</sub> are similar. This situation often arises when the two speeds fall on opposite sides of the neuron's preferred speed, resulting in a small denominator (𝑅<sub>f</sub> - 𝑅<sub>s</sub>) and, consequently, an artificially inflated weight estimate. We therefore used an alternative approach. We estimated the response weights for the neuronal population at each speed pair using linear regression of (𝑅 - 𝑅<sub>s</sub>) against (𝑅<sub>f</sub> - 𝑅<sub>s</sub>). The slope is the weight for the faster component for the population. This approach overcame the difficulty of determining the response weights for single neurons.

      Nevertheless, if the data provide better constraints, it is possible to estimate the response weights for each speed pair for individual neurons. For example, we can calculate the weights for single neurons by using stimuli that move in different directions at two speeds. By characterizing the full direction tuning curves for R, R<sub>f</sub>, and R<sub>s</sub>, we have sufficient data to constrain the response weights for single neurons, as we did for the speed pair of 2.5 and 10°/s in Figure 8. In future studies, we can use this approach to measure the response weights for single neurons at different speed pairs and average the weights across the neuron population.

      We explain these considerations in the Results (pages 13–14, lines 265-326) and Discussion (pages 34-35, lines 818-829).
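
      The contrast between the per-neuron ratio and the population regression can be illustrated with a small numerical sketch. The function names and the four-neuron data set are hypothetical and chosen only to show why a small (𝑅<sub>f</sub> - 𝑅<sub>s</sub>) inflates the single-neuron estimate; this is not the authors' analysis code.

```python
import numpy as np


def single_neuron_weight(r, r_f, r_s):
    """w_f for each neuron from R = w_s*R_s + w_f*R_f with w_s + w_f = 1."""
    return (r - r_s) / (r_f - r_s)


def population_weight(r, r_f, r_s):
    """w_f for the population: slope of (R - R_s) regressed on (R_f - R_s)."""
    return np.polyfit(r_f - r_s, r - r_s, 1)[0]


# Hypothetical responses of four neurons; the last has R_f close to R_s.
r_s = np.array([10.0, 12.0, 30.0, 20.0])
r_f = np.array([40.0, 30.0, 10.0, 20.5])
noise = np.array([1.0, -1.0, 1.0, 1.0])   # small measurement error
r = 0.7 * r_f + 0.3 * r_s + noise          # true w_f = 0.7

w_single = single_neuron_weight(r, r_f, r_s)
w_pop = population_weight(r, r_f, r_s)
print(w_single, w_pop)
```

      In this toy example the same 1 spike/s error barely moves the estimate for neurons with well-separated component responses, but pushes the estimate for the last neuron far above the true weight, while the regression slope stays close to it.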

      (4) Figure 7

      Bidirectional analysis. It would be helpful to have a bit more explanation for why this analysis is not subject to the ws=1-wf constraint. In Figure 7B, a line could be added to show what ws + wf =1 would look like (i.e. a line with slope -1 going from (0,1) to (1,0); it looks like these weights are a little outside that line but there is still a negative trend suggesting competition.

      For the data set in which the visual stimuli moved in the same direction at different speeds, we included a constraint that W<sub>s</sub> and W<sub>f</sub> sum to 1. This is because one cannot solve for two independent variables (W<sub>s</sub> and W<sub>f</sub>) using one equation, R = W<sub>s</sub> · R<sub>s</sub> + W<sub>f</sub> · R<sub>f</sub>, with three data values (R, R<sub>s</sub>, R<sub>f</sub>).

      In the dataset using bi-directional stimuli (now Fig. 8), we can use the full direction tuning curves to constrain the linear weighted summation (LWS) model and the normalization model. So, we did not need to impose the additional constraint that W<sub>s</sub> and W<sub>f</sub> sum to one, which makes the fit more general. We now clarify this in the text, on page 19, lines 421-423.

      As suggested, we added a line showing W<sub>s</sub> + W<sub>f</sub> = 1 for the LWS model fit (Fig. 8C) and the normalization model fit (Fig. 8D) (also see page 21, lines 482-484). Although 𝑤<sub>s</sub> and 𝑤<sub>f</sub> are not constrained to sum to one in the model fits, the fitted weights are roughly aligned with the dashed lines of W<sub>s</sub> + W<sub>f</sub> = 1.
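
      To make concrete why full direction tuning curves remove the need for the sum-to-one constraint, here is a minimal sketch of an unconstrained least-squares fit of two weights. It assumes a plain two-term weighted sum with no offset or nonlinearity, hypothetical von Mises-shaped tuning curves, and made-up function names; the authors' LWS and normalization fits may differ in form and fitting procedure.

```python
import numpy as np


def fit_weighted_sum(r_bi, r_s, r_f):
    """Least-squares (w_s, w_f) in R(theta) = w_s*R_s(theta) + w_f*R_f(theta)."""
    design = np.column_stack([r_s, r_f])  # one row per stimulus direction
    weights, *_ = np.linalg.lstsq(design, r_bi, rcond=None)
    return weights


def tuning(theta_deg, preferred_deg, baseline=5.0, gain=40.0, kappa=2.0):
    """Hypothetical von Mises-shaped direction tuning curve (spikes/s)."""
    delta = np.deg2rad(theta_deg - preferred_deg)
    return baseline + gain * np.exp(kappa * (np.cos(delta) - 1.0))


theta = np.arange(0.0, 360.0, 30.0)                  # 12 stimulus directions
r_s = tuning(theta, preferred_deg=60.0)              # slower component alone
r_f = tuning(theta, preferred_deg=-30.0, gain=60.0)  # faster component alone
r_bi = 0.3 * r_s + 0.8 * r_f                         # weights need not sum to one
w_s, w_f = fit_weighted_sum(r_bi, r_s, r_f)
print(w_s, w_f)
```

      With twelve directions there are twelve equations for two unknowns, so both weights are determined by the data and their sum can be compared with 1 after the fit rather than imposed beforehand.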

      (5) Attention task

      General wording suggestions - a caution against using "attention" as a causal/mechanistic explanation as opposed to a hypothesized cognitive state. For example, "We asked whether the faster-speed bias was due to bottom-up attention being drawn toward the faster stimulus component". This could be worded more conservatively as whether the bias is "still present if attention is directed elsewhere" - i.e. a description of the experimental manipulation.

      We intended to test whether the faster-speed bias can be explained by attention being automatically drawn to the faster component, thereby enhancing the contribution of the faster component to the bi-speed response. We now state it as a possible explanation to be tested. We changed the subtitle of this section to be more conservative: “Faster-speed bias still present when attention was directed away from the RFs”, on page 18, line 363.

      We also modified the text on page 18, lines 364-367: “One possible explanation for the faster-speed bias may be that bottom-up attention is drawn toward the faster stimulus component, enhancing the response to the faster component. To address this question, we asked whether the faster-speed bias was still present if attention was directed away from the RFs.”

      Relatedly, in the Discussion, the section on "Neural mechanisms", the sentence "The faster-speed bias was not due to an attentional modulation" should be rephrased as something like 'the bias survived or was still present despite an attentional modulation requiring the monkey to attend elsewhere'.

      Our motivation for doing the attention-away experiment was to determine whether a bottom-up attentional modulation can explain the faster-speed bias. We now describe the results as suggested by the Reviewer. But we’d also like to interpret the implications of the results. In Discussion, page 34, lines 789-790, we now state: “We found that the faster-speed bias was still present when attention was directed away from the RFs, suggesting that the faster-speed bias cannot be explained by an attentional modulation.”  

      (6) "A model that accounts for the neuronal responses to bi-speed stimuli". This section opens with: "We showed that the neuronal response in MT to a bi-speed stimulus can be described by a weighted sum of the neuron's responses to the individual speed components". "Weighted average" would be more appropriate here, given that ws = 1-wf.

      As mentioned above, the added constraint of Ws+Wf = 1 was only a practical solution for determining the weights for the data set using visual stimuli moving in the same direction. More generally, Ws and Wf do not need to sum to one. As such, we prefer the wording of weighted sum.

      (7) "As we have shown previously using visual stimuli moving transparently in different directions, a classifier's performance of discriminating a bi-directional stimulus from a single-direction stimulus is worse when the encoding rule is response-averaging than biased toward one of the stimulus components" - this is important! Can this be worked into the Introduction?

      Yes, we now also mention this point in the Introduction regarding response averaging on page 4, lines 54-57: “While decoding two stimuli from a unimodal response is theoretically possible (Zemel et al., 1998; Treue et al., 2000), response averaging may result in poorer segmentation compared to encoding schemes that emphasize individual components, as demonstrated in neural coding of overlapping motion directions (Xiao and Huang, 2015).” Also, please see the response to point 1 above.

      (8) Minor, but worth catching now - is the use of initials for human participants consistent with best practices approved at your institution?

      Thanks for checking. The letters are not the initials of the human subjects. They are coded characters. We have clarified it in the legend of Figure 1, on page 7, line 168.

    1. Author Response

      Reviewer #1 (Public Review):

      Summary:

      In this paper, the effects of two sensory stimuli (visual and somatosensory) on fMRI responsiveness during absence seizures were investigated in GAERS rats with concurrent EEG recordings. SPM analysis of fMRI showed a significant reduction in whole-brain responsiveness during the ictal period compared to the interictal period under both stimuli, and this phenomenon was replicated in a structurally constrained whole-brain computational model of rat brains.

      The conclusion of this paper is that whole-brain responsiveness to both sensory stimuli is inhibited and spatially impeded during seizures.

      I also suggest the manuscript should be written in a way that is more accessible to readers who are less familiar with animal experiments. In addition, the implementation and interpretation of brain simulations need to be more careful and clear.

      Several sections of the manuscript were clarified and simplified to be more accessible. Also, implementation and interpretations of brain simulations were modified to be more precise.

      Strengths:

      1) ZTE imaging sequence was selected over traditional EPI sequence as the optimal way to perform fMRI experiments during absence seizures.

      2) A detailed classification of stimulation periods is achieved based on the relative position in time of the stimulation period with respect to the brain state.

      3) A whole-brain model embedded with a realistic rat connectome is simulated on the TVB platform to replicate fMRI observations.

      We thank the reviewer for indicating the strengths of our manuscript.

      Weaknesses:

      1) The analysis in this paper does not directly answer the scientific question posed by the authors, which is to explore the mechanisms of the reduced brain responsiveness to external stimuli during absence seizures (in terms of altered information processing), but merely characterizes the spatial involvement of such reduced responsiveness. The same holds for the use of mean-field modeling, which merely reproduces experimental results without explaining them mechanistically as what the authors have claimed at the head of the paper.

      We agree with the reviewer that the manuscript does not specifically address the mechanisms of reduced brain responsiveness. The main scientific question addressed in the manuscript was how whole-brain responsiveness to stimulation differs between ictal and interictal states. The sentence in the abstract that could lead to misinterpretation, “The mechanism underlying the reduced responsiveness to external stimulus remains unknown.”, was therefore changed to the following: “The whole-brain spatial and temporal characteristics of reduced responsiveness to external stimulus remains unknown”.

      2) The implementations of brain simulations need to be more specific.

      Contribution:

      The contribution of this paper is performing fMRI experiments under a rare condition that could provide fresh knowledge in the imaging field regarding the brain's responsiveness to environmental stimuli during absence seizures.

      Reviewer #2 (Public Review):

      Summary:

      This study examined the possible effect of spike-wave discharges (SWDs) on the response to visual or somatosensory stimulation using fMRI and EEG. This is a significant topic because SWDs often are called seizures and because there is non-responsiveness at this time, it would be logical that responses to sensory stimulation are reduced. On the other hand, in rodents with SWDs, sensory stimulation (a noise, for example) often terminates the SWD/seizure.

      In humans, these periods of SWDs are due to thalamocortical oscillations. A certain percentage of the normal population can have SWDs in response to photic stimulation at specific frequencies. Other individuals develop SWDs without stimulation. They disrupt consciousness. Individuals have an absent look, or "absence", which is called absence epilepsy.

      The authors use a rat model to study the responses to stimulation of the visual or somatosensory systems during and in between SWDs. They report that the response to stimulation is reduced during the SWDs. While some data show this nicely, the authors also report on lines 396-8 "When comparing statistical responses between both states, significant changes (p<0.05, cluster-) were noticed in somatosensory auditory frontal..., with these regions being less activated in interictal state (see also Figure 4)." That statement is at odds with their conclusion.

      We thank the reviewer for noting this discrepancy. The statement should have been written vice versa and it has been corrected as: “When comparing statistical responses between both states, significant changes (p<0.05, cluster-level corrected) were noticed in the somatosensory, auditory and frontal cortices: these regions were less activated in ictal than in interictal state (see also Figure 4).”

      They also conclude that stimulation slows the pathways activated by the stimulus. I do not see any data proving this. It would require repeated assessments of the pathways in time.

      We agree with the reviewer that there are no data showing slowing of the pathways in response to the stimulus. However, we are unsure which part of the conclusion section this comment refers to. We did not intend to claim in the manuscript that stimulation slows the activated pathways.

      The authors also study the hemodynamic response function (HRF) and it is not clear what conclusions can be made from the data.

      Hemodynamic response functions were studied for two reasons:

      • To account for possible change in HRF during the detection of activated regions. Indeed, a physiological change in HRF can mask the detection of an activation when the software uses a standard HRF to convolve the design matrix (David et al. 2008).

      • To characterize the shape and polarity of fMRI activations in brain regions that we found to be differently activated between ictal and interictal states, and to evaluate whether the alteration in activation was associated with an alteration in hemodynamics.

      The observed decrease (rather than increase) in the cortical HRF when stimulation was applied during SWD was discussed in section 4.4, where we speculated that neuronal suppression caused by SWD can prevent responsiveness. In this case, the decreased HRF could be either a consequence or a cause of the observed neuronal suppression. The assumption that the HRF reduction is causal would be supported by a possible vascular steal effect from other activated regions. However, we did not state this in the conclusion section, and therefore the following sentence was added to the conclusions: “Moreover, the detected decreases in the cortical HRF when sensory stimulation was applied during spike-and-wave discharges, could play a role in decreased sensory perception. Further studies are required to evaluate whether this HRF change is a cause or a consequence of the reduced neuronal response”.

      Finally, the authors use a model to analyze the data. This model is novel and while that is a strength, its validation is unclear. The conclusion is that the modeling supports the conclusions of the study, which is useful.

      Details about the model were added.

      Strengths:

      Use of fMRI and EEG to study SWDs in rats.

      Weaknesses:

      Several aspects of the Methods and Results are unclear.

      Reviewer #3 (Public Review):

      Summary:

      This is an interesting paper investigating fMRI changes during sensory (visual, tactile) stimulation and absence seizures in the GAERS model. The results are potentially important for the field and do suggest that sensory stimulation may not activate brain regions normally during absence seizures. However the findings are limited by substantial methodological issues that do not enable fMRI signals related to absence seizures to be fully disentangled from fMRI signals related to the sensory stimuli.

      Strengths:

      Investigating fMRI brain responses to sensory stimuli during absence seizures in an animal model is a novel approach with the potential to yield important insights.

      The use of an awake, habituated model is a valid and potentially powerful approach.

      Weaknesses:

      The major difficulty with interpreting the results of this study is that the duration of the visual and auditory stimuli was 6 seconds, which is very close to the mean seizure duration per Table 1. Therefore the HRF model looking at fMRI responses to visual or auditory stimuli occurring during seizures was simultaneously weighting both seizure activity and the sensory (visual or auditory) stimuli over the same time intervals on average. The resulting maps and time courses claiming to show fMRI changes from visual or auditory stimulation during seizures will therefore in reality contain some mix of both sensory stimulation-related signals and seizure-related signals. The main claim that the sensory stimuli do not elicit the same activations during seizures as they do in the interictal period may still be true. However the attempts to localize these differences in space or time will be contaminated by the seizure-related signals.

      The claims that differences were observed for example between visual cortex and superior colliculus signals with visual stim during seizures vs. interictal are unconvincing due to the above.

      We understand this concern expressed by the reviewer and agree that seizure-related signals must be considered in the analysis when studying stimulation responses. Therefore, in modelling the responses in the SPM framework, we considered both stimulation and seizure-only states as regressors of interest and used seizure-only responses as nuisance regressors to account for error variance. Thereby, the effects caused by the stimulation should, in theory, be separated as much as possible from the effects caused by the seizure itself. Additionally, the cases where stimulations occurred fully inside a seizure (included in Figure 3, “...stimulation during ictal state”) actually had a longer average seizure duration of 45 ± 60 s, which is much longer than the 6 s average duration taken from all seizures.

      However, we acknowledge that some leftover effects from a seizure may still be present, and we have noted this caution in the “Physiologic and methodologic considerations” section: “We note a caution that presented maps and time courses showing fMRI changes from visual or whisker stimulation during seizures may contain a mixture of both sensory stimulation-related signals and seizure-related signals. To minimize this contamination, we considered in SPM both stimulation and seizure-only states as regressors of interest and used seizure-only responses as nuisance regressors to account for error variance. Thereby, the effects caused by the seizure itself should be separated as much as possible from the effects caused by stimulation.”
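
      The regressor scheme described in this response can be illustrated with a minimal sketch, assuming an ordinary least-squares fit in NumPy rather than the SPM pipeline the authors actually used. The boxcar timings, effect sizes and variable names are hypothetical; the point is only that a stimulation estimate is biased when the seizure regressor is left out of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vols = 300                              # hypothetical number of fMRI volumes

# Hypothetical boxcar regressors (1 = condition present in that volume).
seizure = np.zeros(n_vols)
stim = np.zeros(n_vols)
for start in range(10, n_vols, 40):
    seizure[start:start + 12] = 1.0       # seizure blocks
for start in range(14, n_vols, 80):
    stim[start:start + 3] = 1.0           # stimulation inside every other seizure

# Simulated voxel signal: seizure effect -1.0, stimulation effect +0.5, plus noise.
signal = -1.0 * seizure + 0.5 * stim + 0.1 * rng.standard_normal(n_vols)

# Full model: stimulation regressor, seizure regressor, constant term.
X_full = np.column_stack([stim, seizure, np.ones(n_vols)])
beta_full, *_ = np.linalg.lstsq(X_full, signal, rcond=None)

# Naive model that omits the seizure regressor, for comparison.
X_naive = np.column_stack([stim, np.ones(n_vols)])
beta_naive, *_ = np.linalg.lstsq(X_naive, signal, rcond=None)

stim_effect_full = beta_full[0]           # close to the simulated +0.5
stim_effect_naive = beta_naive[0]         # pulled negative by the seizure signal
```

      In this toy setting the full model recovers the simulated stimulation effect, while the naive model attributes part of the seizure-related decrease to the stimulation.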

      The maps shown in Figure 3 do not show clear changes in the areas claimed to be involved.

      We clarified the overall appearance of Figure 3 by enlarging the selected cross sections for better anatomical differentiation and adding anterior and posterior directions on all images.

      Reviewer #1 (Recommendations For The Authors):

      1) The implementations of brain simulations need to be more specific: How is the stimulation applied in the mean-field model in terms of its mathematical expression? The state variable of the model is the rate of neuronal firing, but how is it subsequently converted into fMRI responses? How are the statistical plots calculated? How much does this result depend on the model parameter?

      Further details and explanations about the model have now been added to the manuscript. The stimulation of a specific region is simulated as an increase in the excitatory input to the specific node. In particular, we use a square function to represent the stimulus (see for example panel A in Figure 6–figure supplement 1). As the referee mentions, the model describes the dynamics of the neuronal firing rates. This provides direct information about neuronal activity and responsiveness, so all the statistical analyses of the simulations shown in the paper were performed using the firing rates. For these analyses, no conversion to fMRI was needed. To build the statistical maps, an ANOVA (analysis of variance) test was used. The ANOVA test is designed to assess the significance of a difference in the mean between samples, and is calculated via an F-test as the ratio of the variance between and within samples. In our case it allowed us to assess the impact of the stimulation on the ongoing neuronal activity by comparing the time series of the firing rate with and without stimulation (this was performed independently for each state). For the results presented in this paper, the ANOVA analysis was performed using the “f_oneway” function of the scipy.stats module in Python. Regarding the dependence on the model parameter, the main results obtained in our paper are related to the responsiveness of the system under two qualitatively different types of ongoing dynamics: an asynchronous irregular activity (interictal period) and an oscillatory SWD type of dynamics (ictal period). In particular, we show how for the SWD dynamics the activity evoked by the stimulus is overshadowed by the ongoing activity, which imposes a strong limitation on the response of the system and the propagation of the stimulus.
      In this sense, the main results of the simulations are very general, and no significant dependence on specific cellular or network parameters was observed within a physiologically relevant range, nor should one be expected. Nevertheless, we point out that, as mentioned in the text, the key parameter that triggers the transition between the two types of dynamics is the strength of the adaptation current (in particular the strength of the spike-triggered adaptation parameter ‘b’ described in the Supplementary information), which in addition has the capacity of controlling the frequency of the oscillations. In the paper, this parameter was set such that the SWD frequency falls within the range observed in the GAERS (7-12 Hz). We believe that further analysis around the region of transition between states, in particular from a dynamical point of view, could be of relevance for future work.
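
      The comparison of firing-rate time series with and without stimulation can be sketched as follows. This is a minimal illustration, assuming toy firing rates generated with NumPy instead of the mean-field output; only `scipy.stats.f_oneway` is taken from the response, and the time step, rates, noise level and stimulus window are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
dt = 0.001                                  # s, hypothetical time step
t = np.arange(0.0, 2.0, dt)

# Square stimulus: extra excitatory input between 1.0 s and 1.5 s.
stimulus = np.where((t >= 1.0) & (t < 1.5), 1.0, 0.0)

# Toy firing rates (Hz): irregular baseline, without and with an evoked response.
rate_no_stim = 5.0 + 0.5 * rng.standard_normal(t.size)
rate_stim = 5.0 + 0.5 * rng.standard_normal(t.size) + 2.0 * stimulus

# One-way ANOVA on the two time series, restricted to the stimulation window.
window = stimulus > 0
f_stat, p_value = f_oneway(rate_stim[window], rate_no_stim[window])
```

      A large F statistic here indicates that the stimulation shifted the mean firing rate relative to the ongoing activity; repeating the test per node and per state would give one value per entry of a statistical map.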

      2) In the abstract, what exactly does "typical information flow in functional pathways" mean and which part of the results does this refer to?

      We note that this sentence was overly complicated. By “typical information flow”, we were referring to sensory responsiveness during interictal state. Therefore, we made the following modifications to the abstract: “These results suggest that sensory processing observed during an interictal state can be hindered or even suppressed by the occurrence of an absence seizure, potentially contributing to decreased responsiveness.”

      3) Figure 4 - Figure Supplement 1 performed an analysis of comparing states between 'when stimulation ended a seizure' and 'stimulation during an ictal period'. The authors should explain more clearly in the manuscript what is the reason and significance of considering the state of 'when stimulation ended a seizure'. And how is a seizure considered to be terminated by stimulation rather than ending spontaneously?

      We have now added explanations to manuscript section 2.5.3 as to why this state was also of interest: “The case when stimulation ended a seizure is particularly interesting for studying the spatial and temporal aspects explaining the shift from ictal, i.e. non-responsiveness state, to non-ictal, i.e. responsiveness state.” We agree that there is a possibility that seizures ended spontaneously at the same time as the stimulus was applied but argue that seizures most probably ended due to stimulation, based on results published previously (https://doi.org/10.1016/j.brs.2012.05.009).

      4) In Section 3.1, some detailed descriptions of methods should be moved to Section 2, e.g. how the spatial and temporal SNR is obtained and the description of bad quality data. Also, I suggest the significance of selecting the optimal MRI sequence be stated earlier in the paper, as Section 3.1 cannot be expected from reading the abstract and introduction.

      We moved some technical explanations of SNRs from section 3.1. to section 2.4.1. Significance of the selection of the MRI sequence is also now stated earlier in the introduction section: “For this purpose, the functionality of ZTE sequence was first piloted, and selected over traditional EPI sequence for its lower acoustic noise and reduced magnetic susceptibility artefacts. The selected MRI sequence thus appeared optimal for awake EEG-fMRI measurements.”

      Some minor issues:

      1) How is ROI defined in this paper? What type of atlas is used?

      Anatomical ROIs were drawn based on the Paxinos and Watson rat brain atlas, 7th edition. A region was selected if statistically significant activations were detected inside that region, based on activation maps. We clarified the definition of ROI as follows: “Anatomical ROIs, based on Paxinos atlas (Paxinos and Watson rat brain atlas 7th edition), were drawn on the brain areas where statistical differences were seen in activation maps.”

      2) Section 4.3.2, "In addition, some responses were seen in the somatosensory cortex during the seizure state, which may be due to the fact that the linear model used did not completely remove the effect of the seizure itself" What is the reason for the authors to make such comments?

      This claim was made because we saw a similar trend of responses (deactivation) in F-contrast maps in the somatosensory cortex when comparing “stimulation during ictal state” maps to the “seizure map”, leading us to assume that the effect of the seizure was still apparent in the maps (even though “seizure only” states were used as nuisance regressors). However, as this claim is highly speculative, we have decided to delete this sentence from the manuscript.

      3) Abbreviations such as SPM, HRF, CBF, etc. are not defined in the manuscript.

      Definitions for these abbreviations were added.

      4) Supplementary information-AdEx mean-field model, 've and vi', e and i should be subscripted.

      Subscripts were added.

      Reviewer #2 (Recommendations For The Authors):

      Below are more detailed questions and concerns. Many questions are about the Methods, which seem to be written by a specialist. However, there are also questions about the experimental approach and conclusions.

      One of the strengths of the study is the use of fMRI and EEG. However, to allow rats to be still in the magnet, isoflurane was used, and then as soon as rats recovered they were imaged. However isoflurane has effects on the brain long after the rats have appeared to wake up. Moreover, to train rats to be still, repetitive isoflurane sessions had to be used. Repetitive isoflurane should have a control of some kind, or be discussed as a limitation.

      The repetitive use of isoflurane is indeed an important limiting factor that was not yet discussed in the manuscript. We have added the following sentences to the “Physiologic and methodologic considerations” section:

      “As the used awake habituation and imaging protocol didn’t allow us to avoid the usage of isoflurane during the preparation steps, we cannot rule out the possible effect of using repetitive anesthesia on brain function. However, duration (~15 min) and concentration of anesthesia (~1.5%) during these steps were still moderate, whereas extended durations (1-3 h) of either single or repetitive isoflurane exposures have been used in previous studies where long-term effects on brain function have been observed (Long II et al., 2016; Stenroos et al., 2021). Moreover, there was a 5-15 min waiting period between the cessation of anesthesia and initiation of the fMRI scan, to avoid the potential short-term effects of isoflurane, which have been found to be most prominent during the 5 min after isoflurane cessation (Dvořáková et al., 2022).”

      An assumption of the study is that interictal periods are normal. However, they may not be. A control is necessary. One also wants to know how often GAERS have spontaneous spike-wave discharges (SWDs), what the authors call seizures. The reason is that the more common the SWDs, the less likely interictal periods are normal. It seems from the Methods that rats were selected if they had frequent seizures so many could be captured in a recording session. Those without frequent seizures were discarded.

      A good control would be a normal rat that has spontaneous SWDs, since almost all rat strains have them, especially with age and in males (PMID: 7700522). However, whether they are frequent enough might be a problem. Alternatively, animals could be studied with rare seizures to assess the normal baseline, and compared to interictal states in GAERS.

      We appreciate this concern raised by the Reviewer. Even though it would be interesting to study different strains and SWD frequency dependence, the aim of this study was to compare interictal vs ictal states in this specific animal model. We also understand that interictal periods may not necessarily model a “normal” state and therefore went through the manuscript again to remove any claims referring to this.

      About the mechanisms of SWDs, the authors should update their language which seems imprecise and lacks current citations (starting on line 71):

      "Although the origin of absence seizures is not fully understood, current studies on rat models of absence seizures suggest that they arise from atypical excitatory-inhibitory patterns in the barrel field of the somatosensory cortex (Meeren et al. 2002; Polack et al. 2007) and lead to synchronous cortico-thalamic activity (Holmes, Brown, and Tucker 2004)."

      Some of the best explanations for SWDs that I know of are from the papers of John Huguenard. His reviews are excellent. They discuss the mechanisms of thalamocortical oscillations.

      We have reformulated the sentences discussing the mechanism of SWDs and included the explanations provided by manuscripts from Huguenard and McCafferty et al.: “Although the origin of absence seizures is not fully understood, current studies on rat models of absence seizures suggest that they arise from excitatory drive in the barrel field of the somatosensory cortex (Meeren et al. 2002; Polack et al. 2007, 2009; David et al., 2008) and then propagate to other structures (David et al., 2008) including the thalamus, which is known to play an essential role during the ictal state (Huguenard, 2019). Notably, the thalamic subnetwork is believed to play a role in coordinating and spacing SWDs via feedforward inhibition together with burst firing patterns. These lead to the rhythms of neuronal silence and activation periods that are detected in SWD waves and spikes (McCafferty et al., 2018; Huguenard, 2019).”

      The following is also not precise:

      "Although seizures are initially triggered by hyperactive somatosensory cortical neurons, the majority of neuronal populations are deactivated rather than activated during the seizure, resulting in an overall decrease in neuronal activity during SWD (McCafferty et al. 2023)." What neuronal populations? Cortex? Which neurons in the cortex? Those projecting to the thalamus? What about thalamocortical relay cells? Thalamic GABAergic neurons?

      Lines 85-8: "In addition, a previous fMRI study on GAERS, which measured changes in cerebral blood volume, found both deactivated and activated brain areas during seizures (David et al. 2008)." Which areas and conditions led to reduced activity? Increased activity? How was it surmised?

      "concurrent stimuli and therefore could contribute to the alterations in behavioral responsiveness" - This idea has been raised before by others (Logothetis, Barth). Please discuss these as the background for this study.

      The particular section was modified to the following:

      “Previous results on GAERS have indicated that, during an absence seizure, hyperactive electrophysiological activity in the somatosensory cortex can contribute to bilateral and regular SWD firing patterns in most parts of the cortex. These patterns propagate to different cortical areas (retrosplenial, visual, motor and secondary sensory), basal ganglia, cerebellum, substantia nigra and thalamus (David et al. 2008; Polack et al. 2007). Although SWDs are initially triggered by hyperactive somatosensory cortical neurons, neuronal firing rates, especially in the majority of frontoparietal cortical and thalamocortical relay neurons, are decreased rather than increased during SWD, resulting in an overall decrease in activity in these neuronal populations (McCafferty et al. 2023). Previous fMRI studies have demonstrated blood volume or BOLD signal decreases in several cortical regions including parietal and occipital cortex, but also, quite surprisingly, increases in subcortical regions such as thalamus, medulla and pons (David et al., 2008; McCafferty et al., 2023). In line with these findings, graph-based analyses have shown an increased segregation of cortical networks from the rest of the brain (Wachsmuth et al. 2021). Altogether, alterations in these focal networks in the animal models of epilepsy impair cognitive capabilities needed to process specific concurrent stimuli during SWD and therefore could contribute to the lack of behavioral responsiveness (Chipaux et al. 2013; Luo et al. 2011; Meeren et al. 2002; Studer et al. 2019), although partial voluntary control in certain stimulation schemes can still be present (Taylor et al., 2017).”

      Please discuss the mean-field model more. What are its assumptions? What is its validation? Do other models also provide the same result?

      We have now extended the discussion and explanation of the mean-field model, both in the main text and in the Supplementary information. The mean-field model is a statistical tool to estimate the mean activity of large neuronal populations, and as such its main assumptions are centered around the size of the population analyzed and the characteristic times of the neuronal dynamics under study. It has been shown that the formalism is valid for characteristic times of neuronal dynamics with a lower bound on the order of a few milliseconds and with population sizes on the order of thousands of neurons (see El Boustani and Destexhe, Neural computation 2009; and Di Volo et al, Neural computation 2019), with both conditions satisfied in the simulations made for this work. Regarding the validation, the model has been extensively validated and used for simulating different brain states (Di Volo et al. 2019; Goldman et al. 2023), signal propagation in cortical circuits (Zerlaut et al, 2018) and to perform whole-brain simulations (Goldman et al, 2023). The standard validation of the mean-field involves its comparison with the activity obtained from the corresponding spiking neural network. For completeness we show in Author response image 1 an example of the SWD type of dynamics obtained from a spiking neural network together with the one obtained from the mean-field. This figure has now been added to the Supplementary information of the paper. Regarding the extension of the results to other models, we think that the generality of our results is an interesting point of our work.
      The main results obtained from our simulation are related to the responsiveness of the system during two different types of ongoing activity: in the interictal state there is a significant variation in the ongoing activity evoked by the stimulation that is propagated to other regions, while in the SWD state the evoked activity is overshadowed by the ongoing activity, which imposes a strong limit on the responsiveness of the system and the propagation of the signal. In this sense, the results of the simulations are very general and should be extensible to other models. Of course, the advantage of using a model like ours is the capability of reproducing the different states, its applicability to large scale simulations, and the fact that it is built from biologically relevant single-cell models (AdEx).

      Author response image 1.

      Comparison of the SWD dynamics in the mean-field model and the underlying spiking neural network of AdEx neurons. A) Raster plot (top) and mean firing rate (bottom) from an SWD type of dynamics obtained from the spiking-network simulations. The network is made of 8000 excitatory neurons and 2000 inhibitory neurons. Neurons in the network are randomly connected with probability p=0.05 for inhibitory-inhibitory and excitatory-inhibitory connections, and p=0.06 for excitatory-excitatory connections. Cellular parameters correspond to the ones used in the mean-field, with spike-triggered adaptation for excitatory neurons set to b=200 pA. We show the results for excitatory (green) and inhibitory (red) neurons. B) Mean firing rate obtained from a single mean-field model. We see that, although the amplitude of oscillations is larger in the spiking network, the mean-field can correctly capture the general dynamics and frequency of the oscillations.
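
      The role of the spike-triggered adaptation parameter b can be illustrated with a single AdEx neuron. The sketch below is not the authors' mean-field model or their spiking network: it is a forward-Euler integration of one adaptive exponential integrate-and-fire neuron with generic textbook-style parameters (all assumptions), the subthreshold adaptation coupling set to zero for simplicity, and a hypothetical constant input current. Only the value b = 200 pA is taken from the caption above.

```python
import math

def count_adex_spikes(b_pA, i_ext_pA=800.0, t_max_ms=500.0, dt_ms=0.1):
    """Euler-integrate one AdEx neuron and return its spike count."""
    # Generic parameters (assumptions, not the paper's values).
    c_m, g_l, e_l = 281.0, 30.0, -70.6            # pF, nS, mV
    v_t, delta_t = -50.4, 2.0                     # mV, mV
    tau_w, v_reset, v_peak = 144.0, -70.6, 0.0    # ms, mV, mV
    v, w, spikes = e_l, 0.0, 0
    for _ in range(int(t_max_ms / dt_ms)):
        exp_term = g_l * delta_t * math.exp((v - v_t) / delta_t)
        dv = (-g_l * (v - e_l) + exp_term - w + i_ext_pA) / c_m   # mV/ms
        dw = -w / tau_w                           # adaptation decays between spikes
        v += dt_ms * dv
        w += dt_ms * dw
        if v >= v_peak:                           # spike: reset V, increment w by b
            v = v_reset
            w += b_pA
            spikes += 1
    return spikes

spikes_no_adaptation = count_adex_spikes(b_pA=0.0)
spikes_with_adaptation = count_adex_spikes(b_pA=200.0)
```

      With b = 0 the neuron fires tonically, while with b = 200 pA each spike adds to the adaptation current and the firing slows down, which is the single-cell ingredient that the response links to the transition between the two network regimes.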

      Line 11: "rats were equally divided by gender." Given n=11, does that mean 5 males and 6 females or the opposite?

      Out of 11 animals, 6 were males, and 5 females. This is now mentioned in the manuscript.

      What was the type of food?

      Type of food was added to the manuscript (Extrudat, vitamin-fortified, irradiated > 25 kGy)

      What were the electrodes?

      This was provided in the manuscript. Carbon fiber filament was produced by World Precision Instruments. The tips of this filament were spread to brush-like shape to increase the contact surface above the skull.

      "low noise zero echo time (ZTE) MRI sequence"- please explain for the non-specialist or provide references.

      Reference added.

      Lines 148-150: "The length of habituation period was selected based on pilot experiments and was sufficient for rats to be in low-stress state and produce absence seizures inside the magnet." How do the authors know the rats were in a low-stress state?

      This claim was based on two factors. First, at the end of the habituation protocol, the motion of animals was considerably decreased, in line with a previous study using a similar restraint/habituation protocol (DOI: 10.3389/fnins.2018.00548). In that study the decreased motion also correlated with decreased blood corticosterone levels, which returned to baseline (indicating a low-stress state) after 4 days of habituation. Second, when epileptic rodents are continuously recorded for 24 h, most SWDs occur during a state of passive wakefulness or drowsiness (Lannes et al. 1988, Coenen et al. 1991). Either way, as we don’t have a way to provide direct evidence of a low-stress state, we modified the sentence to the following:

      “The length of habituation period was selected based on pilot experiments to provide low-motion data therefore giving rats a better chance to be in a low-stress state and thus produce absence seizures inside the magnet.”

      Lines 150-2: "Respiration rate and motion were monitored during habituation sessions using a pressure pillow and video camera to estimate stress level." What were the criteria for a high stress level?

      Criteria for high (or low) stress levels were based mostly on motion levels according to a previous study (DOI: 10.1016/s0149-7634(05)80005-3). Still, as we didn’t measure stress directly, we modified the sentence to the following:

      “Pressure pillow and video camera were used to estimate physiological state, via breathing rate, and motion level, respectively.”

      Lines 152-3: "During the last habituation session, EEG was measured to confirm that the rats produced a sufficient amount of absence seizures (10 or more per session)." If 10 min, the rats would basically be seizing the entire session, leading to doubt about what the interictal state was.

      The last habituation session lasted 60 min and the fMRI scan 45 min. Given that rats produced ~40-50 seizures during the fMRI scan, they produced on average ~1 seizure/min, with one seizure lasting on average 5-6 s, giving ~45 s periods for interictal states. A threshold of 10 or more seizures was used to give statistically meaningful findings, based on pilot experiments.

      Line 153: "Total of 2-5 fMRI experiments were conducted per rat within a 1-3-week period." What was the schedule for each animal? A table would be useful. If it varied, how do the authors know this was justified?

      Please see Figure 1–figure supplement 2 for examples of habituation timelines for individual rats:

      The stated range of 2-5 fMRI experiments was an error; it should be 3-5 fMRI experiments. This was corrected. Our aim was to acquire 12-14 sessions per stimulation condition, and once a sufficient number of sessions had been acquired, some of the animals were not used further. Two of the animals that were found to have good quality EEG and produced sufficient amounts of SWDs were kept, and briefly retrained for the later second stimulation condition experiments. This was done to replace animals that needed to be excluded in the second stimulation condition due to bad quality EEG or a lost implant. Extended use of some animals could theoretically bring slight variation to the results but could actually be an advantage, as these animals were already well trained and provided low-motion data.

      "Before and after each habituation session, rats were given a treat of sugar water and/or chocolate cereals as positive reinforcement. " How much and what was the concentration of sugar water; chocolate cereal?

      Rats were given 3 chocolate cereals and/or 1% sugar water. This was added to the manuscript now.

      Line 188: "We relied on pilot calibration of the heated water to maintain the body temperature" Please explain.

      Sentence was clarified:

      “We relied on pilot calibration of the temperature of heated water circulating inside animal bed to maintain the normal body temperature of ~37 °C”

      Line 190: "After manual tuning and matching of the transmit-receive coil, shimming and anatomical imaging" Please explain for the non-specialist.

      Sentence was simplified:

      “After routine preparation steps in the MRI console were done”

      Lines 199-201: "Anatomical imaging was conducted with a T1-FLASH sequence (TR: 530 ms, TE: 4 ms, flip angle 18°, bandwidth 39,682 kHz, matrix size 128 x 128, 51 slices, field-of-view 32 x 32 mm², resolution 0.25 x 0.25 x 0.5 mm³). fMRI was performed with a 3D ZTE sequence (TR: 0.971 ms, TE: 0 ms, flip angle 4°, pulse length 1 µs, bandwidth 150 kHz, oversampling 4, matrix size 60 x 60 x 60, field-of-view 30 x 30 x 60 mm³, resolution of 0.5 x 0.5 x 1 mm³, polar under sampling factor 5.64, nr. of projections 2060 resulting to a volume acquisition time of about 2 s). A total of 1350 volumes (45 min) were acquired." Please explain for the non-specialist.

      These technical parameters are provided for the sake of repeatability. Section was however clarified as the following and citation was added:

      Anatomical imaging was conducted with a T1-FLASH sequence (repetition time: 530 ms, echo time: 4 ms, flip angle 18°, bandwidth 39,682 kHz, matrix size 128 x 128, 51 slices, field-of-view 32 x 32 mm², spatial resolution 0.25 x 0.25 x 0.5 mm³). fMRI was performed with a 3D ZTE sequence (repetition time: 0.971 ms, echo time: 0 ms, flip angle 4°, pulse length 1 µs, bandwidth 150 kHz, oversampling 4, matrix size 60 x 60 x 60, field-of-view 30 x 30 x 60 mm³, spatial resolution of 0.5 x 0.5 x 1 mm³, polar under sampling factor 5.64, number of projections 2060 resulting in a volume acquisition time of about 2 s (see Wiesinger & Ho, 2022 for parameter explanations)). A total of 1350 volumes (45 min) were acquired.
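
      As a quick consistency check of the quoted fMRI parameters, the volume time, scan duration and voxel size can be recomputed from the numbers in the text. The short sketch below uses only those values; the variable names are illustrative.

```python
tr_ms = 0.971              # repetition time per projection (from the text)
n_projections = 2060       # projections per volume (from the text)
n_volumes = 1350           # volumes per session (from the text)
fov_mm = (30.0, 30.0, 60.0)
matrix = (60, 60, 60)

# Time to acquire one volume: projections times repetition time.
volume_time_s = tr_ms * n_projections / 1000.0

# Scan duration using the rounded 2 s per volume, and using the exact volume time.
nominal_scan_min = n_volumes * 2.0 / 60.0
exact_scan_min = n_volumes * volume_time_s / 60.0

# Voxel size: field-of-view divided by matrix size along each axis.
voxel_mm = tuple(f / m for f, m in zip(fov_mm, matrix))
```

      The recomputed values agree with the stated volume acquisition time of about 2 s, the 45 min scan and the 0.5 x 0.5 x 1 mm³ resolution.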

      "Visual (n=14 sessions, 5 rats) and somatosensory whisker (n=14 sessions, 4 rats)" - Please explain how multiple sessions were averaged for a single rat. Please justify the use of different numbers of sessions per rat.

      All the sessions belonging to the same stimulus scheme (multiple sessions per rat) were entered together as sessions in the SPM analysis, along with all the stimulus conditions belonging to these sessions. Justifications for using a different number of sessions per rat were given above.

      Lines 205-206: "For the visual stimulation, light pulses (3 Hz, 6 s total length, pulse length 166 ms) were produced by a blue led, and light was guided through two optical fibers to the front of the rat's eyes." What wavelength of blue? Why blue? Is the stimulation strong? Weak?

      Wavelength was 470 nm and brightness 7065 mcd with a current of 20 mA. Blue was selected as it is in the frequency range that rats can differentiate, and this color has been used in previous literature (https://doi.org/10.1016/j.neuroimage.2020.117542, https://doi.org/10.1016/j.jneumeth.2021.109287).
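
      The timing of one visual stimulation block follows directly from the quoted pulse parameters (3 Hz, 6 s block, 166 ms pulses). The sketch below derives the pulse count, onsets and duty cycle from those three numbers; it is illustrative only and says nothing about how the stimulator was actually programmed.

```python
freq_hz = 3.0          # pulse frequency (from the text)
block_s = 6.0          # stimulation block length (from the text)
pulse_ms = 166.0       # single pulse length (from the text)

n_pulses = int(round(freq_hz * block_s))        # pulses per block
period_ms = 1000.0 / freq_hz                    # time between pulse onsets
onsets_ms = [i * period_ms for i in range(n_pulses)]
duty_cycle = pulse_ms / period_ms               # fraction of each period with light on
light_on_s = n_pulses * pulse_ms / 1000.0       # total light-on time per block
```

      With these values the light is on for roughly half of each 6 s block.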

      Line 212: "Stimulation parameters were based on previous rat stimulation fMRI studies to produce robust responses" What is a robust response? One where a lot of visual cortical voxels are activated?

      Sentence was corrected as the following:

      “Stimulation parameters were based on previous rat stimulation fMRI studies and chosen to activate voxels widely in visual and somatosensory pathways, respectively.”

      Line 245: "Seizures were confirmed as SWDs if they had a typical regular pattern, had at least double the amplitude compared to baseline signal..." What was the "typical" pattern? What baseline signal was it compared to? Was the baseline measured as an amplitude? Peak to trough?

      Sentence was corrected to the following:

      “Seizures were confirmed as SWDs if they had a typical regular spike and wave pattern with 7-12 Hz frequency range and had at least double the amplitude compared to baseline signal. All other signals were classified as baseline i.e. signal absent of a distinctive 7-12 Hz frequency power but spread within frequencies from 1 to 90 Hz.”
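
      The two criteria in this definition (dominant power in the 7-12 Hz range and at least double the baseline amplitude) can be sketched as a simple check on an EEG segment. This is a minimal illustration on synthetic data, assuming peak-to-peak amplitude as the amplitude measure and an FFT-based dominant frequency; the function name, sampling rate and signals are hypothetical, and the authors relied on inspection of the signal and power spectrum rather than on this code.

```python
import numpy as np

def is_swd(segment, baseline, fs_hz, band=(7.0, 12.0)):
    """Return True if `segment` meets both SWD criteria relative to `baseline`."""
    # Criterion 1: peak-to-peak amplitude at least double that of the baseline.
    amp_ok = np.ptp(segment) >= 2.0 * np.ptp(baseline)
    # Criterion 2: dominant frequency (within 1-90 Hz) lies in the 7-12 Hz band.
    spectrum = np.abs(np.fft.rfft(segment - np.mean(segment))) ** 2
    freqs = np.fft.rfftfreq(segment.size, d=1.0 / fs_hz)
    in_range = (freqs >= 1.0) & (freqs <= 90.0)
    peak_freq = freqs[in_range][np.argmax(spectrum[in_range])]
    band_ok = band[0] <= peak_freq <= band[1]
    return bool(amp_ok and band_ok)

fs = 500.0                                       # Hz, hypothetical sampling rate
t = np.arange(0.0, 4.0, 1.0 / fs)
rng = np.random.default_rng(2)
baseline_eeg = rng.standard_normal(t.size)       # broadband baseline
swd_eeg = 10.0 * np.sin(2 * np.pi * 9.0 * t) + rng.standard_normal(t.size)

swd_detected = is_swd(swd_eeg, baseline_eeg, fs)
baseline_detected = is_swd(baseline_eeg, baseline_eeg, fs)
```

      The synthetic 9 Hz oscillation passes both criteria, while the broadband baseline fails the amplitude criterion.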

      "using rigid, affine, and SYN registrations" Please explain for the non-specialist.

      Corrected as the following:

      “using rigid, affine (linear) and SYN (non-linear) registrations”

      Line 274-5: "However, there were also intermediate cases where the seizure started or ended during the stimulation block (Figure 1 - Figure Supplement 1). These intermediate cases were modeled as confounds" Why confounds? They could be very interesting because the stimulation may not be affected if timed at the end of the seizure. What was the definition of start and end? Defining the onset and end of seizures is tricky.

      We agree that these cases are also highly interesting. Indeed, all the intermediate cases were also analyzed separately but not included in the manuscript (other than the case when stimulation immediately ended a seizure), as no statistically significant findings emerged when comparing these cases to the baseline. For example, when stimulation was applied towards the end of a seizure, it produced weakened responses that were still stronger than when stimulation was applied fully during a seizure (indicating some responsiveness after the cessation of the seizure). As these intermediate cases led to results with higher variance, we considered them as confounds in the general linear model (i.e. reducing unwanted variance from the results of interest).

      Definition of onset and end of seizure can be difficult in some cases. When looking at the signal itself, especially towards the end of seizure the amplitude of SWDs can get weaker and thus the shift from seizure to baseline signal can be more problematic to differentiate. However, when looking at the power spectrum the boundaries were more easily detectable. Thus, in the definitions of onsets and ends of seizure we relied on both the signal and power spectrum (stated in the manuscript).

      "in the SPM analysis" Please explain for the non-specialist.

      Definition of SPM together with a link to software site was added.

      Line 276: "of fMRI data (see 2.5.3.) and thus explained variance that was not accounted for by the main effects of interest. " Please clarify.

      Clarified as:

      “Intermediate cases, where the seizure started or ended during the stimulation block (Figure 1–figure supplement 1), were considered as confounds of no interest in the SPM analysis of fMRI data, and the variance explained by the confounds was removed from the main effects of interest”
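
      The sorting of stimulation blocks into conditions can be sketched as a small classifier. This is a hypothetical helper, not the authors' code: the labels are invented for illustration, and the only values taken from the manuscript are the 6 s block length and the 0-2 s window between stimulation start and seizure end that defines the case where stimulation ended a seizure.

```python
def classify_stimulation(stim_start, stim_end, seizures, end_window=2.0):
    """Label one stimulation block given (start, end) seizure times in seconds."""
    for sz_start, sz_end in seizures:
        if sz_start <= stim_start and stim_end <= sz_end:
            return "ictal"            # block lies fully inside a seizure
        if sz_start <= stim_start and 0.0 <= sz_end - stim_start <= end_window:
            return "ended_seizure"    # seizure ends 0-2 s after stimulation onset
        if stim_start < sz_end and sz_start < stim_end:
            return "intermediate"     # partial overlap: confound of no interest
    return "interictal"               # no overlap with any seizure

# Hypothetical seizure intervals for a 6 s block running from 10 s to 16 s.
label_ictal = classify_stimulation(10.0, 16.0, [(5.0, 60.0)])
label_ended = classify_stimulation(10.0, 16.0, [(0.0, 11.0)])
label_intermediate = classify_stimulation(10.0, 16.0, [(12.0, 30.0)])
label_interictal = classify_stimulation(10.0, 16.0, [(0.0, 5.0)])
```

      Blocks labelled "ictal" and "interictal" would correspond to the main conditions of interest, and "intermediate" blocks to the confound regressors described above.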

      Line 277: "Additionally, a contrast..." What is meant?

      This chapter in 2.5.3. was modified as a whole to be more clear.

      Line 278-9: "...was given to two cases: i) when stimulation ended a seizure (0-2 s between stimulation start and seizure end)..." Again, how is the seizure onset and end defined?

      See the comment above.

      Lines 281-2: "Stimulations that did not fully coincide with a seizure were considered as nuisance regressors in the second level analysis." What is meant by nuisance regressor?

      Reference to SPM 12 manual was given for technical terms referring to analysis software.

      Lines 283-8: "Motion periods were also included as multiple regressors (not convolved with a basis function) to be used as nuisance regressors. Stimulations that coincided with a motion above 0.3% of the voxel size were not considered stimulation inputs. Stimulation and seizure inputs were convolved with "3 gamma distribution basis functions" (i.e. 3rd order gamma) in SPM (option: basis functions, gamma functions, order: 3), to account for temporal and dispersion variations in the hemodynamic response. The choice of 3rd order gamma was based on the expectation that time-to-peak and shape of HRFs of seizure could vary across voxels (David et al. 2008)." Please explain the technical terms.

      Reference for SPM 12 manual was given for technical terms referring to analysis software, and HRF was defined.

      "BAMS rat connectome" - Please explain the technical terms.

      Modified as:

      “…connection matrix of the rat nervous system (BAMS rat connectome, Bota, Dong, and Swanson 2012).”

      Results

      After removing problematic animals and sessions, was there sufficient power? There probably wasn't enough to determine sex differences.

      After removing problematic sessions, we found statistically significant (multiple-comparison-corrected) results in both activation maps and hemodynamic responses. There were not enough animals to detect sex differences (p>0.05).

      Figure 2 - I don't understand "tSNR" here. What is the point here?

      B vs C. Are these different brain areas or the same but SNR was adjusted?

      D. Where is FD explained? I think explaining what the parts of the figure show would be helpful.

      tSNR, the temporal signal-to-noise ratio, characterizes the behavior of noise over time. Readers who plan to replicate the awake fMRI protocol with the single-loop coil may be interested in data quality and in the coil's ability to separate signal from noise, as this is one of the most important factors in fMRI designs where small signal changes have to be distinguished from background noise.
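
      For readers unfamiliar with the measure, tSNR is commonly computed per voxel as the temporal mean of the signal divided by its temporal standard deviation. The sketch below assumes that common definition; the response does not state the exact formula used, and the function name and toy data are hypothetical.

```python
import numpy as np

def temporal_snr(timeseries, axis=-1):
    """Voxel-wise tSNR: temporal mean divided by temporal standard deviation.

    `timeseries` holds one time course per voxel along `axis`.
    Voxels with zero temporal variance are assigned a tSNR of 0.
    """
    data = np.asarray(timeseries, dtype=float)
    mean = data.mean(axis=axis)
    std = data.std(axis=axis, ddof=1)  # sample standard deviation over time
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(std > 0, mean / std, 0.0)

# Two toy voxels with four time points each.
toy = np.array([[9.0, 11.0, 9.0, 11.0],   # fluctuating voxel
                [5.0, 5.0, 5.0, 5.0]])    # constant voxel
tsnr = temporal_snr(toy)
```

      Whether the population or the sample standard deviation is used (the `ddof` argument) is a design choice that matters little for long time series.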

      B and C illustrate the same brain area, but B was acquired with high resolution anatomical scanning (T1 FLASH), and C was acquired with low resolution ZTE scanning. We clarified the figure legend to the following:

      “…spatial signal-to-noise ratios of an illustrative high resolution anatomical T1-FLASH (B), and low resolution ZTE image (C)”

      FD was explained in section 2.5.1. Some parts of the explanation were clarified: “Framewise displacement (FD) (Figure 2E) was calculated as follows. First, the differential of successive motion parameters (x, y, z translation, roll, pitch, yaw rotation) was calculated. Then absolute value was taken from each parameter and rotational parameters were divided by 5 mm (as estimate of the rat brain radius) to convert degrees to millimeters (Power et al. 2012). Lastly, all the parameters were summed together.”
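
      The steps in the quoted definition can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the arc-length convention of Power et al. (2012), in which a rotation is converted to millimeters as the angle in radians times the assumed brain radius (5 mm here), and it assumes the six motion parameters are ordered as three translations in mm followed by three rotations in degrees. The function name is hypothetical.

```python
import numpy as np

def framewise_displacement(motion, radius_mm=5.0):
    """Framewise displacement (FD) per volume.

    motion: array of shape (n_volumes, 6) with columns
            x, y, z (mm) and roll, pitch, yaw (degrees).
    radius_mm: assumed brain radius used to convert rotations to mm.
    The first volume has no predecessor and gets FD = 0.
    """
    motion = np.asarray(motion, dtype=float)
    if motion.ndim != 2 or motion.shape[1] != 6:
        raise ValueError("motion must have shape (n_volumes, 6)")
    # 1) differential of successive motion parameters, 2) absolute value
    step = np.abs(np.diff(motion, axis=0))
    # 3) rotations: degrees -> radians -> arc length on a sphere (mm)
    step[:, 3:] = np.deg2rad(step[:, 3:]) * radius_mm
    # 4) sum all six parameters
    return np.concatenate([[0.0], step.sum(axis=1)])

# Toy example: 1 mm shift in x, then a roll of one radian.
toy_motion = np.zeros((3, 6))
toy_motion[1:, 0] = 1.0
toy_motion[2, 3] = np.degrees(1.0)
fd = framewise_displacement(toy_motion)
```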

      Table 1 has no statistical comparisons.

      Table 1 is purely an illustration of stimulation and seizure occurrence. There is no specific interest to compare stimulation types (in what state of seizure it occurred) as it does not provide any meaningful inferences to the study.

      Statistical activation maps - it is not clear how this was done.

      The creation of statistical maps is explained in section 2.5.3.

      Line 384-5: "In addition, some responses were observed in the somatosensory cortex during a seizure state, probably due to incomplete nuisance removal of the effect of the seizure itself by the linear model used." I don't see why the authors would not suggest that the result is logical given that stimuli should activate the somatosensory cortex.

      Sentence was modified as the following:

      “In addition, responses were observed in the somatosensory cortex during a seizure state”

      Fig 3 "F-contrast maps." Please explain.

      The creation of statistical maps is explained in section 2.5.3.

      HRF- please define. The ROI selection is unclear - it "was based on statistical differences seen in activation maps." But how were ROIs drawn? Also, why were HRFs examined at the end of seizures?

      HRF was defined, and definitions of HRF and ROI were moved from results section 3.3. to method section 2.5.3.

      Definition of ROI was clarified:

      “Anatomical ROIs, based on Paxinos atlas (Paxinos and Watson rat brain atlas 7th edition), were drawn on the brain areas where statistical differences were seen in activation maps.”

      HRFs were additionally estimated at the end of a seizure because the shift in brain state from ictal to interictal was of specific interest. This shift also yielded statistically significant findings, in that brain responses differed from those to ictal stimulation.

      Line 421: "Interestingly, the response amplitude was higher when the stimulation ended a seizure compared to when it did not" Why is this interesting?

      Word “interestingly” was changed to “additionally” to avoid any inferences in the results section.

      Line 427: "Notably, HRFs amplitudes were both negatively and positively signed during the ictal state, depending on the brain region." Why is this notable?

      Word “notably” was removed to avoid any inferences in the results section.

      Please explain the legends of Figures 4 and 6 more clearly.

      The legends of Figure 4 and Figure 4–figure supplement 1 were clarified:

      “HRFs were calculated in a selected ROI, belonging to the visual or somatosensory area, by multiplying the gamma basis functions (Figure 1–figure supplement 1, B) with their corresponding average beta values over the ROI and taking the sum of these values.”
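
      The calculation described in the legend is a weighted sum of basis functions. The sketch below shows that operation on toy arrays; it does not reproduce SPM's gamma basis set, and the function name and array layout are assumptions made for illustration.

```python
import numpy as np

def reconstruct_hrf(basis, roi_betas):
    """Weighted sum of basis functions using ROI-averaged betas.

    basis:     array (n_basis, n_timepoints), one basis function per row.
    roi_betas: array (n_basis, n_voxels), beta of each basis function
               for every voxel in the ROI.
    Returns the reconstructed response of shape (n_timepoints,).
    """
    basis = np.asarray(basis, dtype=float)
    roi_betas = np.asarray(roi_betas, dtype=float)
    if basis.shape[0] != roi_betas.shape[0]:
        raise ValueError("basis and roi_betas need the same number of basis functions")
    mean_betas = roi_betas.mean(axis=1)  # average beta over the ROI
    return mean_betas @ basis            # sum_k mean_beta_k * basis_k(t)

# Toy example: two basis functions, three time points, two ROI voxels.
toy_basis = np.array([[1.0, 2.0, 3.0],
                      [0.0, 1.0, 0.0]])
toy_betas = np.array([[1.0, 3.0],
                      [2.0, 2.0]])
hrf = reconstruct_hrf(toy_basis, toy_betas)
```

      Because the betas can be negative, the reconstructed response can take either sign, which is consistent with the positively and negatively signed HRFs mentioned in the review.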

      Using the comments above as a guide, please revise the Discussion to be more precise and more clear about what was shown and what can be concluded in light of limitations. Please ensure the literature is cited where appropriate.

      Some parts of the discussion and conclusion sections were modified.

      Reviewer #3 (Recommendations For The Authors):

      Minor comments:

      Formatting: fMRI maps in Figures 3 and 5 should be more clearly labeled, indicating anterior and posterior directions on all images, and the cross sections should be enlarged to enable anatomical areas to be more clearly differentiated.

      Anterior and posterior directions were added, and cross sections were enlarged.

      The Methods section 2.41 and other places in the text, and Figure 2 - Figure Supplement 1 say that there was less artifact on the EEG with ZTE than with GE-EPI. However the EEG shown in Figure 2 - Figure Supplement 1 Part C shows much more artifact in the left (ZTE) trace than the right (GE-EPI) trace. This apparent contradiction should be resolved.

      The figure was actually demonstrating the relative change to the signal when MRI sequences were on, and by this standard, the ZTE produced both less amplitude and frequency changes than EPI. In the example figure, the baseline fluctuations in the EEG trace in the left were higher in amplitude than in the right, and this could potentially lead to misconception of ZTE producing more noise. Figure legend was clarified to highlight relative change:

      “ZTE also caused relatively less artificial noise on EEG signal, keeping both amplitude of the signal and frequencies relatively more intact, which improved live detection of absence seizures.”

      Figure 2 - Supplement 1, part B horizontal axis should provide units.

      Units were added.

      Figure 2 - Supplement 1, legend last sentence says arrows mark the beginning of each "sequence." Is this a typo and should this instead say "each seizure"?

      It should state “each fMRI sequence”; this was corrected.

      Line 307, Methods "to reveal brain areas where ictal stimulation provided higher amplitude response than interictal" - should this be reversed, ie weren't the authors analyzing a contrast to determine where interictal signals were higher than ictal signals?

      This should be reversed, and was corrected, thank you for noting this.

      Figure 6 - Figure Supplement 1, the scales are very different for many of the plots so they are hard to compare. Especially in the ictal periods (D, E, F) it is hard to see if any changes are happening during ictal stimulation similar to interictal stimulation due to very different scales. The activity related to SWD is so large that it overshadows the rest and perhaps should be subtracted out.

      We point out that Figure 6 - Figure Supplement 1 reproduces, at a higher level of detail, the results shown in Figure 6 of the main text, where all signals are plotted on the same scale. The difference between the scales used in this figure is intended; its purpose is to highlight the large differences observed in the ongoing activity and the evoked response between the two states (ictal and interictal). In interictal periods the ongoing activity is characterized by fluctuations around a baseline level whose variance is strongly affected by the application of the stimulus. In contrast, ictal periods are characterized by large oscillations, with periods of high and synchronized activity followed by periods of nearly no activity, where the effect of the stimulus on the dynamics is overshadowed by the ongoing dynamics (both from local and from afferent nodes), as the referee mentions, and this imposes a strong limit on the responsiveness of the system and the propagation of the signal.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Assessment:

      The manuscript titled 'Rab7 dependent regulation of goblet cell protein CLCA1 modulates gastrointestinal homeostasis' by Gaur et al discusses the role of Rab7 in the development of ulcerative colitis by regulating the lysosomal degradation of Clca1, a mucin protease. The manuscript presents interesting data and provides a potential molecular mechanism for the pathological alterations observed in ulcerative colitis. Gaur et al demonstrate that Rab7 levels are lowered in UC and CD. However, a similar analysis of Rab7 levels in ulcerative colitis (UC) and Crohn's disease (CD) patient samples was conducted recently (Du et al, Dev Cell, 2020) which showed that Rab7 levels are found to be elevated under these conditions. While Gaur et al have briefly mentioned Du et al's paper in passing in the discussion, they need to discuss these contradictory results in their paper and clarify these differences. Additionally, Du et al are not included in the list of references.

      Strengths:

      The manuscript used a multi-pronged approach and compares patient samples, mouse models of DSS, and protocols that allow differentiation of goblet cells. They also use a nanogel-based delivery system for siRNAs, which is ideal for the knockdown of specific genes in the gut.

      Weaknesses:

      (1) Du et al, Dev Cell 2020 (https://doi.org/10.1016/j.devcel.2020.03.002) have previously shown that Rab7 levels are elevated in a similar set of colonic samples (age group, number etc.) from UC and CD patients. Gaur et al have not discussed this paper or its findings in detail, which directly contradicts their results. Clarification regarding this should be provided.

      We thank the reviewer for raising this point.

      The results shown by Du et al, Dev Cell, 2020 depict elevated expression of Rab7 in UC and CD patients compared to controls. At first glance, these results appear contradictory, but there are a few possible explanations for this.

      Firstly, Rab7 expression levels may fluctuate in the tissue depending on the degree of gut inflammation. This can be concluded from our observations in the DSS-mice dynamics model and the human patient samples with mild and moderate UC. Furthermore, Du et al provide no information on the severity of the condition among the patients included in their study. Our motive, in the current work, was to emphasize this aspect. This point was mentioned in the discussion section of the manuscript. However, in view of the reviewer’s concern, we have now added a detailed comment on this in the main text of the revised version of the manuscript.

      Secondly, the control biopsies in our investigation were acquired from non-IBD patients, whereas Du et al. used biopsies from the normal para-carcinoma region of colorectal cancer patients. One cannot overlook the fact that physiological and molecular changes are apparent even in non-inflamed regions of the gut of an IBD or CRC patient. It is possible that the observed discrepancy arises from the differences in the sample type used for comparing Rab7 expression.

      Finally, the main sub-tissue region showing a decrease in Rab7 expression in UC samples appeared to be the goblet cells, which were not covered by Du et al.

      Keeping these points in mind we do not think that there is a contradiction in our findings with that of Du et al., 2020. In the revised submission some of these explanations are incorporated (Lines 106-109).

      This was an oversight from our side. We have actually mentioned Du et al., 2020 in the discussion (line number 345) but somehow the reference was missing in the main list. We have ensured that the reference is included in the revised version and that their findings are included both in main text and in the discussion.

      Reviewer #2 (Public Review):

      Summary:

      In this work, the authors report a role for the well-studied GTPase Rab7 in gut homeostasis. The study combines cell culture experiments with mouse models and human ulcerative colitis patient tissues to propose a model where, Rab7 by delivering a key mucous component CLCA1 to lysosomes, regulates its secretion in the goblet cells. This is important for the maintenance of mucous permeability and gut microbiota composition. In the absence of Rab7, CLCA1 protein levels are higher in tissues as well as the mucus layer, corroborating with the anticorrelation of Rab7 (reduced) and CLCA1 (increased) from ulcerative colitis patients. The authors conclude that Rab7 maintains CLCA1 level by controlling its lysosomal degradation, thereby playing a vital role in mucous composition, colon integrity, and gut homeostasis.

      Strengths:

      The biggest strength of this manuscript is the combination of cell culture, mouse model, and human tissues. The experiments are largely well done and, in most cases, the results support their conclusions. The authors go to substantial lengths to find a link, such as alteration in microbiota, or mucus proteomics.

      Weaknesses:

      (1) There are also some weaknesses that need to be addressed. The association of Rab7 with UC in both mice and humans is clear, however, claims on the underlying mechanisms are less clear. Does Rab7 regulate specifically CLCA1 delivery to lysosomes, or is it an outcome of a generic trafficking defect?

      We thank the reviewer for the insightful comment. We would like to offer the following explanation for each of these concerns:

      Our immunofluorescence imaging experiments revealed co-localization of Rab7 protein with CLCA1 and the lysosomes (Fig 7I). In addition, the absence of Rab7 affects the transport of CLCA1 to lysosomes (Fig 7J). This demonstrates that Rab7 may be involved in regulation of CLCA1 transport (presumably along with other cargo), to lysosomes selectively. However, we do recognize that the point raised by the reviewer about possible effect of a generic trafficking defect is valid.

      (2) CLCA1 is a secretory protein, how does it get routed to lysosomes, i.e., through Golgi-derived vesicles, or by endocytosis of mucous components? Mechanistic details on how CLCA1 is routed to lysosomes will add substantial value.

      As mentioned in the manuscript, the trafficking of CLCA1 protein or CLCA1-containing vesicles within the goblet cell is unknown, with no information on the proteins involved in its mobility. The switching of CLCA1 containing vesicles from the secretory route to lysosomes needs extensive investigation involving overall trafficking of the protein. Taken together, the complete answer to both these important questions will need a series of experiments and those may be interesting avenues for future research.

      (3) Why does the level of Rab7 fluctuate during DSS treatment (Fig 1B)?

      This is a very thoughtful point from the reviewer. We detected a distinct pattern of Rab7 expression fluctuation in intestinal epithelial cells after DSS-dynamics treatment in mice. Perhaps these changes are the result of complex cellular signaling in response to the DSS treatment. Rab7, being a fundamental protein in the protein sorting pathway, is expected to change according to the cell's requirements. Presently there are no reports describing the regulatory mechanisms that govern Rab7 levels in the gut.

      (4) Does the reduction seen in Rab7 levels (by WB) also reflect in reduced Rab7 endosome numbers?

      We observed a reduction in Rab7 expression at both the RNA and protein levels. Confirming whether this alteration leads to reduced numbers of Rab7-positive endosomes will require detailed investigation.

      (5) Are other late endosomal (and lysosomal) populations also reduced upon DSS treatment and UC? Is there a general defect in lysosomal function?

      There is no direct evidence showing a reduction in the late endosomal and lysosomal populations during gut inflammation, but a few studies link lysosomal dysfunction with risk for colitis (doi: 10.1016/j.immuni.2016.05.007).

      (6) The evidence for lysosomal delivery of CLCA1 (Fig 7 I, J) is weak. Although used sometimes in combination with antibodies, lysotracker red is not well compatible with permeabilization and immunofluorescence staining. The authors can substantiate this result further using lysosomal antibodies such as Lamp1 and Lamp2. For Fig 7J, it will be good to see a reduction in Rab7 levels upon KD in the same cell.

      We applied Lysotracker red to live cells and then fixed them, which avoids the permeabilization issue. Lamp1, as suggested by the reviewer, is definitely a better marker for lysosomes in immunofluorescence studies, but it has also been shown to mark late endosomes (doi: 10.1083/jcb.132.4.565). As Rab7 protein also marks late endosomes, using Lamp1 may leave ambiguity as to whether CLCA1 is in Rab7-positive late endosomes or in lysosomes. Nevertheless, we have carried out this experiment, as suggested by the reviewer, by staining the cells with LAMP1 (author response image 1). Consistent with our previous data, the colocalization of CLCA1 with LAMP1-positive vesicles decreased upon Rab7 knockdown. We also observed a decrease in the intensity of LAMP1 staining in cells with Rab7 knockdown. This observation can be attributed to the decrease in Rab7-positive vesicles or late endosomes, which also exhibit LAMP1 staining.

      Author response image 1.

      (A) Representative confocal images of HT29-MTX-E12 cells transfected with either scrambled siRNA (control) or Rab7 siRNA (Rab7 knockdown). Cells are stained for CLCA1 (green) using anti-CLCA1 antibody and for lysosomes with LAMP1. (B) Graph shows quantitation of colocalization between CLCA1 and LAMP1 from images (n=20) using Manders’ overlap coefficient. Inset shows zoomed areas of the image with colocalization puncta (yellow) marked with arrows.
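
      For reference, Manders' overlap coefficient between two channels is the sum of the pixel-wise intensity products divided by the square root of the product of the summed squared intensities of each channel. The sketch below implements that standard formula on toy arrays; it omits background subtraction and thresholding, which the response does not describe, and the function name is hypothetical.

```python
import numpy as np

def manders_overlap(channel_a, channel_b):
    """Manders' overlap coefficient of two equally shaped intensity images.

    r = sum(a * b) / sqrt(sum(a**2) * sum(b**2))
    Ranges from 0 (no overlap) to 1 (proportional intensities) for
    non-negative images. Returns 0.0 if either channel is empty.
    """
    a = np.asarray(channel_a, dtype=float).ravel()
    b = np.asarray(channel_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("both channels must have the same number of pixels")
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    if denom == 0:
        return 0.0
    return float(np.sum(a * b) / denom)

# Toy 2x2 "images": a CLCA1-like channel and a LAMP1-like channel.
clca1 = np.array([[1.0, 0.0], [0.0, 0.0]])
lamp1 = np.array([[1.0, 1.0], [0.0, 0.0]])
moc = manders_overlap(clca1, lamp1)
```

      The coefficient is insensitive to a uniform scaling of either channel, so differences in overall staining intensity between conditions do not by themselves change it.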

      (7) In this connection, Fig S3D is somewhat confusing. While it is clear that the pattern of Muc2 in WT and Rab7-/- cells are different, how this corroborates with the in vivo data on alterations in mucus layer permeability -- as claimed -- is not clear.

      The data in Fig. S3D suggest the involvement of Rab7 in packaging of Muc2. The whole idea for doing this experiment was to support our observation in the Rab7KD-mice model where mucus layer was seen to be loose and more permeable in Rab7 deficient mice.

      (8) Overall, the work shows a role for a well-studied GTPase, Rab7, in gut homeostasis. This is an important finding and could provide scope and testable hypotheses for future studies aimed at understanding in detail the mechanisms involved.

      We thank the reviewer for this comment.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Specific questions to the authors:

      (1) Why is the dotted line in Fig. 1c at -7.5? What does this signify?

      Response: The dotted line was intended to represent the baseline; in the revised manuscript it is corrected and placed at y=0.

      (2) Du et al should be cited. Fig 6 K-Q from Du et al should be discussed and reasons for contradictory findings should be given in greater detail, rather than a single sentence in the discussion.

      Response: The reference for Du et al is now included in the list, and the possible reasons for the differences from the findings of the current work are discussed in the main text (Lines 106-109).

      (3) Fig1. Why are Rab7 levels low even in remission patient samples? Can DSS be withdrawn to induce remission followed by analysis of colonic samples?

      Response: A possible explanation for this observation could be that the restoration of Rab7 levels may not immediately follow the resolution of clinical symptoms in remission patients. After the remission initiation, the normalization of cellular processes, including the regulation of Rab7 expression, might exhibit a time lag. A thorough investigation of Rab7 levels and the allied pathways at different time points during the remission phase could provide deeper insights into the gradual dynamics of recovery. As suggested by the reviewer, DSS withdrawal induced recovery model can be utilized for understanding the same and could be a good approach for future investigations.

      (4) Fig. 2: Single-channel fluorescence should be shown.

      Response: The single channel fluorescence images are incorporated in Fig. S2.

      (5) Line 456 should be modified. 'Blind pathologist' does not read well!

      Response: The line has been modified to ‘Blinded pathologist’.

      (6) Other inflammatory markers, cytokine levels should be looked at in addition to TNF alpha.

      Response: TNF-α is a crucial mediator in intestinal inflammation, actively contributing to the development of IBD. Elevated levels of TNF-α are observed in patients of IBD (Billmeier U. et al, World J Gastroenterol. 2016). In the current work, while probing for TNF-α our primary objective was to examine this significant indicator of colitis following Rab7 knockdown in mice, aiming to gain insights into heightened gut inflammation.

      (7) Quantitation of S3D should be provided.

      Response: The dispersed expression of Muc2 was observed in n=20 cells per sample and it was a qualitative observation. The aim was to identify any changes in Muc2 packaging under Rab7 knockout conditions.

      (8) Microbiota analysis should include Rab7KD+DSS mice.

      Response: We understand the importance of this point; however, in the current work our primary objective was to specifically investigate changes in microbial diversity and abundance in Rab7KD mice compared to both DSS+CScr and CScr mice. Rab7KD+DSS mice are expected to show higher dysbiosis in comparison to DSS+CScr.

      (9) Fig 6 H and I, G. How do Clca1 levels reduce in Rab7kd +DSS relative to Scr+DSS while they are higher in Rab7kd compared to Scr. Comment.

      Response: The decreased expression of CLCA1 in the mucus of DSS+Rab7KD mice can be attributed to the significant reduction in goblet cell numbers in these mice, as evidenced by the observed loss of these cells (Fig. S3B and Fig. S3C). CLCA1 is exclusively secreted by goblet cells, so a decline in their numbers directly affects CLCA1 levels.

      (10) How are Rab7 levels downregulated? What is the predicted mechanism?

      Response: While our current study didn't explore this aspect, it's worth noting that Rab7 protein levels undergo regulation through various mechanisms, including post-translational modifications such as Ubiquitination and SUMOylation. These modifications are known to regulate Rab7 stability, transport and recycling. Specific experiments conducted during this study (work not included in the manuscript) indicated the participation of SENP7, a deSUMOylase, in controlling the stability of Rab7 protein, particularly in the context of colitis. Additionally, goblet cell specific mechanisms are also likely to be controlling the Rab7 in the gut.

      (11) What is the explanation for opposite changes in CLCA1 RNA (down) and protein (up)?

      Response: The reduction in CLCA1 at the RNA level could be associated with the decrease in goblet cell numbers during colitis. Our investigation indicates that Rab7 predominantly influences CLCA1 at the protein level by impacting its degradation pathway. It is important to acknowledge that not all the alterations in CLCA1 observed during colitis can be solely attributed to Rab7, but our study has identified a connection between Rab7 and CLCA1.

      (12) In light of Du et al, it would be interesting to see how the number of peroxisomes changes upon alteration of Rab7 levels.

      Response: The suggestion by the reviewer is noteworthy. However, as it belongs to an altogether different domain, it deviates from the primary objectives of the current work, in which our goal was specifically to explore the role of Rab7 in goblet cell functioning. It is thus an attractive theme for future investigations.

      (13) While Gaur et al suggest in their discussion that Du et al may have observed an upregulation in Rab7 levels in different cell types of the intestine, this is not apparent from the data provided. Tissue sections should be carefully analysed to provide data supporting this observation. Differences in reagents used (antibodies) should also be considered. As far as the human patient data is concerned, it does not appear that the sample stages are very different across the two manuscripts (based on age, inclusion criteria etc.).

      Response: This has been explained in detail in our public comments.

      Reviewer #2 (Recommendations For The Authors):

      (1) In general, image-based measurements could be done better (for example, object-based statistics than pixel-based overlaps) and represented differently. It is difficult to appreciate the reduction in Rab7 levels in goblet cells in Fig 2 A, C. It might be good to show the channels separately, and perhaps use an intensity gradient LUT for the Rab7 channel.

      Response: The single channel fluorescence images are incorporated in Fig. S2.

      (2) The EM images, and particularly Fig 2F are not convincing, with an oddly square-shaped vesicle. I'm not sure what value they are adding to the interpretation.

      Response: The observed square-shaped vesicle in Fig. 2F could be attributed to the dynamic nature of vesicles within a cell, which allows them to adopt various shapes depending on their state and function. The presence of Rab7 near the vacuoles of goblet cells signifies its probable involvement in regulating the secretory function of these cells, which is the key aspect covered in this work.

      (3) A general method question concerns the definition of the distal colon. How is this decided, particularly when colon lengths are reduced upon DSS treatment?

      Response: The murine colon is divided into a proximal and a distal part, which differ visibly in their inner folds; these are quite prominent in the proximal colon. Additionally, the portion towards the rectum (predominantly distal colon) was mainly used for the experiments. In each case the experimental groups were matched for the respective areas.

      (4) The use of an in vivo intestine-specific Rab7 silencing model is good. Why does Rab7 KD itself not recapitulate aspects of DSS treatment? Rather, it seems to exacerbate it.

      Response: Our objective was to determine whether the downregulation of Rab7 during colitis was the cause or consequence of gut inflammation. Interestingly, our investigation using the murine Rab7 knockdown model revealed that the reduction of Rab7 expression in the intestine exacerbates inflammation. Subsequent analysis demonstrated that the absence of Rab7 disrupts goblet cell secretory function, consequently contributing to heightened inflammation. Our findings overall suggest that Rab7 downregulation is not merely a consequence but plays a contributory role in aggravating inflammation in the context of colitis.

      (5) The axes labels in Fig 5 are not readable. It is unclear how Rab7 KD is more similar in gut microbiota phenotypes to DSS than to CScr.

      Response: The microbial analysis revealed an abnormal composition of gut microbiota in Rab7KD mice compared to CScr. Interestingly, this composition exhibited some similarity to the inflamed gut microbiota observed in DSSScr mice. The analysis further demonstrated a shift in microbial diversity in Rab7KD mice, showcasing characteristics akin to those observed in inflamed mice. This similarity in gut microbiota phenotypes between Rab7KD and DSSScr suggests a potential link or influence of Rab7 downregulation on the microbiota, contributing to the observed similarities with DSS-induced inflammation.

      (6) The use of mucous proteomics to identify mechanisms of Rab7-mediated phenotype is a good approach. The replicates in the proteomics dataset (Fig 6F) do not seem to match. Detailing of methodology used for analysis will help to overcome these doubts.

      Response: The identified proteins in different samples of mucus proteomics were subjected to label free quantification. Subsequently, the significantly altered proteins were subjected to analysis with the False Discovery Rate (FDR) to control for potential false positives and ascertain the validity of the findings.
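
      The response does not say which FDR procedure was applied. Purely as an illustration, the sketch below assumes the widely used Benjamini-Hochberg step-up procedure and applies it to a list of hypothetical p-values; the function name and the example values are not taken from the manuscript.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg adjusted p-values and rejection mask.

    Returns (adjusted, rejected), both in the original input order.
    """
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0.0, 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted, adjusted <= alpha

# Hypothetical p-values for four proteins.
adjusted, rejected = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

      Reporting the adjusted values alongside the raw p-values would let readers judge how the replicates in the proteomics dataset contribute to each call.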

      (7) It will be good to see the immunoblots showing the negative correlation between Rab7 and CLCA1 in Fig 7D.

      Response: Fig. 7C shows the western blot for protein expression of CLCA1 in the same control and UC samples that were used in Fig. 1F to show Rab7 expression. Fig. 7D is the quantitative correlation plot for Fig. 1F (Rab7 expression) and Fig. 7C (CLCA1 expression).

      (8) Why is UC different from the DSS model for Rab7 gene expression but not protein levels? Endosomal counts could help address this.

      Response: We encountered challenges in accurately counting the individual puncta of Rab7 expression in immunofluorescence images due to the nature of tissue samples. Locating endosomes within a single cell proved challenging, and the proximity of many puncta made it difficult to delineate them individually. Despite these technical difficulties, correlating Rab7 expression with endosomal counts remains a compelling prospect and may well be an area for future investigation.

    1. Author response:

      (1) General Statements

      As you will see in our attached rebuttal to the reviewers, we have added several new experiments and revised the manuscript to fully address their concerns.

      (2) Point-by-point description of the revisions

      Reviewer #1:

      Evidence, reproducibility and clarity

      Summary:

      The manuscript by Yang et al. describes a new CME accessory protein. CCDC32 has been previously suggested to interact with AP2 and in the present work the authors confirm this interaction and show that it is a bona fide CME regulator. In agreement with its interaction with AP2, CCDC32 recruitment to CCPs mirrors the accumulation of clathrin. Knockdown of CCDC32 reduces the amount of productive CCPs, suggestive of a stabilisation role in early clathrin assemblies. Immunoprecipitation experiments mapped the interaction of CCDC32 to the α-appendage of the AP2 complex α-subunit. Finally, the authors show that the CCDC32 nonsense mutations found in patients with cardio-facial-neuro-developmental syndrome disrupt the interaction of this protein with the AP2 complex. The manuscript is well written and the conclusions regarding the role of CCDC32 in CME are supported by good quality data. As detailed below, a few improvements/clarifications are needed to reinforce some of the conclusions, especially the ones regarding CFNDS.

      We thank the referee for their positive comments. In light of a recently published paper describing CCDC32 as a co-chaperone required for AP2 assembly (Wan et al., PNAS, 2024, see reviewer 2), we have added several additional experiments to address all concerns and consequently gained further insight into CCDC32-AP2 interactions and the important dual role of CCDC32 in regulating CME. 

      Major comments:

      (1) Why could the protein be visualized at CCPs only after knockdown of the endogenous protein? This is highly unusual, especially in stable cell lines. Could it be that the tag interferes with the function of the expressed protein, rendering it incapable of outcompeting the endogenous protein? Does this point to a regulated recruitment?

      The reviewer is correct: this would be unusual. However, it is not the case. We misspoke in the text (although the figure legend was correct): these experiments were performed without siRNA knockdown, and we can indeed detect eGFP-CCDC32 being recruited to CCPs in the presence of endogenous protein. Nonetheless, we repeated the experiment to be certain (see Author response image 1).

      Author response image 1.

      Cohort-averaged fluorescence intensity traces of CCPs (marked with mRuby-CLCa) and CCP-enriched eGFP-CCDC32(FL).

      (2) The disease mutation used in the paper does not correspond to the truncation found in patients. The authors use a 1-54 truncation, but the patients described in Harel et al. have frameshifts at positions 19 (Thr19Tyrfs*12) and 64 (Glu64Glyfs*12), while the patient described in Abdalla et al. has a deletion of two introns, leading to a frameshift around amino acid 90. Moreover, to precisely test the function of these disease mutations, one would need to add the extra amino acids generated by the frameshift. For example, as denoted in the mutation description in Harel et al., the frameshift at position 19 changes Threonine 19 to a Tyrosine and adds a run of 12 extra amino acids (Thr19Tyrfs*12).

      The labels of the disease mutants p.(Thr19Tyrfs*12) and p.(Glu64Glyfs*12) are based on a 194 aa polypeptide version of CCDC32 initiated at a nonconventional start site that contains a 9 aa peptide (VRGSCLRFQ) upstream of the N-terminus we show. Thus, we are indeed using the appropriate mutation site (see: https://www.uniprot.org/uniprotkb/Q9BV29/entry). The reviewer is correct that we have not included the extra 12 aa in our construct; however, as these residues are not present in the other CFNDS mutants, we think it unlikely that they contribute to the disease phenotype. Rather, as neither of the clinically observed mutations contains the 78-98 aa sequence required for AP2 binding and CME function, we are confident that this defect contributes to the disease. Thus, we are including the data on the CCDC32(1-54) mutant, as we believe these results provide a valuable physiological context to our studies.

      (3) The frameshift caused by the CFNDS mutations (especially the one studied) will likely lead to nonsense mediated RNA decay (NMD). The frameshift is well within the rules where NMD generally kicks in. Therefore, I am unsure about the functional insights of expressing a disease-related protein which is likely not present in patients.

      We thank the reviewer for bringing up this concern. However, as shown in new Figure S1, the mutant protein is expressed at levels comparable to the WT, suggesting that NMD is not occurring.

      (4) Coiled coils generally form stable dimers. The typically hydrophobic core of these structures is not suitable for transient interactions. This complicates the interpretation of the results regarding the role of this region as the place where the interaction with AP2 occurs. If the coiled coil holds a stable CCDC32 dimer, disrupting this dimer could reduce the affinity for AP2 (by reduced avidity) at the actual binding site. A construct with an orthogonal dimeriser or a pulldown of the delta78-98 protein with the GST AP2a-AD could be a good way to sort out this issue.

      We were unable to model a stable dimer (or other oligomer) of this protein with high confidence using AlphaFold 3.0. Moreover, we were unable to detect endogenous CCDC32 coimmunoprecipitating with eGFP-CCDC32 (Fig. S6C). Thus, we believe that the moniker, based solely on the alpha-helical content of the protein, is a misnomer. We have explained this in the main text.

      Minor comments:

      (1) The authors interchangeably use the term "flat CCPs" and "flat clathrin lattices". While these are indeed related, flat clathrin lattices have been also used to refer to "clathrin plaques". To avoid confusion, I suggest sticking to the term "flat CCPs" to refer to the CCPs which are in their early stages of maturation.

      Agreed. Thank you for the suggestion. We have renamed these structures flat clathrin assemblies, as they do not acquire the curvature needed to classify them as pits, and do not grow to the size that would classify them as plaques.

      Significance

      General assessment:

      CME drives the internalisation of hundreds of receptors and surface proteins in practically all tissues, making it an essential process for various physiological processes. This versatility comes at the cost of a large number of molecular players and regulators. To understand this complexity, unravelling all the components of this process is vital. The manuscript by Yang et al. gives an important contribution to this effort as it describes a new CME regulator, CCDC32, which acts directly at the main CME adaptor AP2. The link to disease is interesting, but the authors need to refine their experiments. The requirement for endogenous knockdown for recruitment of the tagged CCDC32 is unusual and requires further exploration.

      Advance:

      The increased frequency of abortive events presented by CCDC32 knockdown cells is very interesting, as it hints at an active mechanism that regulates the stabilisation and growth of clathrin coated pits. The exact way clathrin coated pits are stabilised is still an open question in the field.

      Audience:

      This is a basic research manuscript. However, given the essential role of CME in physiology and the growing number of CME players involved in disease, this manuscript can reach broader audiences.

      We thank the referee for recognizing the ‘interesting’ advances our studies have made and for considering these studies as ‘an important contribution’ to ‘an essential process for various physiological processes’ and able ‘to reach broader audiences’. We have addressed and reconciled the reviewer’s concerns in our revised manuscript. 

      Field of expertise of the reviewer:

      Clathrin mediated endocytosis, cell biology, microscopy, biochemistry.

      Reviewer #2:

      Evidence, reproducibility and clarity

      In this manuscript, the authors demonstrate that CCDC32 regulates clathrin-mediated endocytosis (CME). Some of the findings are consistent with a recent report by Wan et al. (2024 PNAS), such as the observation that CCDC32 depletion reduces transferrin uptake and diminishes the formation of clathrin-coated pits. The primary function of CCDC32 is to regulate AP2 assembly, and its depletion leads to AP2 degradation. However, this study did not examine AP2 expression levels. CCDC32 may bind to the appendage domain of AP2 alpha, but it also binds to the core domain of AP2 alpha.

      We thank the reviewer for drawing our attention to the Wan et al. paper, which appeared while this work was under review. However, our in vivo data are not fully consistent with the report from Wan et al. The discrepancies reveal a dual function of CCDC32 in CME that was masked by complete knockout vs siRNA knockdown of the protein, and were also likely affected by the position of the GFP-tag (C- vs N-terminal) on this small protein. Thus:

      -  Contrary to Wan et al., we do not detect any loss of AP2 expression (see new Figure S3A-B) upon siRNA knockdown. Most likely the ~40% residual CCDC32 present after siRNA knockdown is sufficient to fulfill its catalytic chaperone function but not its structural role in regulating CME beyond the AP2 assembly step.  

      - Contrary to Wan et al., we have shown that CCDC32 indeed interacts with the intact AP2 complex (Figure S3C and 6B,C): all 4 subunits of the AP2 complex co-IP with full length eGFP-CCDC32. Interestingly, whereas the full length CCDC32 pulls down the intact AP2 complex, the ∆78-98 mutant retains its ability to pull down the β2-µ2 hemicomplex, but its interactions with α:σ2 are severely reduced. While this result is consistent with the report of Wan et al. that CCDC32 binds to the α:σ2 hemicomplex, it also suggests that the interactions between CCDC32 and AP2 are more complex and will require further studies.

      - Contrary to Wan et al., we provide strong evidence that CCDC32 is recruited to CCPs. Interestingly, modeling with AlphaFold 3.0 identifies a highly probable interaction between alpha helices encoded by residues 66-91 on CCDC32 and residues 418-438 on α. The latter are masked by µ2-C in the closed conformation of the AP2 core, but exposed in the open conformation triggered by cargo binding, suggesting that CCDC32 might only bind to membrane-bound AP2.

      Thus, our findings are indeed novel and indicate striking multifunctional roles for CCDC32 in CME, making the protein well worth further study. 

      (1) Besides its role in AP2 assembly, CCDC32 may potentially have another function on the membrane. However, there is no direct evidence showing that CCDC32 associates with the plasma membrane.

      We disagree: our data clearly show that CCDC32 is recruited to CCPs (Fig. 1B) and that CCPs that fail to recruit CCDC32 are short-lived and likely abortive (Fig. 1C). Wan et al. did not observe any colocalization of C-terminally tagged CCDC32 to CCPs, whereas we detect recruitment of our N-terminally tagged construct, which we also show is functional (Fig. 6F). Further, we have demonstrated the importance of the C-terminal region of CCDC32 in membrane association (see new Fig. S7). Thus, we speculate that a C-terminally tagged CCDC32 might not be fully functional. Indeed, SIM images of the C-terminally tagged CCDC32 in Wan et al. show large (~100 nm) structures in the cytosol, which may reflect aggregation.

      (2) CCDC32 binds to multiple regions on AP2, including the core domain. It is important to distinguish the functional roles of these different binding sites.

      We have localized the AP2-ear binding region to residues 78-98 and shown these to be critical for the functions we have identified. As described above, we now include data that are complementary to those of Wan et al. However, our data also clearly point to additional binding modalities. We agree that it will be important to map these additional interactions and identify their functional roles, but this is beyond the scope of this paper.

      (3) AP2 expression levels should be examined in CCDC32 depleted cells. If AP2 is gone, it is not surprising that clathrin-coated pits are defective.

      Agreed. We have examined this by western blotting (Figure S3A-B) and detect no reduction in levels of any of the AP2 subunits in CCDC32 siRNA knockdown cells. As stated above, this could be due to the residual CCDC32 present after siRNA KD, in contrast to the CRISPR-mediated gene KO.

      (4) If the authors aim to establish a secondary function for CCDC32, they need to thoroughly discuss the known chaperone function of CCDC32 and consider whether and how CCDC32 regulates a downstream step in CME.

      Agreed. We have described the Wan et al paper, which came out while our manuscript was in review, in our Introduction.  As described above, there are areas of agreement and of discrepancies, which are thoroughly documented and discussed throughout the revised manuscript.  

      (5) The quality of Figure 1A is very low, making it difficult to assess the localization and quantify the data.

      The low signal:noise in Fig. 1A that the reviewer is concerned about is due to a diffuse distribution of CCDC32 on the inner surface of the plasma membrane. We now more explicitly describe this binding, which we believe reflects a specific interaction mediated by the C-terminus of CCDC32; thus the degree of diffuse membrane binding we observe follows: eGFP-CCDC32(FL) > eGFP-CCDC32(∆78-98) > eGFP-CCDC32(1-54) ~ eGFP/background (see new Fig. S7). Importantly, the colocalization of CCDC32 at CCPs is confirmed by the dynamic imaging of CCPs (Fig 1B).

      (6) In Figure 6, why aren't AP2 mu and sigma subunits shown?

      Agreed. Not being aware of CCDC32’s possible dual role as a chaperone, we had assumed that the AP2 complex was intact.  We have now added this data in Figure 6 B,C and Fig. S3C, as discussed above. 

      Page 5, top, this sentence is confusing: "their surface area (~17 x 10 nm<sup>2</sup>) remains significantly less than that required for the average 100 nm diameter CCV (~3.2 x 103 nm<sup>2</sup>)."

      Thank you for the criticism. We have clarified the sentence and corrected a typo, which would definitely be confusing.  The section now reads,  “While the flat CCSs we detected in CCDC32 knockdown cells were significantly larger than in control cells (Fig. 4D, mean diameter of 147 nm vs. 127 nm, respectively), they are much smaller than typical long-lived flat clathrin lattices (d≥300 nm)(Grove et al., 2014). Indeed, the surface area of the flat CCSs that accumulate in CCDC32 KD cells (mean ~1.69 x 10<sup>4</sup> nm<sup>2</sup>) remains significantly less than the surface area of an average 100 nm diameter CCV (~3.14 x 10<sup>4</sup> nm<sup>2</sup>). Thus, we refer to these structures as ‘flat clathrin assemblies’ because they are neither curved ‘pits’ nor large ‘lattices’. Rather, the flat clathrin assemblies represent early, likely defective, intermediates in CCP formation.” 
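
      As a rough geometric check of the areas quoted above, the sketch below compares a flat disc with a sphere. It assumes that a flat CCS can be approximated as a circular disc of the mean diameter (147 nm) and a CCV as a sphere of 100 nm diameter; the function names are illustrative, and this is not the analysis used in the manuscript, where the mean area may have been measured per structure.

```python
import math


def disc_area_nm2(diameter_nm: float) -> float:
    """Area of a flat circular disc: pi * r^2."""
    return math.pi * (diameter_nm / 2.0) ** 2


def sphere_surface_area_nm2(diameter_nm: float) -> float:
    """Surface area of a sphere: 4 * pi * r^2 = pi * d^2."""
    return math.pi * diameter_nm ** 2


# Assumed idealized geometries, using the diameters quoted in the response.
flat_ccs_area = disc_area_nm2(147.0)          # flat assembly in CCDC32 KD cells
ccv_area = sphere_surface_area_nm2(100.0)     # average 100 nm diameter CCV

print(f"flat CCS disc area : {flat_ccs_area:.3e} nm^2")
print(f"CCV surface area   : {ccv_area:.3e} nm^2")
print(f"ratio flat/CCV     : {flat_ccs_area / ccv_area:.2f}")
```

      Under these assumptions the disc area is close to the reported mean of ~1.69 x 10<sup>4</sup> nm<sup>2</sup> and is roughly half the sphere surface of ~3.14 x 10<sup>4</sup> nm<sup>2</sup>, in line with the revised sentence.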

      Significance

      Overall, while this work presents some interesting ideas, it remains unclear whether CCDC32 regulates AP2 beyond the assembly step.

      Our responses above argue that we have indeed established that CCDC32 regulates AP2 beyond the assembly step. We have also identified several discrepancies between our findings and those reported by Wan et al., most notably binding between CCDC32 and mature AP2 complexes and the AP2-dependent recruitment of CCDC32 to CCPs.  It is possible that these discrepancies may be due to the position of the GFP tag (ours is N-terminal, theirs is C-terminal; we show that the N-terminal tagged CCDC32 rescues the knockdown phenotype, while Wan et al., do not provide evidence for functionality of the C-terminal construct). 

      Reviewer #3: 

      Evidence, reproducibility and clarity (Required): 

      In this manuscript, Yang et al. characterize the endocytic accessory protein CCDC32, which has implications in cardio-facio-neuro-developmental syndrome (CFNDS). The authors clearly demonstrate that the protein CCDC32 has a role in the early stages of endocytosis, mainly through the interaction with the major endocytic adaptor protein AP2, and they identify regions taking part in this recognition. Through live cell fluorescence imaging and electron microscopy of endocytic pits, the authors characterize the lifetimes of endocytic sites, the formation rate of endocytic sites and pits and the invagination depth, in addition to transferrin receptor (TfnR) uptake experiments. Binding between CCDC32 and CCDC32 mutants to the AP2 alpha appendage domain is assessed by pull down experiments. Together, these experiments allow deriving a phenotype of CCDC32 knock-down and CCDC32 mutants within endocytosis, which is a very robust system, in which defects are not so easily detected. A mutation of CCDC32, known to play a role in CFNDS, is also addressed in this study and shown to have endocytic defects.

      We thank the reviewer for their positive remarks regarding the quality of our data and the strength of our conclusions.  

      In summary, the authors present a strong combination of techniques, assessing the impact of CCDC32 in clathrin mediated endocytosis and its binding to AP2, whereby the following major and minor points remain to be addressed: 

      - The authors show that CCDC32 depletion leads to the formation of brighter and static clathrin coated structures (Figure 2), but that these were only prevalent to 7.8% and masked the 'normal' dynamic CCPs. At the same time, the authors show that the absence of CCDC32 induces pits with shorter life times (Figure 1 and Figure 2), the 'majority' of the pits.

      Clarification is needed as to how the authors arrive at these conclusions and these numbers. The authors should also provide (and visualize) the corresponding statistics. The same statement is made again later on in the manuscript, where the authors explain their electron microscopy data. Was the number derived from there? 

      These points are critical to understanding CCDC32's role in endocytosis and are key to understanding the model presented in Figure 8. The numbers of how many pits accumulate in flat lattices versus normal endocytosis progression and the actual time scales could be included in this model and would make the figure much stronger. 

      Thank you for these comments.  We understand the paradox between the visual impression and the reality of our dynamic measurements. We have been visually misled by this in previous work (Chen et al., 2020), which emphasizes the importance of unbiased image analysis afforded to us through the well-documented cmeAnalysis pipeline, developed by us (Aguet et al., 2013) and now used by many others (e.g. (He et al., 2020)). 

      The % of static structures was not derived from electron microscopy data, but quantified using cmeAnalysis, which automatically provides the lifetime distribution of CCPs. We have now clarified this in the manuscript and added a histogram (Fig. S4) quantifying the fraction of CCPs in lifetime cohorts <20s, 21-60s, 61-100s, 101-150s and >150s (static).
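
      The cohort binning described above can be sketched as follows. This is a minimal illustration, not the cmeAnalysis implementation: the function name and the example lifetimes are hypothetical, and the treatment of a lifetime of exactly 20 s (placed in the shortest cohort) is an assumption, since the response does not specify the boundary.

```python
from bisect import bisect_left

# Upper edges (in seconds) of the first four cohorts; anything longer is "static".
COHORT_UPPER_EDGES_S = [20, 60, 100, 150]
COHORT_LABELS = ["<=20s", "21-60s", "61-100s", "101-150s", ">150s (static)"]


def cohort_fractions(lifetimes_s):
    """Return the fraction of CCP lifetimes falling in each cohort."""
    if not lifetimes_s:
        raise ValueError("at least one lifetime is required")
    counts = [0] * len(COHORT_LABELS)
    for lifetime in lifetimes_s:
        # bisect_left puts a value equal to an edge into the lower cohort.
        counts[bisect_left(COHORT_UPPER_EDGES_S, lifetime)] += 1
    total = len(lifetimes_s)
    return {label: count / total for label, count in zip(COHORT_LABELS, counts)}


# Hypothetical lifetimes (seconds) from one movie, for illustration only.
example_lifetimes = [10, 15, 18, 30, 45, 70, 120, 200]
fractions = cohort_fractions(example_lifetimes)
for label, fraction in fractions.items():
    print(f"{label:>15}: {fraction:.3f}")
```

      The fraction in the last cohort corresponds to the percentage of static structures discussed in the response.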

      - In relation to the above point, the statistics of Figure 2E-G and the analysis leading there should also be explained in more detail: For example, what are the individual points in the plot (also in Figures 6G and 7G)? The authors should also use a few phrases to explain software they use, for example DASC, in the main text. 

      Each point in these bar graphs represents a movie, where n≥12. These details have been added to the respective figure legend. We have also added a brief description of DASC analysis in the text. 

      -  There are several questions related to the knock-down experiments that need to be addressed:

      Firstly, knock-down of CCDC32 does not seem to be very strong (Figure S2B). Can the level of knock-down be quantified? 

      We have now quantified the KD efficiency. It is ~60%. This turns out to be fortuitous (see responses to reviewer 2), as a recent publication, which came out after we completed our study, has shown by CRISPR-mediated knockout that CCDC32 also has an essential chaperone function required for AP2 assembly. We do not see any reduction in AP2 levels or its complex formation under our conditions (see new Supplemental Figure S3), which suggests that the effects of CCDC32 on CCP dynamics are more sensitive to CCDC32 concentration than is its role as a chaperone. Our phenotypes would have been masked by more efficient depletion of CCDC32.

      On page 6 it is indicated that the eGFP-CCDC32(1-54) and eGFP-CCDC32(∆78-98) constructs are siRNA-resistant. However, in Fig S2B, these proteins do not show any signal in the western blot, so it is not clear if they are expressed or simply not detected by the antibody. The presence of these proteins after silencing endogenous CCDC32 needs to be confirmed to support Figures 6 and 7, which critically rely on the presence of the CCDC32 mutants. 

      Unfortunately, the C-terminally truncated CCDC32 proteins are not detected because they lack the antibody epitope; indeed, even the ∆78-98 deletion is poorly detected (compare the GFP blot in new S1A with the anti-CCDC32 blot in S1B). However, these constructs contain the same siRNA-resistance mutation as the full length protein. That they are expressed and siRNA resistant can be seen in Fig. S2A (now Fig. S1A) blotting for GFP.

      In Figures 6 and 7, siRNA knock-down of CCDC32 is only indicated for sub-figures F to G. Is this really the case? If not, the authors should clarify. The siRNA knock-down in Figure 1 is also only mentioned in the text, not in the figure legend. The authors should pay attention to make their figure legends easy to understand and unambiguous. 

      No, it is not the case.  Thank you for pointing out the uncertainty. We have added these details to the Figure legends and checked all Figure legends to ensure that they clearly describe the data shown.  

      - It is not exactly clear how the curves in Figure 3C (lower panel) on the invagination depth were obtained. Can the authors clarify this a bit more? For example, what are kT and kE in Figure 3A? What is I0? And how did the authors derive the logarithmic function used to quantify the invagination depth? In the main text, the authors say that the traces were 'logarithmically transformed'. This is not a technical term. The authors should refer to the actual equation used in the figure. 

      This analysis was developed by the Kirchhausen lab (Saffarian and Kirchhausen, 2008). We have added these details and reference them in the Figure legend and in the text. We also now use the more accurate descriptor ‘log-transformed’.

      - In the discussion, the claim 'The resulting dysregulation of AP2 inhibits CME, which further results in the development of CFNDS.' is maybe a bit too strong of a statement. Firstly, because the authors show themselves that CME is perturbed, but by no means inhibited. Secondly, the molecular link to CFNDS remains unclear. Even though CCDC32 mutants seem to be responsible for CFNDS and one of the mutants has been shown in this study to have a defect in endocytosis and AP2 binding, a direct link between CCDC32's function in endocytosis and CFNDS remains elusive. The authors should thus provide a more balanced discussion on this topic. 

      We have modified and softened our conclusions, which now read that the phenotypes we see likely “contribute to” rather than “cause” the disease.

      - In Figure S1, the authors annotate the presence of a coiled-coil domain, which they also use later on in the manuscript to generate mutations. Could the authors specify (and cite) where and how this coiled-coil domain has been identified? Is this predicted helix indeed a coiled-coil domain, or just a helix, as indicated by the authors in the discussion?

      See response to Reviewer 1, point 4. We have changed this wording to alpha-helix. The ‘coiled-coil’ reference is historical and unlikely to be a true reflection of CCDC32 structure. AlphaFold 3.0 predictions were unable to identify with certainty any coiled-coil structures, even when we modelled potential dimers or trimers, and we find no evidence of dimerization of CCDC32 in vivo. We have clarified this in the text.

      Minor comments

      - In general, a more detailed explanation of the microscopy techniques used and the information they report would be beneficial to provide access to the article also to non-expert readers in the field. This concerns particularly the analysis methods used, for example: 

      How were the cohort-averaged fluorescence intensity and lifetime traces obtained? 

      How do the tools cmeAnalysis and DASC work? A brief explanation would be helpful. 

      We have expanded Methods to add these details, and also described them in the main text. 

      - The axis label of Figure 2B is not quite clear. What does 'TfnR uptake % of surface bound' mean? Maybe the authors could explain this in more detail in the figure legend? Is the drop in uptake efficiency also accessible by visual inspection of the images? It would be interesting to see that. 

      This is a standard measure of CME efficiency. 'TfnR uptake % of surface bound' = Internalized TfnR/Surface bound TfnR. Again, images may be misleading as defects in CME lead to increased levels of TfnR on the cell surface, which in turn would result in more Tfn uptake even if the rate of CME is decreased.
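
      A minimal sketch of this ratio, using hypothetical fluorescence values (not data from the study), illustrates why the raw internalized signal can mislead: when surface TfnR accumulates, more Tfn can be internalized in absolute terms even though the efficiency of uptake falls. The function name is illustrative.

```python
def uptake_efficiency_percent(internalized: float, surface_bound: float) -> float:
    """TfnR uptake as % of surface bound = internalized / surface bound * 100."""
    if surface_bound <= 0:
        raise ValueError("surface-bound signal must be positive")
    return 100.0 * internalized / surface_bound


# Hypothetical signals in arbitrary units, for illustration only.
control = uptake_efficiency_percent(internalized=50.0, surface_bound=100.0)
knockdown = uptake_efficiency_percent(internalized=60.0, surface_bound=200.0)

# The knockdown internalizes more in absolute terms (60 vs 50) but is less
# efficient, because twice as much receptor sits on the surface.
print(f"control:   {control:.1f}% of surface bound")
print(f"knockdown: {knockdown:.1f}% of surface bound")
```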

      - Figure 4: How is the occupancy of CCPs in the plasma membrane measured? What are the criteria used to divide CCSs into Flat, Dome or Sphere categories? 

      We have expanded Methods to add these details. Based on the degree of invagination, the shapes of CCSs were classified as either: flat CCSs with no obvious invagination; dome-shaped CCSs that had a hemispherical or less invaginated shape with visible edges of the clathrin lattice; and spherical CCSs that had a round shape in which the edges of the clathrin lattice were not visible in 2D projection images. In most cases, the shapes were obvious in 2D PREM images. In uncertain cases, the degree of CCS invagination was determined using images tilted at ±10–20 degrees. The areas of CCSs were measured using ImageJ and used to calculate the CCS occupancy on the plasma membrane.

      - Figure 5B: Can the authors explain, where exactly the GFP was engineered into AP2 alpha? This construct does not seem to be explained in the methods section. 

      We have added this information to the Methods section. The construct, which corresponds to an insertion of GFP into the flexible hinge region of AP2, at aa649, was first described by Mino et al. (2020) and shown to be fully functional.

      - Figure S1B: The authors should indicate the colour code used for the structural model.

      We have expanded our structural modeling using AlphaFold 3.0 in light of the recent publication suggesting that CCDC32 interacts with the µ2 subunit and does not bind full length AP2. These results are described in the text. The color coding now reflects certainty values given by AlphaFold 3.0 (Fig. S6B, D).

      - The list of primers referred to in the materials and methods section does not exist. There is a Table S1, but this contains different data. The actual Table S1 is not referenced in the main text. This should be done. 

      We apologize for this error. We have now added this information in Table S2.

      Significance (Required):

      In this study, the authors analyse a so-far poorly understood endocytic accessory protein, CCDC32, and its implication for endocytosis. The experimental tool set used, allowing to quantify CCP dynamics and invagination is clearly a strength of the article that allows assessing the impact of an accessory protein towards the endocytic uptake mechanism, which is normally very robust towards mutations. Only through this detailed analysis of endocytosis progression could the authors detect clear differences in the presence and absence of CCDC32 and its mutants. If the above points are successfully addressed, the study will provide very interesting and highly relevant work allowing a better understanding of the early phases in CME with implication for disease. 

      The study is thus of potential interest to an audience interested in CME, in disease and its molecular reasons, as well as for readers interested in intrinsically disordered proteins to a certain extent, claiming thus a relatively broad audience. The presented results may initiate further studies of the so-far poorly understood and less well known accessory protein CCDC32.

      We thank the reviewer for their positive comments on the significance of our findings and the importance of our detailed phenotypic analysis made possible by quantitative live cell microscopy. We also believe that our new structural modeling of CCDC32 and our findings of complex and extensive interactions with AP2 make the reviewer’s point regarding intrinsically disordered proteins even more interesting and relevant to a broad audience. We trust that our revisions indeed address the reviewer’s concerns.

      The field of expertise of the reviewer is structural biology, biochemistry and clathrin mediated endocytosis. Expertise in cell biology is rather superficial.

      References:

      Aguet, F., Costin N. Antonescu, M. Mettlen, Sandra L. Schmid, and G. Danuser. 2013. Advances in Analysis of Low Signal-to-Noise Images Link Dynamin and AP2 to the Functions of an Endocytic Checkpoint. Developmental Cell. 26:279-291.

      Chen, Z., R.E. Mino, M. Mettlen, P. Michaely, M. Bhave, D.K. Reed, and S.L. Schmid. 2020. Wbox2: A clathrin terminal domain–derived peptide inhibitor of clathrin-mediated endocytosis. Journal of Cell Biology. 219.

      Grove, J., D.J. Metcalf, A.E. Knight, S.T. Wavre-Shapton, T. Sun, E.D. Protonotarios, L.D. Griffin, J. Lippincott-Schwartz, and M. Marsh. 2014. Flat clathrin lattices: stable features of the plasma membrane. Mol Biol Cell. 25:3581-3594.

      He, K., E. Song, S. Upadhyayula, S. Dang, R. Gaudin, W. Skillern, K. Bu, B.R. Capraro, I. Rapoport, I. Kusters, M. Ma, and T. Kirchhausen. 2020. Dynamics of Auxilin 1 and GAK in clathrin-mediated traffic. J Cell Biol. 219.

      Mino, R.E., Z. Chen, M. Mettlen, and S.L. Schmid. 2020. An internally eGFP-tagged α-adaptin is a fully functional and improved fiduciary marker for clathrin-coated pit dynamics. Traffic. 21:603-616.

      Saffarian, S., and T. Kirchhausen. 2008. Differential evanescence nanometry: live-cell fluorescence measurements with 10-nm axial resolution on the plasma membrane. Biophys J. 94:2333-2342.

    1. Author response:

      Reviewer #1 (Evidence, reproducibility and clarity):

      Minor comments:

      In the results section (lines 498-499), the authors describe free kinetochores in many cells without associated spindle microtubules. However, some nuclei appear to have kinetochores, as presented in Figure 6. Could the authors clarify how this conclusion was derived using transmission electron microscopy (TEM) without serial sectioning, as this is not explicitly mentioned in the materials and methods?

      We observed free kinetochores in the ALLAN-KO parasites with no associated spindle microtubules (see Fig. 6Gh), while kinetochores are attached to spindle microtubules in WT-GFP cells (see Fig. 6Gc). To provide further evidence, we analysed additional images and found that ALLAN-KO cells have free kinetochores in the centre of the nucleus, unattached to spindle microtubules. We provide additional images clearly showing free kinetochores in these cells (new supplementary Fig. S11).

      However, in the ALLAN mutant, this difference is not absolute: in a search of over 50 cells, one example of a cell with a “normal” nuclear spindle and attached kinetochores was observed.

      The use of serial sectioning has limitations for examining small structures like kinetochores in whole cells. The limitations of the various techniques (for example, SBF-SEM vs tomography) are highlighted in our previous study (Hair et al 2022; PMID: 38092766), and we consider that examining a population of randomly sectioned cells provides a better understanding of the overall incidence of specific features.

      Discussion Section:

      Could the authors expand on why SUN1 and ALLAN are not required during asexual replication, even though they play essential roles during male gametogenesis?

      We observed no phenotype in asexual blood stage parasites associated with the sun1 and allan gene deletions. Several other Plasmodium berghei gene knockout parasites with a phenotype in sexual stages, for example CDPK4 (PMID: 15137943), SRPK (PMID: 20951971), PPKL (PMID: 23028336) and kinesin-5 (PMID: 33154955) have no phenotype in blood stages, so perhaps this is not surprising. One explanation may be the substantial differences in the mode of cell division between these two stages. Asexual blood stages produce new progeny (merozoites) over 24 hours with closed mitosis and asynchronous karyokinesis during schizogony, while male gametogenesis is a rapid process, completed within 15 min to produce eight flagellated gametes. During male gametogenesis the nuclear envelope must expand to accommodate the increased DNA content (from 1N to 8N) before cytokinesis. Furthermore, male gametogenesis is the only stage of the life cycle to make flagella, and axonemes must be assembled in the cytoplasm to produce the flagellated motile male gametes at the end of the process. Thus, these two stages of parasite development have some very different and specific features.

      Lines 611-613 state: "These loops serve as structural hubs for spindle assembly and kinetochore attachment at the nuclear MTOC, separating nuclear and cytoplasmic compartments." Could the authors elaborate on the evidence supporting this statement?

      We observed the loops/folds in the nuclear envelope (NE) as revealed by SUN1-GFP and 3D TEM images during male gametogenesis. These folds/loops occur mainly in the vicinity of the nuclear MTOC where the spindles are assembled (as visualised by EB1 fluorescence) and attached to kinetochores (as visualised by NDC80 fluorescence). These loops/folds may form due to the contraction of the spindle pole back to the nuclear periphery, inducing distortion of the NE. Since there is no physical segregation of chromosomes during the three rounds of mitosis (DNA increasing from 1N to 8N), we suggest that these folds provide additional space for spindle and kinetochore dynamics within an intact NE to maintain separation from the cytoplasm (as shown by location of kinesin-8B).

      In lines 621-622, the authors suggest that ALLAN may have a broader role in NE remodelling across the parasite's lifecycle. Could they reflect on or remind readers of the finding that ALLAN is not essential during the asexual stage?

      ALLAN-GFP is expressed throughout the parasite life cycle but as the reviewer points out, a functional role is more pronounced during male gametogenesis. This does not mean that it has no role at other stages of the life cycle even if there is no obvious phenotype following deletion of the gene during the asexual blood stage. The fact that ALLAN is not essential during the asexual blood stage is noted in lines 628-29.

      Reviewer #2 (Evidence, reproducibility and clarity):

      Introduction

      Line 63: The authors state: "NE is integral to mitosis, supporting spindle formation, kinetochore attachment, and chromosome segregation..". Seemingly at odds, they also say (Line 69) that 'open' mitosis is "characterized by complete NE disassembly".

      The authors could explain better the ideas presented in their quoted review from Dey and Baum, which points out that truly 'open' and 'closed' topologies may not exist and that even in 'open' mitosis, remnants of the NE may help support the mitotic spindle.

      We have modified the sentence in which we discuss current opinions about ‘open’ and ‘closed’ mitosis. It is believed that there is no complete disassembly of the NE during open mitosis and no completely intact NE during closed mitosis, respectively. In fact, the NE plays a critical role in the different modes of mitosis during MTOC organisation and spindle dynamics. Please see the modified lines 64-71.

      Results

      Fig 7 is the final figure; but would be more useful upfront.

      We have provided a new introductory figure (Fig 1) showing a schematic of conventional /canonical LINC complexes and evidence of SUN protein functions in model eukaryotes and compare them to what is known in apicomplexans.

      Fig 1D. The authors generated C-terminal GFP-tagged SUN1 transfectants and used ultrastructure expansion microscopy (U-ExM) and structured illumination microscopy (SIM) to examine SUN1-GFP in male gametocytes post-activation. The immuno-labelling of SUN1-GFP in these fixed cells appears very different to the live cell images of SUN1-GFP. The labelling profile comprises distinct punctate structures (particularly in the U-ExM images), suggesting that the paraformaldehyde fixation process, followed by the addition of the primary and secondary antibodies, has caused coalescing of the SUN1-GFP signal into particular regions within the NE.

      We agree with the reviewer. Fixation with paraformaldehyde (PFA) results in a coalescence of the SUN1-GFP signal. We have also tried methanol fixation (see new Fig. S2), but a similar problem was encountered.

      Given these fixation issues, the suggestion that the SUN1-GFP signal is concentrated at the BB/ nuclear MTOC and "enriched near spindle poles" needs further support.

      These statements seem at odds with the data for live cell imaging where the SUN1-GFP seems evenly distributed around the nuclear periphery. Can the observation be quantitated by calculating the percentage of BB/ nuclear MTOC structures with associated SUN1-GFP puncta? If not, I am not convinced these data help understand the molecular events.

      We agree with the reviewer that whilst live cell imaging showed an even distribution of SUN1-GFP signal, after fixation with either PFA or methanol SUN1-GFP puncta are observed in addition to the peripheral location around the stained DNA (Hoechst) (see Fig. S2; puncta are indicated by arrows). These SUN1-GFP labelled puncta were observed at the junction of the nuclear MTOC and the basal body (Fig. 2F). Quantification of the distribution showed that these SUN1-GFP puncta are associated with the nuclear MTOC in more than 90% of cells (18 cells examined). Live cell imaging of the dual-labelled parasites, SUN1xkinesin-8B (Fig. 2H) and SUN1xEB1 (Fig. 2I), provides further support for the association of SUN1-GFP puncta with the BB (kinesin-8B) / nuclear MTOC (EB1).

      The authors then generated dual transfectants and examined the relative locations of different markers in live cells. These data are more informative.

      The authors state; " ..SUN1-GFP marked the NE with strong signals located near the nuclear MTOCs situated between the BB tetrads". The nuclear MTOCs are not labelled in this experiment. The SUN1-GFP signal between the kinesin-8B puncta is evident as small puncta on regions of NE distortion. I would prefer to not describe this signal as "strong". The signal is stronger in other regions of the NE.

      We have modified the sentence on line 213 to accommodate this suggestion.

      Line 219. The authors state; "..SUN1-GFP is partially colocalized with spindle poles as indicated by EB1,.. it shows no overlap with kinetochores (NDC80)." The authors should provide an analysis of the level of overlap at a pixel by pixel level to support this statement.

      We now provide the overlap at a pixel-by-pixel level for representative images, and we have quantified more cells (n>30), as documented in the new Fig. S4A. We have also modified the sentence on line 219 to reflect these additions.
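
      For readers unfamiliar with pixel-by-pixel overlap measures, the two most common ones can be sketched as follows. This is an illustrative sketch only and is not the analysis used for Fig. S4A; the function names, the threshold parameter and the toy intensity values are assumptions made for the example. Pearson's coefficient measures how well the intensities of the two channels co-vary across pixels, whereas a Manders-type coefficient measures the fraction of one channel's signal that lies in pixels where the other channel is present.

```python
import math


def pearson(ch1, ch2):
    """Pearson correlation between two equally sized lists of pixel intensities."""
    n = len(ch1)
    if n == 0 or n != len(ch2):
        raise ValueError("channels must be non-empty and of equal length")
    mean1 = sum(ch1) / n
    mean2 = sum(ch2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(ch1, ch2))
    var1 = sum((a - mean1) ** 2 for a in ch1)
    var2 = sum((b - mean2) ** 2 for b in ch2)
    if var1 == 0 or var2 == 0:
        return float("nan")  # undefined when a channel has constant intensity
    return cov / math.sqrt(var1 * var2)


def manders_m1(ch1, ch2, threshold2=0):
    """Fraction of channel-1 intensity found in pixels where channel 2 exceeds threshold2."""
    total = sum(ch1)
    if total == 0:
        return float("nan")  # undefined when channel 1 has no signal
    overlapping = sum(a for a, b in zip(ch1, ch2) if b > threshold2)
    return overlapping / total


# Toy example: flattened pixel intensities for two hypothetical channels.
green = [10, 0, 5, 5]
red = [0, 3, 4, 0]
r = pearson(green, red)
m1 = manders_m1(green, red)
```

      In practice the intensity lists would be taken from a background-subtracted region of interest, and the threshold would be chosen to exclude background in the second channel.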

      The SUN1 construct is C-terminally GFP-tagged. By analogy with human SUN1, the C-terminal SUN domain is expected to be in the NE lumen. That is in a different compartment to EB1, which is located in the nuclear lumen (on the spindle). Thus, the overlap of signal is expected to be minimal.

      We agree with the reviewer that the overlap between EB1 and Sun1 signals is expected to be minimal. We have quantified the data and included it in Supplementary Fig. S4A.

      Similarly, given that EB1 and NDC80 are known to occupy overlapping locations on the spindle, it seems unlikely that SUN1 can overlap with one and not the other.

      We agree with the reviewer’s analysis that EB1 and NDC80 occupy overlapping locations on the spindle, although the length of NDC80 is less at the ends of spindles (see Author response image 1A) as shown in our previous study where we compared the locations of two spindle proteins, ARK2 and EB1, with that of NDC80 (Zeeshan et al, 2022; PMID: 37704606). In the present study we observed that Sun1-GFP partially overlaps with EB1 at the ends of the spindle, but not with NDC80. Please see Author response image 1B.

      Author response image 1.

      I note on Line 609, the authors state "Our study demonstrates that SUN1 is primarily localized to the nuclear side of the NE.." As per Fig 7D, and as discussed above, the bulk of the protein, including the SUN1 domain, is located in the space between the INM and the ONM.

      We appreciate the reviewer’s correction; we have now modified the sentence to indicate that the protein is largely localized in the space between the INM and the ONM on line 617.

      Interestingly, as the authors point out, nuclear membrane loops are evident around EB1 and NDC80 focal regions. The data suggests that the contraction of the spindle pole back to the nuclear periphery induces distortion of the NE.

      We agree with the reviewer’s suggestion that the data indicate that contraction of spindle poles back to the nuclear periphery may induce distortion of the NE.

      The authors should discuss further the overlap of findings of this study with that from a recent manuscript (https://doi.org/10.1016/j.cels.2024.10.008). That Sayers et al. study identified a complex of SUN1 and ALLC1 as essential for male fertility in P. berghei. Sayers et al. also provide evidence that this complex participates in the linkage of the MTOC to the NE and is needed for correct mitotic spindle formation during male gametogenesis.

      We thank the reviewer for this suggestion. The study by Sayers et al. (2024) was published while our manuscript was under preparation. It was interesting to see that these complementary studies have similar findings about the role of SUN1 and the novel SUN1-ALLAN complex. Our study contains a more detailed, in-depth analysis of SUN1 by both expansion microscopy and TEM. We include additional studies on the role of ALLAN. We discuss the overlap in the findings of the two studies in lines 590-605.

      While the work is interesting, the conclusions may need to be tempered. The authors suggest that, in the absence of KASH-domain proteins, the SUN1-ALLAN complex forms a non-canonical LINC complex (that is, a connection across the NE) that "achieves precise nuclear and cytoskeletal coordination".

      We have toned down the wording of this conclusion in lines 665-677.

      In other organisms, KASH interacts with the C-terminal domain on SUN1, which as mentioned above is located between the INM and ONM. By contrast, ALLAN interacts with the N-terminal domain of SUN1, which is located in the nuclear lumen. The SUN1-ALLAN interaction is clearly of interest, and ALLAN might replace some of the roles of lamins. However, the protein that functionally replaces KASH (i.e. links SUN1 to the ONM) remains unidentified.

      We agree with the reviewer, and future studies will need to focus on identifying the KASH replacement that links SUN1 to the ONM.

      It may also be premature to suggest that the SUN1-ALLAN complex is a promising target for blocking malaria transmission. How would it be targeted?

      We have deleted the sentence that raised this suggestion.

      While the above datasets are interesting and internally consistent, there are two other aspects of the manuscript that need further development before they can usefully contribute to the molecular story.

      The authors undertook a transcriptomic analysis of Δsun1 and WT gametocytes, at 8 and 30 min post-activation, revealing moderate changes (~2-fold change) in different genes. GO-based analysis suggested up-regulation of genes involved in lipid metabolism. Given the modest changes, it may not be correct to conclude that "lipid metabolism and microtubule function may be critical functions for gametogenesis that can be perturbed by sun1 deletion." These changes may simply be a consequence of the stalled male gametocyte development.

      Following the reviewer’s suggestion we have moved these data to the supplementary information (Fig. S5D-I) and toned down their discussion in the results and discussion sections.

      The authors have then undertaken a detailed lipid analysis of the Δsun1 and WT gametocytes, before and after activation. Substantial changes in lipid metabolites might not be expected in such a short period of time. And indeed, the changes appear minimal. Similarly, there are only minor changes in a few lipid sub-classes between Δsun1 and WT gametocytes. In my opinion, the data are not sufficient to support the authors' conclusion that "SUN1 plays a crucial role, linking lipid metabolism to NE remodelling and gamete formation."

      In agreement with the reviewer’s comments we have moved these data to supplementary information (Fig. S6) and substantially toned down the conclusions based on these findings.

      Reviewer #3 (Evidence, reproducibility and clarity):

      Major comments:

      My main concern with this manuscript is that the authors conclude not only that SUN1 is important for spindle formation and basal body segregation, but also that it influences lipid metabolism and NE dynamics. I don't think the data supports this conclusion, for several reasons listed below. I would suggest removing this claim from the manuscript or at least toning it down unless more supporting data are provided, in particular showing any change in NE dynamics in the SUN1-KO. Instead I would recommend focusing on the more interesting role of SUN1-ALLAN in bipartite MTOC organisation, which likely explains all observed phenotypes (including those in later stages of the parasite life cycle). In addition, some aspects of the knockout phenotype should be quantified in more depth.

      In more detail:

      - The lipidomics analysis is clearly the weakest point of the manuscript: The authors state that there are significant changes in some lipid populations between WT and sun1-KO, and between activated and non-activated cells, yet no statistical analysis is shown and the error bars are quite high compared to only minor changes in the means. For some discussed lipids, the result text does not match the graphs, e.g. PA, where the increase upon activation is more pronounced in the SUN1-KO vs WT (contrary to the text), or MAG, which is reduced in the SUN1-KO vs WT (contrary to the text). I don't see the discussed changes in arachidonic acid levels and myristic acid levels in the data either. Even if the authors find after analysis some statistically significant differences between some groups, they should carefully discuss the biological significance of these differences. As it is, I do not think the presented data warrants the conclusion that deletion of SUN1 changes lipid homeostasis, but rather shows that overall lipid homeostasis is not majorly affected by gametogenesis or SUN1 deletion. As a minor comment, if you decide to keep the lipidomics analysis in the manuscript, please state how many replicates were done.

      As detailed above we have moved the lipidomics data to supplementary information (Fig. S6) and substantially toned down the discussion of these data in the results and discussion sections.

      - I can't quite follow the logic why the authors performed transcriptomic analysis of the SUN1 and how they chose their time points. Their data up to this point indicate that SUN1 has a structural or coordinating role in the bipartite MTOC during male gametogenesis. Based on that it is rather unlikely that SUN1 KO directly leads to transcriptional changes within the 8 min of exflagellation. Isn't it more likely that transcriptional differences are purely a downstream effect of incomplete/failed gametogenesis? This is particularly true for the comparison at 30 min, which compares a mixture of exflagellated/emerged gametes and zygotes in WT to a mixture of aberrant, arrested gametes in the knockout, which will likely not give any meaningful insight. The by far most significant GO-term is then also nuclear-transcribed mRNA catabolic process, which is likely not related at all to SUN1 function (and the authors do not even comment on this in the main text). I would therefore suggest removing the 30 min data set from this manuscript. As a minor point, I would suggest highlighting some of the top de-regulated gene IDs in the volcano plots and stating their function. Also, please state how you prepared the cells for the transcriptomes and in how many replicates this was done.

      As suggested by the reviewer we have removed the 30 min post activation data from the manuscript. We have also moved the rest of the transcriptomics data to supplementary information (Fig. S5) and toned down the presentation of this aspect of the work in the results and discussion sections.

      - Live-cell imaging of SUN1-GFP does nicely visualise the NE during gametogenesis, showing a highly dynamic NE forming loops and folds, which is very exciting to see. It would be beneficial to also show a video from the live-cell imaging.

      We have now added videos to the manuscript as suggested by the reviewer. Please see the supplementary Videos S1 and S2.

      In their discussion, the authors state multiple times that NE dynamics are changed upon SUN1 KO. Yet, they do not provide data supporting this claim, i.e. that the extended loops and folds found in the nuclear envelope during gametogenesis are affected in any way by the knockout of SUN1 or ALLAN. What happens to the NE in the absence of SUN1? Are there fewer loops and folds? In the absence of a reliable NE marker this may not be entirely easy to address, but at least some SBF-SEM images of the sun1-KO gametocytes could provide insight.

      It was difficult to provide SBF-SEM images as that work is beyond the scope of this manuscript. We will consider this approach in our future work. We re-examined many of our TEM images of SUN1-KO and ALLAN-KO parasites and did find some micrographs showing aberrant nuclear membrane folding (<5%) (Please see Author response image 2). However, we also observed similar structures in some of the WT-GFP samples (<5%), so we do not think this is a strong phenotype of the SUN1 or ALLAN mutants.

      Author response image 2.

       

      - I think the exciting part of the manuscript is the cell biological role of SUN1 on male gametogenesis, which could be carved out a bit more by a more detailed phenotyping. Specifically it would be good to quantify

      (1) If DNA replication to an octoploid state still occurs in SUN1-KO and ALLAN-KO,

      DNA replication is not affected in the SUN1-KO and ALLAN-KO mutants: DNA content increases to 8N (data added in Fig. 3J and Fig. S10F).

      (2) The proportion of anucleated gametes in WT and the KO lines

      We have added these data in Fig. 3K and Fig. S10G

      (3) A quantification of the BB clustering phenotype (in which proportion of cells do the authors see this phenotype). This could be addressed by simple fixed immunofluorescence images of the respective WT/KO lines at various time points after activation (or possibly by reanalysis of the already obtained images) and would really improve the manuscript.

      We have reanalysed the BB clustering phenotype and added the quantitative data in Fig. 4E and Fig. S7.

      Especially the claim that emerged SUN1-KO gametes lack a nucleus is currently only based on single slices of a few TEM cells and would benefit from a more thorough quantification in both SUN1- and ALLAN-KOs.

      We have examined many microgametes (100+ sections). In WT parasites a small proportion of gametes can appear to lack a nucleus if it does not extend all the way to the apical and basal ends (Hair et al. 2022). However, the proportion of microgametes that appear to lack a nucleus (no nucleus seen in any section) was much higher in the SUN1 mutant. In contrast, this difference was not as clear cut in the ALLAN mutant with a small proportion of intact (with axoneme and nucleus) microgametes being observed.

      We have done additional analysis of male gametes, looking for the presence of the nucleus by live cell imaging after DNA staining with Hoechst. These data are added in Fig. 3K (for Sun1-KO) and Fig. S10G (for Allan-KO).

      - The TEM suggests that in the SUN1-KO, kinetochores are free in the nucleus. Are all kinetochores free or do some still associate to a (minor/incorrectly formed) spindle? The authors could address this by tagging NDC80 in the KO lines.

      Our observation and quantification of the data indicated that 100% of kinetochores were attached to spindle microtubules and that 0% were unattached kinetochores in the WT parasites. However, the exact opposite was found for the SUN1 mutant with 100% unattached kinetochores and 0% attached. The result was not quite as clear cut in the ALLAN mutant, with 98% unattached and 2% attached. An important observation was the lack of separation of the nuclear poles and any spindle formation. Spindle formation was never or very rarely observed in the mutants.

      - Finally, I think it is curious that in contrast to SUN1, ALLAN seems to be less important, with some KO parasites completing the life cycle. Maybe a more detailed phenotyping as above gives some more hints as to where the phenotypic difference between the two proteins lies. I would assume some ALLAN-KO cells can still segregate the basal body. Can the authors speculate/discuss in more detail why these two proteins seem to have slightly different phenotypes?

      We agree with the reviewer. Overall, the ALLAN-KO has a less prominent phenotype than that of the Sun1-KO. The main difference is that in the ALLAN-KO mutant some basal body segregation can occur, leading to the production of some fertile microgametocytes, and ookinetes, and oocyst formation (Fig. 8). Approximately 5% of oocysts sporulated to release infective sporozoites that could infect mice in bite back experiments and complete the life cycle. In contrast the Sun1-KO mutant made no healthy oocysts, or infective sporozoites, and could not complete the life cycle in bite back experiments. We have analysed the phenotype in detail and provide quantitative data for gametocyte stages by EM and ExM in Figs. 4 and S8 (SUN1) and Figs. 7 and S11 (ALLAN). We have also performed detailed analysis of oocyst and sporozoite stages and included the data in Fig. 3 (SUN1) and S10 (ALLAN).

      Based on the location, and functional and interactome data, we think that SUN1 plays a central role in coordinating nucleoplasm and cytoplasmic events as a key component of the nuclear membrane lumen, whereas ALLAN is located in the nucleoplasm. Deleting the SUN1 gene may disrupt the connection between INM and ONM whereas the deletion of ALLAN may affect only the INM.

      Some additional points where the data is not entirely sound yet or could be improved:

      - Localisation of SUN1: There seems to be a discrepancy between SUN1-GFP location as observed by live cell microscopy, and by Expansion Microscopy (ExM), similar for ALLAN-GFP. By live-cell microscopy, the SUN1 localisation is much more evenly distributed around the NE, while the localisation in ExM is much more punctate, and e.g. in Figure 1E seems to be within the nucleus. Do the authors have an explanation for this? Also, in Fig. 1D there are two GFP foci at the cell periphery (bottom left of the image), which I would think are not SUN1-Foci, as they seem to be outside of the cell. Is the antibody specific? Was there a negative control done for the antibody (WT cells stained with GFP antibodies after ExM)?

      High resolution SIM and expansion microscopy showed that the SUN1-GFP molecules coalesce to form puncta, in contrast to the more uniform distribution observed by live cell imaging. This apparent difference may be due to a better resolution that could not be achieved by live cell imaging. We agree with the reviewer that the two green foci are outside of the cell. As a negative control we have used WT-ANKA cells (which contain no GFP) and the anti-GFP antibody, which gave no signal. This confirms the specificity of the antibody (please see the new Fig. S3). 

      - The authors argue that SIM gave unexpected results due to PFA fixation leading to collapse of the NE loops. However, they also fix their ExM cells and their EM cells with PFA and do not observe a collapse, at least from what I see in the two presented images and in the 3D reconstruction. Is there something else different in the sample preparation?

      There was no difference in the fixation process for samples examined by SIM and ExM, but we used an anti-GFP antibody in ExM to visualise the SUN1-GFP, while in SIM the images of GFP signal were collected directly after fixation. We used both PFA and methanol as fixative, and both methods showed a coalescing of the SUN1-GFP signal (please see the new Fig. S2 and S3).

      Can the authors trace their NE in ExM according to the NHS-Ester signal?

      We could trace the NE in the ExM by the NHS-ester signal and observed that the SUN1-GFP signal was largely coincident with the NE (Please see the new Fig. S3B).

      - Fig 2D: It would be good to not just show images of oocysts but actually quantify their size from images. Also, have the authors determined the sporozoite numbers in SUN1-KO?

      We have measured oocyst size (data added in new Fig. 3) and added the sporozoite quantification data in Fig. 3D.

      - Line 481-483: the authors state that oocyst size is reduced in ALLAN-KO but do not show the data. Please quantify oocyst size or at least show representative images. Also the drastic decrease in sporozoite numbers (Fig. 6D, E) is not mentioned in the text. Please add reference to Fig S7D when talking about the bite back data.

      We have added the oocyst size data in Fig. S10. We mention the changes in sporozoite numbers (now shown in Fig. 7D, E), and refer to the bite back data shown in current Fig. 7E.

      - Fig S1C, 6C: Both WB images are stitched, but this is not clearly indicated e.g. by leaving a small gap between the lanes. Also please show a loading control along with the western blots. Also there seems to be a (unspecific?) band in the control, running at the same height as Allan-GFP WB. What exactly is the control?

      We have provided the original blot showing the bands of ALLAN-GFP and SUN1-GFP. As a positive control, we used an RNA associated protein (RAP-GFP) that is highly expressed in Plasmodium and regularly used in our lab for this purpose.

      - Regarding the crossing experiment: The authors conclude from this cross that SUN1 is only needed in males, yet for this conclusion they would need to also show that a cross with a female line does not rescue the phenotype. The authors should repeat the cross with a male-deficient line to really test if the phenotype is an exclusively male phenotype. In addition, line 270-272 states that no oocysts/sporozoites were detected in sun1-ko and nek4-ko parasites. However, the figure 2E shows only oocysts, not sporozoites, and shows also that sun1-ko does form oocysts, albeit dead ones.

      We have now performed the experiment of crossing the Sun1-KO parasite line with a male deficient line (Hap2-KO) and added the data in Fig. 3I. We have added images showing sporozoites in oocysts.

      - In Fig S1 the authors show that they also generated a SUN1-mCherry line, yet they do not use it in any of the presented experiments (unless I missed it). Would it be beneficial to cross the SUN1-mCherry line with the Allan1-GFP line to test colocalisation (possibly also by expansion microscopy)?

      We did generate a SUN1-mCherry line, with the intent to cross ALLAN-GFP and SUN1-mCherry lines and observe the co-location of the proteins. Despite multiple attempts this cross was unsuccessful. This may have been due to the close proximity of the two proteins, such that the presence of both the GFP and mCherry tags interfered with a proper protein-protein interaction between them.

      - Line 498: "In a significant proportion of cells" - What was the proportion of cells, and what does significant mean in this context?

      Approximately 67% of cells showed the clumping of BBs. We have now added the numbers in Figs. 6H and S11I.

      - The authors should discuss a bit more how their work relates to the work of Sayers et al. 2024, which also identified the SUN1-ALLAN complex. The paper is cited, but only very briefly commented on.

      We have extended this discussion now in lines 590-605.

      Suggestions how to improve the writing and data presentation.

      - General presentation of microscopy images: Considering that large parts of the manuscript are based on microscopy data, their presentation could be improved. Single-channel microscopy images would benefit from being depicted in gray scale instead of color, which would make it easier to see the structures and intensities (especially for blue channels).

      Whilst we agree with the reviewer, sometimes it is difficult to see the features in the merged images. Therefore, we would like to request to be allowed to retain the colours, which can be easily followed in both individual and merged images.

      Also, it would be good to harmonize in which panels arrows are shown (e.g. Fig 1G, where some white arrows are in the SUN1-GFP panel, while others are in the merge panel, but they presumably indicate the same thing). At the same time, Fig 1H doesn't have any white arrows, even though the figure legend states so.

      We apologise for this lack of consistency, and we have now added arrows wherever they were missing to harmonise the presentation.

      Fig 3A and S4 show the same experiment but are coloured in different colours (NHS-Ester in green vs grey scale).

      - Are the scale bars of all expansion microscopy images adjusted for the expansion factor?

      Yes, the scale bars are adjusted accordingly.

      - The figure legends would benefit from streamlining, as they have very different style between figures (eg Fig. 6 which has a concise figure legend vs microscopy figures where figure legends are very long and describe not only the figure but the results)

      The figure legends have been streamlined, with removal of the description of results.

      - Line 155-156: The text makes it sound like the expression only happens after activation. Is that the case? Are these images activated or non-activated gametocytes?

      They are expressed before activation, but the signal intensifies after activation. Images from before and after activation of gametocytes have been added in Fig. S1F.

      - Line 267: Reference to the original nek4-KO paper missing

      This reference is now included.

      - Line 301: The reference to Figure 2J seems to be a bit arbitrarily placed. Also, this schematic of lipid metabolism is never discussed in relation to the transcriptomic or lipidomic data.

      We have moved these data to supplementary information and modified the text.

      - Line 347-349 states that gametes emerged, but the referenced figure shows activated gametocytes before exflagellation.

      We have corrected the text to the start of exflagellation.

      - Line 588: Spelling mistake in SUN1-domain

      Corrected.

      - Line 726/731: i missing in anti-GFP

      Corrected.

      - Line 787-789: statement of scale bar and number of cells imaged is not at the right position in the figure legend.

      Moved to the right place.

      - Line 779, 783: "shades of green" should be just "green". Same goes for line 986, 989 with "shades of grey"

      Changed.

      - Line 974, 976: please correct to WT-GFP and dsun1

      Corrected.

      - Line 1041, 1044: WT-GFP instead of WTGFP.

      Corrected to WT-GFP.

      - Fig 1B, D, E, Fig S1G, H: What are the time points of imaging?

      We have added the time points to the images in these figures.

      - Fig 1D/Line 727: the scale of the scale bar on the inset is missing.

      We have added the scale bar.

      - Fig 3 E-G and 6H-J: Please indicate total number of cells/images analysed per quantification, either in the graphs themselves or in the figure legend.

      We now indicate the number of cells analysed in the individual figures and also in Fig. S5C and S8C, respectively.

      - Fig 5B: What is NP

      Nuclear Pole (NP), also known as the nuclear/acentriolar MTOC (Zeeshan et al 2022; PMID: 35550346).

      - Fig S1B/D: The legend states that there is an arrow indicating the band, but there is none.

      We have added the arrow.

      - Fig S2C: Is the scale bar really the same for the zygote and the ookinete?

      We have checked this and used the same for both zygote and ookinete.

      - Fig S3C, S7C: which stages was qRT-PCR done on?

      Gametocytes activated for 8 min.

      - Fig. S3D, S7D: According to the figure legend, three independent experiments were performed. How many mice were used per experiment? It would be good to depict the individual data points instead of the bar graph. For S7D, 3 data points are depicted (one in WT, two in allan-KO), what do they mean?

      The bite back experiment was performed using 15-20 mosquitoes infected with WT-GFP and gene knockout lines to feed on one naïve mouse each, in three different experiments. We have now included the data points in the bar diagrams.

      - Fig S3: Panel letters E and G are missing

      We have updated the lettering in the current Fig. S5.

      - Fig 3D: Please indicate what those boxes are. I presume that these are the insets shown in b, e and j, but it is never mentioned. J is not even larger than i. Also, f is quite cropped, it would be good to see the large-scale image it comes from to see where in the nucleus these kinetochores are placed. Were there unbound kinetochores found in WT?

      We mention the boxes in the figure legends. It is rare to find unbound kinetochores in WT parasites. We provide large-scale and zoomed-in images of free kinetochores in Fig. S8.

      - Fig S4: Insets are not mentioned in the figure legend. Please add scale bar to zoom-ins

      We now describe the insets in the figure legends and have added scale bars to the zoomed-in images.

      - Fig S5A, B: Please indicate which inset belongs to which sub-panel. Where does Ac stem from?

      We have now included the full image showing the inset (new Fig. S8).

      - Fig S5C and S8C: Change "DNA" to "Nucleus".

      We have changed “DNA” to “Nucleus”. Now they are Fig. S8K and S11I.

      Reviewer #3 (Significance):

      Yet, the statement that SUN1 is also important for lipid homoeostasis and NE dynamics is currently not backed up by sufficient data. I believe that the manuscript would benefit from removing the less convincing transcriptomic and lipidomic datasets and rather focus on more deeply characterising the cell biology of the knockouts. This way, the results would be interesting not only for parasitologists, but also for more general cell biologists.

      We have moved the lipidomics and transcriptomics data to supplementary information and toned down the emphasis on these data to make the manuscript more focused on the cell biology and analysis of the genetic KO data.

    1. Author response:

      eLife assessment

      This study is a detailed investigation of how chromatin structure influences replication origin function in yeast ribosomal DNA, with focus on the role of the histone deacetylase Sir2 and the chromatin remodeler Fun30. Convincing evidence shows that Sir2 does not affect origin licensing but rather affects local transcription and nucleosome positioning which correlates with increased origin firing. However, the evidence remains incomplete as the methods employed do not rigorously establish a key aspect of the mechanism, fully address some alternative models, or sufficiently relate to prior results. Overall, this is a valuable advance for the field that could be improved to establish a more robust paradigm.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper presents a mechanistic study of rDNA origin regulation in yeast by SIR2. Each of the ~180 tandemly repeated rDNA gene copies contains a potential replication origin. Early-efficient initiation of these origins is suppressed by Sir2, reducing competition with origins distributed throughout the genome for rate-limiting initiation factors. Previous studies by these authors showed that SIR2 deletion advances replication timing of rDNA origins by a complex mechanism of transcriptional de-repression of a local PolII promoter causing licensed origin proteins (MCM complexes) to re-localize (slide along the DNA) to a different (and altered) chromatin environment. In this study, they identify a chromatin remodeler, FUN30, that suppresses the sir2∆ effect, and remarkably, results in a contraction of the rDNA to about one-quarter its normal length/number of repeats, implicating replication defects of the rDNA. Through examination of replication timing, MCM occupancy and nucleosome occupancy on the chromatin in sir2, fun30, and double mutants, they propose a model where nucleosome position relative to the licensed origin (MCM complexes) intrinsically determines origin timing/efficiency. While their interpretations of the data are largely reasonable and can be interpreted to support their model, a key weakness is the connection between Mcm ChEC signal disappearance and origin firing. While the cyclical chromatin association-dissociation of MCM proteins with potential origin sequences may be generally interpreted as licensing followed by firing, dissociation may also result from passive replication and as shown here, displacement by transcription and/or chromatin remodeling.

      While it is true that both transcription and passive replication can cause the signal of MCM-ChEC to disappear, neither can cause selective disappearance of the displaced complex without affecting the non-displaced complex.  Indeed, in the case of transcription, RNA polymerase transcribing C-pro would have to first dislodge the normally positioned MCM complex before even reaching the displaced complex.  Furthermore, deletion of FUN30 leads to both more C-pro transcription and less disappearance of the displaced MCM complex.  It is important to keep in mind that this cannot somehow reflect continuous replenishment of displaced MCMs with newly loaded MCMs, since the cells are in S phase and licensing is restricted to G1. 

      Moreover, linking its disappearance from chromatin in the ChEC method with such precise resolution needs to be validated against an independent method to determine the initiation site(s). Differences in rDNA copy number and relative transcription levels also are not directly accounted for, obscuring a clearer interpretation of the results.

      Copy number reduction of the magnitude caused by deletion of SIR2 and FUN30 does not suppress the sir2∆ effect (i.e. early replication of the rDNA), but rather exacerbates it. In particular, deletion of SIR2 and FUN30 causes the rDNA to shrink to approximately 35 copies. Kwan et al., 2023 (PMID: 36842087) have shown that reduction of rDNA copy number to 35 causes a dramatic acceleration of rDNA replication in a SIR2 strain. Thus, the effect of rDNA size on replication timing reinforces our conclusion that deletion of FUN30 suppresses rDNA replication.

      However, to address this concern directly, in the revision we will include 2D gels in fob1 strains with equal numbers of repeats, which allow us to conclude that the effect of FUN30 deletion in suppressing rDNA origin firing is independent of both rDNA size and FOB1. The figure showing the critical 2D gels is included below in the reply to Reviewer 2.

      Nevertheless, this paper makes a valuable advance with the finding of Fun30 involvement, which substantially reduces rDNA repeat number in sir2∆ background. The model they develop is compelling and I am inclined to agree, but I think the evidence on this specific point is purely correlative and a better method is needed to address the initiation site question. The authors deserve credit for their efforts to elucidate our obscure understanding of the intricacies of chromatin regulation. At a minimum, I suggest their conclusions on these points of concern should be softened and caveats discussed. Statistical analysis is lacking for some claims.

      Strengths are the identification of FUN30 as suppressor, examination of specific mutants of FUN30 to distinguish likely functional involvement. Use of multiple methods to analyze replication and protein occupancies on chromatin. Development of a coherent model.

      Weaknesses are failure to address copy number as a variable; insufficient validation of ChEC method relationship to exact initiation locus; lack of statistical analysis in some cases. 

      The two potential initiation sites that one would monitor (non-displaced and displaced) are separated by less than 150 base pairs, and other techniques simply do not have the resolution necessary to distinguish such differences. Furthermore, as we suggest in the manuscript, our results are consistent with a model in which it is only the displaced MCM complex that is activated, whether in sir2 or WT. If no genotype-dependent difference in initiation sites is even expected, it would be hard to interpret even the most precise replication-based assays. However, the reviewer is correct that this is a novel technique and that confirmation with a well-established technique is comforting; we are therefore performing ChIP experiments to corroborate, to the extent possible, the conclusions that we reached with ChEC.

      We appreciate the reviewer pointing out that some statistical analyses were lacking, and we will correct this in a revised manuscript.

      Additional background and discussion for public review:

      This paper broadly addresses the mechanism(s) that regulate replication origin firing in different chromatin contexts. The rDNA origin is present in each of ~180 tandem repeats of the rDNA sequence, representing a high potential origin density per length of DNA (9.1kb repeat unit). However, the average origin efficiency of rDNA origins is relatively low (~20% in wild-type cells), which reduces the replication load on the overall genome by reducing competition with origins throughout the genome for limiting replication initiation factors. Deletion of histone deacetylase SIR2, which silences PolII transcription within the rDNA, results in increased early activation of the rDNA origins (and reduced rate of overall genome replication). Previous work by the authors showed that MCM complexes loaded onto the rDNA origins (origin licensing) were laterally displaced (sliding) along the rDNA, away from a well-positioned nucleosome on one side. The authors' major hypothesis throughout this work is that the new MCM location(s) are intrinsically more efficient configurations for origin firing. The authors identify a chromatin remodeling enzyme, FUN30, whose deletion appears to suppress the earlier activation of rDNA origins in sir2∆ cells. Indeed, it appears that the reduction of rDNA origin activity in sir2∆ fun30∆ cells is severe enough to result in a substantial reduction in the rDNA array repeat length (number of repeats); the reduced rDNA length presumably facilitates its more stable replication and maintenance.

      Analysis of replication by 2D gels is marginally convincing, using 2D gels for this purpose is very challenging and tricky to quantify. The more quantitative analysis by EdU incorporation is more convincing of the suppression of the earlier replication caused by SIR2 deletion.

      To address the mechanism of suppression, they analyze MCM positioning using ChEC, which in G1 cells shows partial displacement of MCM from normal position A to positions B and C in sir2∆ cells and similar but more complete displacement away from A to positions B and C in sir2fun30 cells. During S-phase in the presence of hydroxyurea, which slows replication progression considerably (and blocks later origin firing) MCM signals redistribute, which is interpreted to represent origin firing and bidirectional movement of MCMs (only one direction is shown), some of which accumulate near the replication fork barrier, consistent with their interpretation. They observe that MCMs displaced (in G1) to sites B or C in sir2∆ cells, disappear more rapidly during S-phase, whereas the similar dynamic is not observed in sir2∆fun30∆. This is the main basis for their conclusion that the B and C sites are more permissive than A. While this may be the simplest interpretation, there are limitations with this assay that undermine a rigorous conclusion (additional points below). The main problem is that we know the MCM complexes are mobile so disappearance may reflect displacement by other means including transcription which is high in the sir2∆ background. Indeed, the double mutant has greater level of transcription per repeat unit which might explain more displacement from A in G1. Thus, displacement might not always represent origin firing. Because the sir2 background profoundly changes transcription, and the double mutant has a much smaller array length associated with higher transcription, how can we rule out greater accessibility at site A, for example in sir2∆, leading to more firing, which is suppressed in sir2 fun30 due to greater MCM displacement away from A?

      I think the critical missing data to solidly support their conclusions is a definitive determination of the site(s) of initiation using a more direct method, such as strand specific sequencing of EdU or nascent strand analysis. More direct comparisons of the strains with lower copy number to rule out this facet. As discussed in detail below, copy number reduction is known to suppress at least part of the sir2∆ effect so this looms over the interpretations. I think they are probably correct in their overall model based on the simplest interpretation of the data but I think it remains to be rigorously established. I think they should soften their conclusions in this respect.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors follow up on their previous work showing that in the absence of the Sir2 deacetylase the MCM replicative helicase at the rDNA spacer region is repositioned to a region of low nucleosome occupancy. Here they show that the repositioned displaced MCMs have increased firing propensity relative to non-displaced MCMs. In addition, they show that activation of the repositioned MCMs and low nucleosome occupancy in the adjacent region depend on the chromatin remodeling activity of Fun30.

      Strengths:

      The paper provides new information on the role of a conserved chromatin remodeling protein in the regulation of origin firing and in addition provides evidence that not all loaded MCMs fire and that origin firing is regulated at a step downstream of MCM loading.

      Weaknesses:

      The relationship between the author's results and prior work on the role of Sir2 (and Fob1) in regulation of rDNA recombination and copy number maintenance is not explored, making it difficult to place the results in a broader context. Sir2 has previously been shown to be recruited by Fob1, which is also required for DSB formation and recombination-mediated changes in rDNA copy number. Are the changes that the authors observe specifically in fun30 sir2 cells related to this pathway? Is Fob1 required for the reduced rDNA copy number in fun30 sir2 double mutant cells? 

      Strains lacking SIR2 have unstable rDNA size, and FOB1 deletion stabilizes rDNA size in the sir2 background. Likewise, FOB1 deletion influences the kinetics of rDNA size reduction in sir2 fun30 cells. However, the main effect of Fun30 in sir2 cells that we were interested in, suppression of rDNA replication, is preserved in the fob1 background, arguing that the observed effect is independent of Fob1 (see figure below). Given that the main focus of the paper is the regulation of rDNA origin activity and that these changes were independent of Fob1, we had elected not to include these results in the original manuscript but will gladly include them in the revision.

      Besides refuting the possible role of Fob1 in the FUN30-mediated activation of rDNA origin firing in sir2 cells, the use of the fob1 background enabled us to compare the activation of rDNA origins in the sir2 and sir2 fun30 strains with equally short rDNA size. The 2D gels demonstrate a dramatic suppression of rDNA origin activity upon deletion of FUN30 in the sir2 fob1 strains with 35 rDNA copies.

      Author response image 1.

      The deletion of FUN30 diminishes the replication bubble signal in a fob1 sir2 strain with 35 rDNA copies by more than tenfold. The single rARS signal, marked with the arrow, originates from the rightmost rDNA repeat. This specific rightmost rDNA NheI fragment is approximately 25 kb in size, distinctly larger than the 4.7 kb NheI 1N rARS-containing fragments that originate from the internal rDNA repeats.

      Reviewer #3 (Public Review):

      Summary:

      Heterochromatin is characterized by low transcription activity and late replication timing, both dependent on the NAD-dependent protein deacetylase Sir2, the founding member of the sirtuins. This manuscript addresses the mechanism by which Sir2 delays replication timing at the rDNA in budding yeast. Previous work from the same laboratory (Foss et al. PLoS Genetics 15, e1008138) showed that Sir2 represses transcription-dependent displacement of the Mcm helicase in the rDNA. In this manuscript, the authors show convincingly that the repositioned Mcms fire earlier and that this early firing partly depends on the ATPase activity of the nucleosome remodeler Fun30. Using read-depth analysis of sorted G1/S cells, fun30 was the only chromatin remodeler mutant that somewhat delayed replication timing in sir2 mutants, while nhp10, chd1, isw1, htl1, swr1, isw2, and irc5 had no effect. The conclusion was corroborated with orthogonal assays including two-dimensional gel electrophoresis and analysis of EdU incorporation at early origins. Using an insightful analysis with an Mcm-MNase fusion (Mcm-ChEC), the authors show that the repositioned Mcms in sir2 mutants fire earlier than the Mcm at the normal position in wild type. This early firing at the repositioned Mcms is partially suppressed by Fun30. In addition, the authors show Fun30 affects nucleosome occupancy at the sites of the repositioned Mcm, providing a plausible mechanism for the effect of Fun30 on Mcm firing at that position. However, the results from the MNase-seq and ChEC-seq assays are not fully congruent for the fun30 single mutant. Overall, the results support the conclusions providing a much better mechanistic understanding how Sir2 affects replication timing at rDNA.

      The reason that the results for the fun30 single mutant appear incongruent, with a larger signal of the +2 nucleosome in the MNase-seq plot but a negligible signal in the ChEC-seq plot is the paucity of displaced Mcm in the fun30 single mutant. Given the relative absence of displaced MCMs, the MCM-MNase fusion protein can't "light up" the +2 nucleosome.  We will comment on this in the revision to clarify this. 

      Strengths

      (1) The data clearly show that the repositioned Mcm helicase fires earlier than the Mcm in the wild type position.

      (2) The study identifies a specific role for Fun30 in replication timing and an effect on nucleosome occupancy around the newly positioned Mcm helicase in sir2 cells.

      Weaknesses

      (1) It is unclear which strains were used in each experiment.

      (2) The relevance of the fun30 phospho-site mutant (S20AS28A) is unclear.

      (3) For some experiments (Figs. 3, 4, 6) it is unclear whether the data are reproducible and the differences significant. Information about the number of independent experiments and quantitation is lacking. This affects the interpretation, as fun30 seems to affect the +3 nucleosome much more than let on in the description.

      We appreciate the reviewer pointing out places in which our manuscript omitted key pieces of information (items 1 and 3), and we will fix these oversights in our revision. 

      With regard to point 2, we had written: 

      “Fun30 is also known to play a role in the DNA damage response; specifically, phosphorylation of Fun30 on S20 and S28 by CDK1 targets Fun30 to sites of DNA damage, where it promotes DNA resection (Chen et al. 2016; Bantele et al. 2017). To determine whether the replication phenotype that we observed might be a consequence of Fun30's role in the DNA damage response, we tested non-phosphorylatable mutants for the ability to suppress early replication of the rDNA in sir2; these mutations had no effect on the replication phenotype (Figure 2B), arguing against a primary role for Fun30 in DNA damage repair that somehow manifests itself in replication.”

      We will expand on this to clarify our point in the revision.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this paper, the authors have performed an antigenic assay for human seasonal N1 neuraminidase using antigens and mouse sera from 2009-2020 (with one avian N1 antigen). This shows two distinct antigen groups. There is poorer reactivity with sera from 2009-2012 against antigens from 2015-2019, and poorer reactivity with sera from 2015-2020 against antigens from 2009-2013. There is a long branch separating these two groups. However, 321 and 432 are the only two positions that are consistently different between the two groups. Therefore these are the most likely cause of these antigenic differences.

      Strengths:

      (1) A sensible rationale was given for the choice of sera, in terms of the genetic diversity.

      (2) There were two independent batches of one of the antigens used for generating sera, which demonstrated the level of heterogeneity in the experimental process.

      (3) Replicate of the Wisconsin/588/2019 antigen (as H1 and H6) is another useful measure of heterogeneity.

      (4) The presentation of the data, e.g. Figure 2, clearly shows two main antigenic groups.

      (5) The most modern sera are more recent than those in other related papers, which demonstrates that there has been no major antigenic change.

      Weaknesses:

      (1) Issues with experimental methods

      As I am not an experimentalist, I cannot comment fully on the experimental methods. However, I note that BALB/c mice sera were used, whereas outbred ferret sera are typically used in influenza antigenic characterisation, so the antigenic difference observed may not be relevant in humans. Similarly, the mice were immunised with an artificial NA immunogen where the typical approach would be to infect the ferret with live virus intra-nasally.

      Indeed, ferrets are the gold standard model for the study of influenza. The main reason for this is the susceptibility of ferrets to infection with primary human influenza virus isolates and their ability to transmit human influenza A and B viruses. Although mouse models often require the use of mouse-adapted influenza virus strains, the mouse is still the most widely used model to study new developments in influenza vaccines.

      In our previous publication we performed a parallel analysis of sera from ferrets that were primed by infection and boosted with recombinant protein, as well as from mice that, as in this study focusing on N1 NA, were prime-boosted with purified recombinant NA proteins in the presence of an adjuvant. Our data indicate that the NAI responses in immune sera from ferrets after infection and after boost enable a similar antigenic classification and correlate strongly with those induced in mice that had been prime-boosted with adjuvanted recombinant NA (Catani et al., eLife 2024). To a large extent, the immunogenicity of an antigen relies on epitope accessibility, which may dictate a universal rule of immunogenicity and antigenicity (Altman et al., 2015).

      (2) Five mice sera were generated per immunogen and then pooled, but data was not presented that demonstrated these sera were sufficiently homogenous that this approach is valid.

      Although individual sera were not tested here, based on previous studies from our group we are confident that a prime-boost schedule with 1 µg of adjuvanted soluble tetrameric NA induces a highly homogeneous response in mice (Catani et al., 2022).

      (3) There were no homologous antigens for most of the sera. This makes the responses difficult to interpret as the homologous titre is often used to assess the overall reactivity of a serum. The sequence of the antigens used is not described, which again makes it difficult to interpret the results.

      The absence of homologous antigens may indeed make interpretation more difficult. However, we have observed that homologous sera do not always coincide with the highest reactivity, although the highest reactivity is always found within an antigenic cluster. A sequence comparison would be appropriate to improve interpretability of the data. Therefore, a sequence alignment and a pairwise comparison will be provided in the revised manuscript as a supplement.

      (4) To be able to untangle the effects of the individual substitutions at 321, 386, and 432, it would have been useful to have included the naturally occurring variants at these positions, or to have generated mutants at these positions. Gao et al clearly show an antigenic difference with ferret sera correlated separately with N386K and I321V/K432E.

      The prevalence of single amino acid substitutions in N1 NA of clinical H1N1 virus strains isolated between 2009 and 2024 is minimal, which may indicate reduced fitness (see Author response image 1) in strains with these substitutions in NA. Nevertheless, we agree that the rescue of single mutants would provide important evidence to untangle those individual impacts on antigenicity. We plan to generate mutants with substitution at these positions in NA of A/Wisconsin/588/2019 H1N1 and determine the NAI against our panel of sera.

      Author response image 1.

      Prevalence of the indicated N1 NA substitutions in all clinical human H1N1 isolates with unique sequences deposited in the GISAID data bank since 2009.

      (5) The challenge experiments in Gao et al showed that NI titre was not a good correlate of protection, so that limits the interpretation of these results.

      On the contrary, the challenge experiments confirmed that drift occurred in NA from H1N1 viruses isolated between 2009 (CA/09) and 2015 (MI/15). The dilution of transferred sera to equal inhibitory titers indicates that the homologous ferret sera (shown in Figure 5e-f; Gao et al., 2019) are still effective in protecting against infection while heterologous sera are not. This result emphasises that the nature of the homologous NAI response is well-suited for protection against a homologous challenge, although mechanistic data were not provided.

      Issues with the computational methods

      (6) The NAI titres were normalised using the ELISA results, and the motivation for this is not explained. It would be nice to see the raw values.

      Mice were immunized with different batches of recombinant protein. Each of those batches may have distinct intrinsic immunogenicity, as observed in Figure 1d. For that reason, NAI values were normalized using homologous ELISA titers induced by each respective NA antigen. A table with the raw values will be included in the revised manuscript.

      (7) It is not clear what value the random forest analysis adds here, given that positions 321 and 432 are the only two that consistently differ between the two groups.

      The substitutions at positions 321 and 432 are indeed the only two consistently differing amino acids among the tested N1s. Although their correlation with antigenic clustering may be obvious after analysis, a random forest analysis makes it possible to reveal less obvious substitutions that contribute to the antigenic diversity. In the future, we intend to expand this methodology to strains that are not currently included in the panel. A random forest model is a relatively simple and well-performing method for dealing with a new dataset.

      (8) As with the previous N2 paper, the metric for antigenic distance (the root mean square of the difference between the titres for two sera) is not one that would be consistent when different sera are included. More usual metrics of distance are Archetti-Horsfall, fold down from homologous, or fold down from maximum.

      The antigenic distances calculated prior to our random forest analysis do use fold-difference as the metric, computed as log2(max(EC50) / EC50). After obtaining the fold-difference values, we calculated a pairwise dissimilarity matrix to obtain the average antigenic distance between pairs of sera. A more detailed description of the methodology will be included in the methods section, including the R code.
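
      As a minimal illustration of the two steps described above (fold-difference, then pairwise dissimilarity), the following Python sketch may help. It is not the authors' R code. It assumes that the maximum EC50 is taken per serum across antigens, and that the distance between two sera is the root mean square difference of their fold-difference profiles, as the reviewer describes the metric. The function names and the titre values are hypothetical.

```python
import math
from itertools import combinations


def fold_differences(ec50_by_antigen):
    """Return log2(max(EC50) / EC50) per antigen for one serum.

    Assumption: the maximum is taken over the antigens tested with this serum.
    """
    top = max(ec50_by_antigen.values())
    return {ag: math.log2(top / value) for ag, value in ec50_by_antigen.items()}


def pairwise_distance(fd_a, fd_b):
    """Root mean square difference over the antigens shared by two sera.

    Assumption: RMS is used as the dissimilarity, following the reviewer's
    description of the metric.
    """
    shared = sorted(set(fd_a) & set(fd_b))
    if not shared:
        raise ValueError("the two sera share no antigens")
    squared = sum((fd_a[ag] - fd_b[ag]) ** 2 for ag in shared)
    return math.sqrt(squared / len(shared))


def dissimilarity_matrix(titres):
    """Build a symmetric {(serum, serum): distance} mapping from raw EC50 values."""
    fd = {serum: fold_differences(values) for serum, values in titres.items()}
    names = sorted(fd)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = pairwise_distance(fd[a], fd[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Hypothetical EC50 values, chosen only to make the arithmetic easy to follow.
example_titres = {
    "serum_A": {"ag1": 800.0, "ag2": 200.0},
    "serum_B": {"ag1": 100.0, "ag2": 400.0},
}
example_dist = dissimilarity_matrix(example_titres)
```

      With these made-up values, the fold-difference profiles are (0, 2) for serum_A and (2, 0) for serum_B, so the distance between the two sera is 2 log2 units. As the reviewer notes, a metric built this way depends on which antigens are in the panel, because the per-serum maximum changes when antigens are added or removed.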

      (9) Antigenic cartography of these data is fraught. I wonder whether 2 dimensions are required for what seems like a 1-dimensional antigenic difference - certainly, the antigens, excluding the H5N1, are in a line. The map may be skewed by the high reactivity Brisbane/18 antigen. It is not clear if the column bases (normalisation factors for calculating antigenic distance) have been adjusted to account for the lack of homologous antigens. It is typical to present antigenic maps with a 1:1 x:y ratio.

      Antigenic cartography will be repeated excluding H5N1 and/or Brisbane/18 antigen. Data will be provided in the final rebuttal letter.

      Issues with interpretation

      (10) Figure 2 shows the NAI titres split into two groups for the antigens, however, A/Brisbane is an outlier in the second antigenic group with high reactivity.

      Indeed, A/Brisbane/02/2018 has overall higher IC50 values. However, it still falls into the same cluster that we called AG2. Highlighting A/Brisbane/02/2018 may lead to the misinterpretation of a non-existent antigenic group. 

      (11) Following Gao et al, I think you can claim that it is more likely that the antigenic change is due to K432E than I321V, based on a comparison of the amino acid change.

      Indeed, we would expect that substitution of the basic lysine by an acidic glutamate is more likely to impact antigenicity than the apolar isoleucine-to-valine substitution. Testing mutant reassortants with single mutations may provide the definitive answer to that question.

      Appraisal:

      Taking into account the limitations of the experimental techniques (which I appreciate are due to resource constraints), this paper meets its aim of measuring the antigenic relationships between 2009-2020 seasonal N1s, showing that there were two main groups. The authors discovered that the difference between the two antigenic groups was likely attributable to positions 321 and 432, as these were the only two positions that were consistently different between the two groups. They came to this finding by using a random forest model, but other simpler methods could have been used.

      Impact:

      This paper contributes to the growing literature on the potential benefit of NA in the influenza vaccine.

      Reviewer #2 (Public review):

      Summary:

      In this study, Catani et al. have immunized mice with 17 recombinant N1 neuraminidases (NAs) from human isolates circulating between 2009-2020 to investigate antigenic diversity. NA inhibition (NAI) titers revealed two groups that were antigenically and phylogenetically distinct. Machine learning was used to estimate the antigenic distances between the N1 NAs and mutations at residues K432E and I321V were identified as key determinants of N1 NA antigenicity.

      Strengths:

      Observation of mutations associated with N1 antigenic drift.

      Weaknesses:

      Validation that K432E and I321V are responsible for antigenic drift was not determined in a background strain with native K432 and I321 or the restitution of antibody binding by reversion to K432 and I321 in strains that evaded sera.

      Reassortant A/Wisconsin/588/2019 with E432K, V321I and also K386N single mutations will be rescued and tested against the panel of sera.

    1. Author response:

      eLife Assessment

      This valuable study presents a theoretical model of how punctuated mutations influence multistep adaptation, supported by empirical evidence from some TCGA cancer cohorts. This solid model is noteworthy for cancer researchers as it points to the case for possible punctuated evolution rather than gradual genomic change. However, the parametrization and systematic evaluation of the theoretical framework in the context of tumor evolution remain incomplete, and alternative explanations for the empirical observations are still plausible.

      We thank the editor and the reviewers for their thorough engagement with our work. The reviewers’ comments have drawn our attention to several important points that we have addressed in the updated version. We believe that these modifications have substantially improved our paper.

      There were two major themes in the reviewers’ suggestions for improvement. The first was that we should demonstrate more concretely how the results in the theoretical/stylized modelling parts of our paper quantitatively relate to dynamics in cancer.

      To this end, we have now included a comprehensive quantification of the effect sizes of our results across large and biologically-relevant parameter ranges. Specifically, following reviewer 1’s suggestion to give more prominence to the branching process, we have added two figures (Fig S3-S4) quantifying the likelihood of multi-step adaptation in a branching process for a large range of mutation rates and birth-death ratios. Formulating our results in terms of birth-death ratios also allowed us to provide better intuition regarding how our results manifest in models with constant population size vs models of growing populations. In particular, the added figure (Fig S3) highlights that the effect size of temporal clustering on the probability of successful 2-step adaptation is very sensitive to the probability that the lineage of the first mutant would go extinct if it did not acquire a second mutation. As a result, the phenomenon we describe is biologically likely to be most effective in those phases during tumor evolution in which tumor growth is constrained. This important pattern had not been described sufficiently clearly in the initial version of our manuscript, and we thank both reviewers for their suggestions to make these improvements.

      The second major theme in the reviewers’ suggestions was focused on how we relate our theoretical findings to readouts in genomic data, with both reviewers pointing to potential alternative explanations for the empirical patterns we describe.

      We have now extended our empirical analyses following some of the reviewers’ suggestions. Specifically, we have included analyses investigating how the contribution of reactive oxygen species (ROS)-related mutation signatures correlates with our proxies for multi-step adaptation; and we have included robustness checks in which we use Spearman instead of Pearson correlations. Moreover, we have included more discussion on potential confounds and the assumptions going into our empirical analyses as well as the challenges in empirically identifying the phenomena we describe.
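
      As an illustration of why such a robustness check is informative, the sketch below contrasts Pearson and Spearman (rank-based) correlations on a monotonic but nonlinear relationship. It is a minimal, self-contained example and not the analysis code used in the manuscript; the variable names `apobec_fraction` and `tsg_score` and the toy values are hypothetical and chosen only for demonstration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values (NOT real data): a monotonic, nonlinear relation.
apobec_fraction = [0.05, 0.10, 0.20, 0.40, 0.80]
tsg_score = [v ** 3 for v in apobec_fraction]

if __name__ == "__main__":
    print("Pearson :", pearson(apobec_fraction, tsg_score))
    print("Spearman:", spearman(apobec_fraction, tsg_score))
```

      Because the Spearman coefficient depends only on rank order, it is unaffected by monotonic transformations of either variable and is less sensitive to a few extreme samples, which is why agreement between the two coefficients is a useful sanity check.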

      Below, we respond in detail to the individual comments made by each reviewer.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Grasper et al. present a combined analysis of the role of temporal mutagenesis in cancer, which includes both theoretical investigation and empirical analysis of point mutations in TCGA cancer patient cohorts. They find that temporally elevated mutation rates contribute to cancer fitness by allowing fast adaptation when the fitness drops (due to previous deleterious mutations). This may be relevant in the case of tumor suppressor genes (TSG), which follow the 2-hit hypothesis (i.e., biallelic 2 mutations are necessary to deactivate TS), and in cases where temporal mutagenesis occurs (e.g., high APOBEC, ROS). They provide evidence that this scenario is likely to occur in patients with some cancer types. This is an interesting and potentially important result that merits the attention of the target audience. Nonetheless, I have some questions (detailed below) regarding the design of the study, the tools and parametrization of the theoretical analysis, and the empirical analysis, which I think, if addressed, would make the paper more solid and the conclusion more substantiated.

      Strengths:

      Combined theoretical investigation with empirical analysis of cancer patients.

      Weaknesses:

      Parametrization and systematic investigation of theoretical tools and their relevance to tumor evolution.

      We sincerely thank Reviewer 1 for their comments. As communicated in more detail in the point-by-point replies to the “Recommendations for the authors”, we have revised the paper to address these comments in various ways. To summarize, Reviewer 1 asked for (1) more comprehensive analyses of the parameter space, especially in ranges of small fitness effects and low mutation rates; (2) additional clarifications on details of mechanisms described in the manuscript; and (3) suggested further robustness checks to our empirical analyses. We have addressed these points as follows: we have added detailed analyses of dynamics and effect sizes for branching processes (see Sections SI2 and SI3 in the Supplementary Information, as well as Figures S3 and S4). As suggested, these additions provide characterizations of effect sizes in biologically relevant parameter ranges (low mutation rates and smaller fitness effect sizes), and extend our descriptions to processes with dynamically changing population sizes. Moreover, we have added further clarifications at suggested points in the manuscript, e.g. to elaborate on the non-monotonicities in Fig 3. Lastly, we have undertaken robustness checks using Spearman rather than Pearson correlation coefficients to quantify relations between TSG deactivation and APOBEC signature contribution, and have performed analyses investigating dynamics of reactive oxygen species-associated mutagenesis instead of APOBEC.

      Reviewer #2 (Public review):

      This work presents theoretical results concerning the effect of punctuated mutation on multistep adaptation and empirical evidence for that effect in cancer. The empirical results seem to agree with the theoretical predictions. However, it is not clear how strong the effect should be on theoretical grounds, and there are other plausible explanations for the empirical observations.

      Thank you very much for these comments. We have now substantially expanded our investigations of the parameter space as outlined in the response to the “eLife Assessment” above and in the detailed comments below (A(1)-A(3)) to convey more quantitative intuition for the magnitude of the effects we describe for different phases of tumor evolution. We agree that there could be potential additional confounders to our empirical investigations besides the challenges regarding quantification that we already described in our initial version of the manuscript. We have thus included further discussion of these in our manuscript (see replies to B(1)-B(3)), and we have expanded our empirical analyses as outlined in the response to the “eLife Assessment”.

      For various reasons, the effect of punctuated mutation may be weaker than suggested by the theoretical and empirical analyses:

      (A1) The effect of punctuated mutation is much stronger when the first mutation of a two-step adaptation is deleterious (Figure 2). For double inactivation of a TSG, the first mutation--inactivation of one copy--would be expected to be neutral or slightly advantageous. The simulations depicted in Figure 4, which are supposed to demonstrate the expected effect for TSGs, assume that the first mutation is quite deleterious. This assumption seems inappropriate for TSGs, and perhaps the other synergistic pairs considered, and exaggerates the expected effects.

      Thank you for highlighting this discrepancy between Figure 2 and Figure 4. For computational efficiency and for illustration purposes, we had opted for high mutation rates and large fitness effects in Figure 2; however, our results are valid even in the setting of lower mutation rates and fitness effects. To improve the connection to Figure 4, and to address other related comments regarding parameter dependencies, we have now added more detailed quantification of the effects we describe (Figures SF3 and SF4) to the revised manuscript. These additions show that the effects illustrated in Figure 2 retain large effect sizes when going to much lower mutation rates and much smaller fitness effects. Indeed, while under high mutation rates we only see the large relative effects if the first mutation is highly deleterious, these large effects become more universal when going to low mutation rates.

      In general, it is correct that the selective disadvantage (or advantage) conveyed by the first mutation affects the likelihood of successful 2-step adaptations. It is also correct that the magnitude of the ‘relative effect’ of temporal clustering on valley-crossing is highest if the lineage with only the first of the two mutations is vanishingly unlikely to produce a second mutant before going extinct. If the first mutation is strongly deleterious, the lineage of such a first mutant is likely to quickly go extinct – and therefore also more likely to do so before producing a second mutant.

      However, this likelihood of producing the second mutant is also low if the mutation rate is low. As our added figure (Figure SF3) illustrates, at low mutation rates appropriate for cancer cells, the relative effect is insensitive to the magnitude of the fitness disadvantage for large parts of the parameter space. Especially in populations of constant size (approximated by a birth/death ratio of 1), the relative effects for first mutations that reduce the birth rate by 0.5 or by 0.05 are indistinguishable (Figure SF3f).

      Moreover, the absolute effect (f<sub>k</sub> - f<sub>1</sub>), as we discuss in the paper (Figures SF2 and SF3), is largest not in regions of the parameter space in which the first mutant is infinitesimally unlikely to produce a second mutant (where f<sub>k</sub> and f<sub>1</sub> would both be infinitesimally small), but rather in parameter regions in which this first mutant has a non-negligible chance to produce a second mutant. The absolute effect (f<sub>k</sub> - f<sub>1</sub>) therefore peaks around fitness-neutral first mutations. While the next comment (below) says that our empirical investigations more closely resemble comparisons of relative effects and not absolute effects, we would expect that the observations in our data come preferentially from multi-step adaptations with large absolute effect, since the absolute effect is maximal when both f<sub>k</sub> and f<sub>1</sub> are relatively high.

      In summary, we believe Figure 2, while having exaggerated parameters for very defendable reasons, is not a misleading illustration of the general phenomenon or of its applicability in biological settings, as effect sizes remain large when moving to biologically realistic parameter ranges. To clarify this issue, we have largely rewritten the relevant paragraphs in the results section and have added two additional figures (Figures SF3 and SF4) as well as a section in the SI with detailed discussion (SI2).
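
      To make the role of lineage extinction concrete, the following sketch computes, for a simple linear birth-death branching process, the probability that the lineage founded by a single first mutant produces at least one second mutant before going extinct. This is an illustrative toy calculation under stated assumptions, not the model or code of the manuscript: the function name `second_mutant_probability` is hypothetical, and we assume that each birth event yields a second mutant with probability `mu` and that `birth + death > 0`. Writing q for the probability of extinction without a second mutant, q satisfies birth·(1 − mu)·q² − (birth + death)·q + death = 0, and the relevant solution is the smaller root.

```python
import math


def second_mutant_probability(birth, death, mu):
    """Probability that a single-cell lineage produces a second mutant
    before extinction in a linear birth-death process.

    Assumptions (illustrative only): cells divide at rate `birth` and die at
    rate `death`; each division yields a second mutant with probability `mu`.
    """
    if mu == 0:
        return 0.0  # no mutation, so no second mutant can ever arise
    # Discriminant of birth*(1-mu)*q^2 - (birth+death)*q + death = 0,
    # rewritten as (birth-death)^2 + 4*birth*death*mu to avoid cancellation.
    disc = (birth - death) ** 2 + 4.0 * birth * death * mu
    # Smaller root in the numerically stable form 2c / (B + sqrt(disc)).
    q = 2.0 * death / ((birth + death) + math.sqrt(disc))
    return 1.0 - q


if __name__ == "__main__":
    mu = 1e-6  # hypothetical per-division mutation probability
    for birth in (0.5, 0.95, 1.0, 1.05):
        p = second_mutant_probability(birth, death=1.0, mu=mu)
        print(f"birth/death = {birth:4.2f}  ->  P(second mutant) = {p:.3e}")
```

      In this toy setting, the probability is of order mu for clearly subcritical lineages and of order sqrt(mu) at a birth/death ratio of 1, which illustrates how strongly the fate of the first-mutant lineage depends on whether growth is constrained.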

      (A2) More generally, parameter values affect the magnitude of the effect. The authors note, for example, that the relative effect decreases with mutation rate. They suggest that the absolute effect, which increases, is more important, but the relative effect seems more relevant and is what is assessed empirically.

      Thank you for this comment. As noted in the replies to the above comments, we have now included extensive investigations of how sensitive effect sizes are to different parameter choices. We also apologize for insufficiently clearly communicating how the quantities in Figure 4 relate to the findings of our theoretical models.

      The challenge in relating our results to single-timepoint sequencing data is that we only observe the mutations that a tumor has acquired, but we do not directly observe the mutation rate histories that brought about these mutations. As an alternative readout, we therefore consider (through rough proxies: TSGs and APOBEC signatures) the amount of 2-step adaptations per acquired/retained mutation. While we unfortunately cannot control for the average mutation rate in a sample, we motivate using this “TSG-deactivation score” by the hypothesis that for any given mutation rate, we expect a positive relationship between the amount of temporal clustering and the amount of 2-step adaptations per acquired/retained mutation. This hypothesis follows directly from our theoretical model where it formally translates to the statement that for a fixed μ, f<sub>k</sub> is increasing in k.

      However, while both quantities f<sub>k</sub>/f<sub>1</sub> or f<sub>k</sub> - f<sub>1</sub> from our theoretical model relate to this hypothesis – both are increasing in k –, neither of them maps directly onto the formulation of our empirical hypothesis.

      We have now rewritten the relevant passages of the manuscript to more clearly convey our motivation for constructing our TSG deactivation score in this form (P. 4-6).

      (A3) Routes to inactivation of both copies of a TSG that are not accelerated by punctuation will dilute any effects of punctuation. An example is a single somatic mutation followed by loss of heterozygosity. Such mechanisms are not included in the theoretical analysis nor assessed empirically. If, for example, 90% of double inactivations were the result of such mechanisms with a constant mutation rate, a factor of two effect of punctuated mutagenesis would increase the overall rate by only 10%. Consideration of the rate of apparent inactivation of just one TSG copy and of deletion of both copies would shed some light on the importance of this consideration.
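
      The dilution arithmetic in this comment can be written out explicitly. As a sketch, let p denote the fraction of double inactivations that arise through routes unaffected by punctuation and c the fold increase that punctuation confers on the remaining routes (both symbols are introduced here for illustration only):

```latex
\[
  \frac{\text{overall rate with punctuation}}{\text{overall rate without punctuation}}
  = p \cdot 1 + (1 - p)\, c ,
  \qquad
  p = 0.9,\; c = 2
  \;\Rightarrow\;
  0.9 + 0.1 \times 2 = 1.1 ,
\]
```

      that is, a 10% increase in the overall rate despite a twofold effect on the affected routes.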

      This is a very good point, thank you. In our empirical analyses, the main motivation was to investigate whether we would observe patterns that are qualitatively consistent with our theoretical predictions, i.e. whether we would find positive associations between valley-crossing and temporal clustering. Our aim in the empirical analyses was not to provide a quantitative estimate of how strongly temporally clustered mutation processes affect mutation accumulation in human cancers. We hence restricted attention to only one mutation process that is well characterized to be temporally clustered (APOBEC mutagenesis) and to only one category of (epi)genomic changes (SNPs, in which APOBEC signatures are well characterized). Of course, such an analysis ignores that other mutation processes (e.g. LOH, copy number changes, methylation in promoter regions, etc.) may interact with the mechanisms that we consider in deactivating tumor suppressor genes.

      We have now updated the text to include further discussion of this limitation and further elaboration to convey that our empirical analyses are not intended as a complete quantification of the effect of temporal clustering on mutagenesis in vivo (P. 10-11).

      Several factors besides the effects of punctuated mutation might explain or contribute to the empirical observations:

      (B1) High APOBEC3 activity can select for inactivation of TSGs (references in Butler and Banday 2023, PMID 36978147). This selective force is another plausible explanation for the empirical observations.

      Thank you for making this point. We agree that increased APOBEC3 activity, or any other similar perturbation, can change the fitness effect that any further changes/perturbations to the cell would bring about. Our empirical analyses therefore rely on the assumption that there are no major confounding structural differences in selection pressures between tumors with different levels of APOBEC signature contributions. We have expanded our discussion section to elaborate on this potential limitation (P. 10-11).

      While the hypothesis that APOBEC3 activity selects for inactivation of TSGs has been suggested, there remain other explanations. Either way, the ways in which selective pressures have been suggested to change would not interfere relevantly with the effects we describe. The paper cited in the comment argues that “high APOBEC3 activity may generate a selective pressure favoring” TSG mutations as “APOBEC creates a high [mutation] burden, so cells with impaired DNA damage response (DDR) due to tumor suppressor mutations are more likely to avert apoptosis and continue proliferating”. To motivate this reasoning, in the same passage, the authors cite a high prevalence of TP53 mutations across several cancer types with “high burden of APOBEC3-induced mutations”, but also note that “this trend could arise from higher APOBEC3 expression in p53-mutated tumors since p53 may suppress APOBEC3B transcription via p21 and DREAM proteins”.

      Translated to our theoretical framework, this reasoning builds on the idea that APOBEC3 activity increases the selective advantage of mutants with inactivation of both copies of a TSG. In contrast, the mechanism we describe acts by altering the chances of mutants with only one TSG allele inactivated to inactivate the second allele before going extinct. If homozygous inactivation of TSGs generally conveys relatively strong fitness advantages, lineages with homozygous inactivation would already be unlikely to go extinct. Further increasing the fitness advantage of such lineages would thus manifest mostly in a quicker spread of these lineages, rather than in changes in the chance that these lineages survive. In turn, such a change would have limited effect on the “rate” at which such 2-step adaptations occur, but would mostly affect the speed at which they fixate. It would be interesting to investigate these effects empirically by quantifying the speed of proliferation and chance of going extinct for lineages that newly acquired inactivating mutations in TSGs.

      Beyond this explicit mention of selection pressures, the cited paper also discusses high occurrences of mutations in TSGs in relation to APOBEC. These enrichments, however, are not uniquely explained by an APOBEC-driven change in selection pressures. Indeed, our analyses would also predict such enrichments.

      (B2) Without punctuation, the rate of multistep adaptation is expected to rise more than linearly with mutation rate. Thus, if APOBEC signatures are correlated with a high mutation rate due to the action of APOBEC, this alone could explain the correlation with TSG inactivation.

      Thank you for making this point. Indeed, an identifying assumption that we make is that average mutation rates are balanced between samples with a higher vs lower APOBEC signature contribution. We cannot cleanly test this assumption, as we only observe aggregate mutation counts but not mutation rates. However, the fact that we observe an enrichment for APOBEC-associated mutations among the set of TSG-inactivating mutations (see Figure 4F) would be consistent with APOBEC-mutations driving the correlations in Fig 4D, rather than just average mutation rates. We have now added a paragraph to our manuscript to discuss these points (P. 10-11).

      (B3) The nature of mutations caused by APOBEC might explain the results. Notably, one of the two APOBEC mutation signatures, SBS13, is particularly likely to produce nonsense mutations. The authors count both nonsense and missense mutations, but nonsense mutations are more likely to inactivate the gene, and hence to be selected.

      Thank you for making this point. We have included it in our discussion of potential confounders/limitations in the revised manuscript (P. 10-11).

    1. Author response:

      Reviewer 1:

      Summary:

      This paper describes molecular dynamics simulations (MDS) of the dynamics of two T-cell receptors (TCRs) bound to the same major histocompatibility complex molecule loaded with the same peptide (pMHC). The two TCRs (A6 and B7) bind to the pMHC with similar affinity and kinetics, but employ different residue contacts. The main purpose of the study is to quantify via MDS the differences in the inter- and intra-molecular motions of these complexes, with a specific focus on what the authors describe as catch-bond behavior between the TCRs and pMHC, which could explain how T-cells can discriminate between different peptides in the presence of weak separating force.

      Strengths:

      The authors present extensive simulation data that indicates that, in both complexes, the number of high-occupancy interdomain contacts initially increases with applied load, which is generally consistent with the authors’ conclusion that both complexes exhibit catch-bond behavior, although to different extents. In this way, the paper somewhat expands our understanding of peptide discrimination by T-cells.

      The reviewer makes thoughtful assessments of our manuscript. While our manuscript is meant to be a “short” contribution, our significant new finding is that even for TCRs targeting the same pMHC, having similar structures, and leading to similar functional outcomes in conventional assays, their response to applied load can be different. This supports our recent experimental work where TCRs targeting the same pMHC differed in their catch bond characteristics, and importantly, in their response to limiting copy numbers of pMHCs on the antigen-presenting cell (Akitsu et al., Sci. Adv., 2024; cited in our manuscript). Our present manuscript provides the physical basis for how two similar TCRs respond to applied load differently. In the revised manuscript, we will make this point clearer.

      Weaknesses:

      While generally well supported by data, the conclusions would nevertheless benefit from a more concise presentation of information in the figures, as well as from suggesting experimentally testable predictions.

      Following the reviewers’ suggestions, we will update figures and use Figure Supplements to make the main figures more concise and to simplify the overall presentation.

      Regarding testable predictions, one prediction would be that B7 TCR will exhibit weaker catch bond behavior than A6. This is an important prediction because the two TCRs targeting the same pMHC have similar structures and are functionally similar in conventional assays. This prediction can be tested by single-molecule optical tweezers experiments. We also predict the A6 TCR may perform better when the number of pMHC molecules presented are limited, analogous to our recent experiments on different TCRs, Akitsu et al., Sci. Adv. (2024).

      Another testable prediction, concerning the conservation of the basic allostery mechanism, involves the Cβ FG-loop deletion mutant. The FG-loop is located at the hinge region of the β chain, and its deletion severely impairs catch bond formation. These predictions will be mentioned and discussed in the updated manuscript.

      Reviewer 2:

      In this work, Chang-Gonzalez and coworkers follow up on an earlier study on the force-dependence of peptide recognition by a T-cell receptor using all-atom molecular dynamics simulations. In this study, they compare the results of pulling on a TCR-pMHC complex between two different TCRs with the same peptide. A goal of the paper is to determine whether the newly studied B7 TCR has the same load-dependent behavior mechanism shown in the earlier study for A6 TCR. The primary result is that while the unloaded interaction strength is similar, A6 exhibits more force stabilization.

      This is a detailed study, and establishing the difference between these two systems with and without applied force may establish them as a good reference setup for others who want to study mechanobiological processes if the data were made available, and could give additional molecular details for T-Cell-specialists. As written, the paper contains an overwhelming amount of details and it is difficult (for me) to ascertain which parts to focus on and which results point to the overall take-away messages they wish to convey.

      As mentioned above and as the reviewer correctly pointed out, the condensed appearance of this manuscript arose largely because we intended it to be a Research Advances article as a short follow-up study of our previous paper on A6 TCR published in eLife. Most of the analysis scripts for the A6 TCR study are already available on GitHub. We will additionally deposit sample structures and simulation scripts for the B7 TCR. Trajectories will be provided upon request given their large size.

      Regarding the focus issue, it is in part due to the complex nature of the problem, which required simulations under different conditions and multi-faceted analyses. Concisely presenting the complex analyses has also been a challenge in our previous papers on TCR simulations (Hwang et al., PNAS 2020; Chang-Gonzalez et al., eLife, 2024; both are cited in our manuscript). With updated figures and text, we expect that the presentation will be a lot clearer. But even in the present form, the reviewer points out the main take-away message well: “The primary result is that while the unloaded interaction strength is similar, A6 exhibits more force stabilization.”

      Detailed comments:

      (1) In Table 1 - are the values of the extension column the deviation from the average length at zero force (that is what I would term extension) or is it the distance between anchor points (which is what I would assume based on the large values)? If the latter, I suggest changing the heading, and then also reporting the average extension with an asterisk indicating no extensional restraints were applied for B7-0, or just listing 0 load in the load column. Standard deviation in this value can also be reported. If it is an extension as I would define it, then I think B7-0 should indicate extension = 0+/- something.

      The distance between anchor points could also be labeled in Figure 1A.

      “Extension” is the distance between anchor points (blue spheres at the ends of the added strands in Fig. 1A). While its meaning should be clear in the section “Laddered extensions” in MD simulation protocol, at first glance it may lead to confusion. In a strict sense, use of “extension” for the distance is a misnomer, but we have used it in our previous two papers (Hwang et al., PNAS 2020; Chang-Gonzalez et al., eLife, 2024), so we prefer to keep it for consistency. Instead, in the caption of Table 1, we will explain its meaning, and also explicitly label it in Fig. 1A, as the reviewer suggested.

      Please also note that the no-load case B7<sup>0</sup> does not have a particular extension that yields zero load on average. It would in fact be very difficult to find such an extension (distance between two anchor points). To simulate the system without load, we separately built a TCR-pMHC complex without added linkers, and held the distal part of pMHC with weak harmonic restraints (explained in sections “Structure preparation” and “Systems without load”). In this way, no external force is applied to TCR as it moves relative to pMHC. We will clarify this when introducing B7<sup>0</sup> in the Results section.

      (2) As in the previous paper, the authors apply “constant force” by scanning to find a particular bond distance at which a desired force is selected, rather than simply applying a constant force. I find this approach less desirable unless there is experimental evidence suggesting the pMHC and TCR were forced to be a particular distance apart when forces are applied. It is relatively trivial to apply constant forces, so in general, I would suggest this would have been a reasonable comparison. Line 243-245 speculates that there is a difference in catch bonding behavior that could be inferred because lower force occurs at larger extensions, but I do not believe this hypothesis can be fully justified and could be due to other differences in the complex.

      There is indeed experimental evidence that the TCR-pMHC complex operates under constant separation. The spacing between a T-cell and an antigen-presenting cell is maintained by adhesion molecules such as the CD2-CD58 pair, as explained in our paper on the A6 TCR (Chang-Gonzalez et al., eLife, 2024; please see the bottom paragraph on page 4 of the paper). In in vitro single-molecule experiments, pulling to a fixed separation and holding is also commonly done. A detailed comparison between constant-extension and constant-force simulations is definitely a subject of our future study. We will clarify these points when explaining the constant extension (or separation).

      Regarding line 243–245, we agree with the reviewer that without further tests, lower forces at larger extensions per se cannot be an indicator that B7 forms a weaker catch bond. But with additional insight, it does have an indirect relevance. In addition to fewer TCR-pMHC contacts (Fig. 1C of our manuscript), the intra-TCR contacts are also reduced compared to those of A6 (Fig. 1D vs. Chang-Gonzalez et al., eLife, 2024, Fig. 8A,B, first column; reproduced in the figure in our response to reviewer 3 below). This shows that the B7 TCR forms a looser complex with pMHC compared to A6. With its higher compliance, the B7 TCR-pMHC complex needs to be under a greater extension than A6 to apply comparable levels of force, and it would be more difficult to achieve load-induced stabilization of the TCR-pMHC interface, hence a weaker catch bond. We will add this point when explaining the weaker catch bond behavior of B7.

      (3) On a related note, the authors do not refer to or consider other works using MD to study force-stabilized interactions (e.g. for catch bonding systems), e.g. these cases where constant force is applied and enhanced sampling techniques are used to assess the impact of that applied force: https://www.cell.com/biophysj/fulltext/S0006-3495(23)00341-7, https://www.biorxiv.org/content/10.1101/2024.10.10.617580v1. I was also surprised not to see this paper on catch bonding in pMHC-TCR referred to, which also includes some MD simulations: https://www.nature.com/articles/s41467-023-38267-1

      We thank the reviewer for bringing the three papers to our attention, which are:

      (1) Languin-Cattoën, Sterpone, and Stirnemann, Biophys. J. 122:2744 (2023): About bacterial adhesion protein FimH.

      (2) Peña Ccoa, et al., bioRxiv (2024): About actin binding protein vinculin.

      (3) Choi et al., Nat. Comm. 14:2616 (2023): About a mathematical model of the TCR catch bond.

      Catch bond mechanisms of FimH and vinculin are different from that of TCR in that FimH and vinculin have relatively well-defined weak- and strong-binding states for which there are corresponding crystal structures. Availability of the end-state structures enables simulation approaches such as enhanced sampling of individual states and studying the transition between the two states. In contrast, TCR does not have any structurally well-defined weak- or strong-binding states, which requires a different approach. As demonstrated in our current manuscript as well as in our previous two papers (Hwang et al., PNAS 2020; Chang-Gonzalez et al., eLife, 2024), our microsecond-long simulations of the complex under realistic pN-level loads and a combination of analysis methods are effective for elucidating the catch bond mechanism of TCR. In the revised manuscript, we will cite the two papers to compare the TCR catch bond mechanism with those of FimH and vinculin, which will offer a broader perspective.

      The third paper (Choi, 2023) proposes a mathematical model to analyze extensive sets of data, and also performs new experiments and additional simulations. Of note, their model assumptions are based mainly on the steered MD (SMD) simulation in their previous paper (Wu, et al., Mol. Cell. 73:1015, 2019). In their model, formation of a catch bond (called catch-slip bond in Choi’s paper) requires partial unfolding of MHC and tilting of the TCR-pMHC interface. While further studies are needed to find whether those changes are indeed required, even so, the question remains regarding how the complex in the fully folded state can bear load and enter such a state in the first place. Our current and previous simulation studies suggest a mechanism by which ligand- and load-dependent responses occur as the first obligatory step of catch bond formation, after which partial unfolding and/or extensive conformational transitions may occur, as described in our recent paper (Akitsu et al., Sci. Adv., 2024). In the revised manuscript, we will cite Wu’s paper and briefly explain the above.

      (4) The authors should make at least the input files for their system available in a public place (github, zenodo) so that the systems are a more useful reference system as mentioned above. The authors do not have a data availability statement, which I believe is required.

      As mentioned above, we will make sample input files and coordinates available on GitHub. A data availability statement will be added.

      Reviewer 3:

      Summary:

      The paper by Chang-Gonzalez et al. is a molecular dynamics (MD) simulation study of the dynamic recognition (load-induced catch bond) by the T cell receptor (TCR) of the complex of peptide antigen (p) and the major histocompatibility complex (pMHC) protein. The methods and simulation protocols are essentially identical to those employed in a previous study by the same group (Chang-Gonzalez et al., eLife 2024). In the current manuscript, the authors compare the binding of the same pMHC to two different TCRs, B7 and A6 which was investigated in the previous paper. While the binding is more stable for both TCRs under load (of about 10-15 pN) than in the absence of load, the main difference is that, with the current MD sampling, B7 shows a smaller amount of stable contacts with the pMHC than A6.

      Strengths:

      The topic is interesting because of the (potential) relevance of mechanosensing in biological processes including cellular immunology.

      Weaknesses:

      The study is incomplete because the claims are based on a single 1000-ns simulation at each value of the load and thus some of the results might be marred by insufficient sampling, i.e., statistical error. After the first 600 ns, the higher load of B7<sup>high</sup> than B7<sup>low</sup> is due mainly to the simulation segment from about 900 ns to 1000 ns (Figure 1D). Thus, the difference in the average value of the load is within their standard deviation (9 +/- 4 pN for B7<sup>low</sup> and 14.5 +/- 7.2 for B7<sup>high</sup>, Table 1). Even more strikingly, Figure 3E shows a lack of convergence in the time series of the distance between the V-module and pMHC, particularly for B7<sup>0</sup> (left panel, yellow) and B7<sup>low</sup> (right panel, orange). More and longer simulations are required to obtain a statistically relevant sampling of the relative position and orientation of the V-module and pMHC.

      The reviewer uses data points during the last 100 ns to raise an issue with sampling. But since we are using realistic pN range forces, force fluctuates more slowly. In fact, in our simulation of B7<sup>high</sup>, while the force peaks near 35 pN at 500 ns (Fig. 1D of our manuscript; reproduced as panels C and D below), the contact heat map shows no noticeable changes around 500 ns (Fig. 2C of our manuscript). Thus, a wider time window must be considered rather than focusing on instantaneous force.

      We believe the reviewer’s concern about sampling arose also due to a lack of clear explanation. Author response image 1 below contains panels from our earlier eLife paper on the A6 TCR. Panels A and B are from Fig. 8 of the A6 paper, and panels C and D are from Fig. 1D of our present manuscript. The high-load simulations in both cases (outlined circles) fluctuate widely in force so that one might argue that sampling was insufficient. However, unless one is interested in finding the precise value of force for a given extension, sampling in our simulations was reasonable enough to distinguish between high- and low-force behaviors. To support this, we show panel E below, which is from Appendix 3–Fig. 1 of our A6 paper. Added to this panel are the average forces and standard deviations of B7<sup>low</sup> and B7<sup>high</sup> from Table 1 of our manuscript (red squares). Please note that all of the data were measured after 500 ns. Except for Y8A<sup>low</sup> and dFG<sup>low</sup> of A6 (explained below), all of the data points lie on nearly a straight line.

      Author response image 1.

      Thermodynamically, the force and position of the restraint (blue spheres in Fig. 1A of our manuscript) form a pair of generalized force and the corresponding spatial variable in equilibrium at temperature 300 K, which is akin to the pressure P and volume V of an ideal gas. If V is fixed, P fluctuates. Denoting the average and std of pressure as ⟨P⟩ and ∆P, respectively, Burgess showed that ∆P/⟨P⟩ is a constant (Eq. 5 of Burgess, Phys. Lett. A, 44:37; 1973). In the case of the TCRαβ-pMHC system, although individual atoms are not ideal gases, since their motion leads to the fluctuation in force on the restraints, the situation is analogous to the case where pressure arises from individual ideal gas molecules hitting the confining wall as the restraint. Thus, the near-linear behavior in panel E above is a consequence of the system being many-bodied and at constant temperature. The linearity is also an indirect indicator that sampling of force was reasonable. The fact that A6 and B7 data show a common linear profile further demonstrates the consistency in our force measurement. That said, the B7 data points (red in panel E) are elevated slightly above nearby A6 data points. This is consistent with B7 forming an overall weaker complex, both at the TCR-pMHC interface (panels A vs. C) and within intra-TCR interfaces (panels B vs. D), which can be seen by the wider ranges of color bars in panels A and B for A6 compared to panels C and D for B7.
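
      As an illustration only (not part of the authors' analysis), the ideal-gas analogy implies that the ratio of the standard deviation of force to its mean should be roughly the same across simulations. The minimal sketch below computes that ratio for the two B7 values quoted from Table 1 in the reviewer's comment above; the function and variable names are hypothetical, and two points alone cannot establish the linear trend shown in panel E.

```python
# Mean force and standard deviation (pN) for the two B7 simulations,
# as quoted from Table 1 in the reviewer's comment.
forces = {
    "B7low": (9.0, 4.0),
    "B7high": (14.5, 7.2),
}


def relative_fluctuation(mean_pn, std_pn):
    """Return std/mean of the force, the analogue of dP/<P> for pressure."""
    return std_pn / mean_pn


# Hypothetical helper output: one ratio per simulation.
ratios = {name: relative_fluctuation(m, s) for name, (m, s) in forces.items()}

for name, ratio in ratios.items():
    print(f"{name}: dF/<F> = {ratio:.2f}")
```

      If the standard deviation scales linearly with the mean, these ratios should be similar to each other, which is what a near-straight line through the origin in a plot of standard deviation against average force would express.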

      About the two outliers of A6, Y8A<sup>low</sup> is for an antagonist peptide and dFG<sup>low</sup> is the Cβ FG-loop deletion mutant. Interestingly, both cases had reduced numbers of contacts with pMHC, which likely caused a wider conformational motion, hence greater fluctuation in force.

      A similar argument applies to Fig. 3E of our manuscript. If precise values of the V-module to pMHC distance were needed, longer or duplicate simulations would be necessary. However, Fig. 3E as it currently stands clearly shows that B7<sup>high</sup> maintains a more stable interface compared to B7<sup>low</sup>, which is consistent with all other measures we used, such as Fig. 3B (Hamming distance), Fig. 3C (buried surface area), and Fig. 4A–E (Vα-Vβ motion and CDR3 distance). They are also consistent with our simulations of A6.

      Thus, rather than relying on peculiarities of individual trajectories, we analyze data in multiple ways and draw conclusions based on features that are consistent across different simulations. Please also note that reviewer 1 mentioned that our conclusions are “generally well supported by data.”

      We will update our manuscript to concisely explain the above and also will add Panel E above as a supplement of Fig. 1.

      It is not clear why “a 10 Å distance restraint between alphaT218 and betaA259 was applied” (section MD simulation protocol, page 9).

      αT218 and βA259 are the residues attached to a leucine-zipper handle in in vitro optical trap experiments (Das, et al., PNAS 2015). In T cells, those residues also connect to transmembrane helices. Author response image 2 is a model of N15 TCR used in experiments in Das’ paper, constructed based on PDB 1NFD. Blue spheres represent Cα atoms corresponding to αT218 and βA259 of B7 TCR. Their distance is 6.7 Å. The 10-Å distance restraint in simulation was applied to mimic the presence of the leucine zipper that prevents excessive separation of the added strands. The distance restraint is a flat-bottom harmonic potential which is activated only when the distance between the two atoms exceeds 10 Å, which we did not clarify in our original manuscript. The same restraint was used in our previous studies on JM22 and A6 TCRs.
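
      To make the behavior of such a restraint concrete, here is a minimal sketch of a one-sided flat-bottom harmonic potential. Only the 10 Å threshold comes from the response above; the force constant `k`, its units, the factor of 1/2 in the energy, and the function name are assumptions made for illustration and are not taken from the authors' simulation inputs.

```python
def flat_bottom_restraint(distance, d0=10.0, k=1.0):
    """Energy and force of a one-sided flat-bottom harmonic restraint.

    distance : separation between the two restrained atoms (angstrom)
    d0       : threshold below which the restraint is inactive (angstrom);
               10.0 is the value stated in the response
    k        : force constant (assumed value; units of energy/angstrom^2)

    Returns (energy, force), where a negative force pulls the atoms back
    toward each other.
    """
    excess = max(0.0, distance - d0)  # zero inside the flat region
    energy = 0.5 * k * excess ** 2    # harmonic penalty beyond d0
    force = -k * excess               # restoring force, zero if excess == 0
    return energy, force


# At the modelled separation of 6.7 angstrom the restraint is inactive.
print(flat_bottom_restraint(6.7))
# Beyond 10 angstrom the penalty grows quadratically with the excess distance.
print(flat_bottom_restraint(12.0))
```

      With this form, the restraint contributes nothing while the two Cα atoms stay within 10 Å, so it limits excessive separation without biasing the ordinary motion of the strands.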

      We will add the figure as a supplement of Fig. 1, cite Das’ paper, and also update the description of the distance restraint in the MD simulation protocol section.

      Author response image 2.

    1. Author response:

      Public Reviews:

      We thank the reviewers for their overall positive assessments and constructive feedback.

      Reviewer #1 (Public Review):

      Summary:

      The study explored the biomechanics of kangaroo hopping across both speed and animal size to try and explain the unique and remarkable energetics of kangaroo locomotion.

      Strengths:

      The study brings kangaroo locomotion biomechanics into the 21st century. It is a remarkably difficult project to accomplish. There is excellent attention to detail, supported by clear writing and figures.

      Weaknesses:

      The authors oversell their findings, but the mystery still persists.

      The manuscript lacks a big-picture summary with pointers to how one might resolve the big question.

      General Comments

      This is a very impressive tour de force by an all-star collaborative team of researchers. The study represents a tremendous leap forward (pun intended) in terms of our understanding of kangaroo locomotion. Some might wonder why such an unusual species is of much interest. But, in my opinion, the classic study by Dawson and Taylor in 1973 of kangaroos launched the modern era of running biomechanics/energetics and applies to varying degrees to all animals that use bouncing gaits (running, trotting, galloping and of course hopping). The puzzling metabolic energetics findings of Dawson & Taylor (little if any increase in metabolic power despite increasing forward speed) remain a giant unsolved problem in comparative locomotor biomechanics and energetics. It is our "dark matter problem".

      Thank you for the kind words.

      This study is certainly a hop towards solving the problem. But, the title of the paper overpromises and the authors present little attempt to provide an overview of the remaining big issues.

      We will modify the title to reflect this comment.  

      The study clearly shows that the ankle and to a lesser extent the mtp joint are where the action is. They clearly show in great detail by how much and by what means the ankle joint tendons experience increased stress at faster forward speeds.

      Since these were zoo animals, direct measures were not feasible, but the conclusion that the tendons are storing and returning more elastic energy per hop at faster speeds is solid.

      The conclusion that net muscle work per hop changes little from slow to fast forward speeds is also solid.

      Doing less muscle work can only be good if one is trying to minimize metabolic energy consumption. However, to achieve greater tendon stresses, there must be greater muscle forces. Unless one is willing to reject the premise of the cost of generating force hypothesis, that is an important issue to confront.

      Further, the present data support the Kram & Dawson finding of decreased contact times at faster forward speeds. Kram & Taylor and subsequent applications of (and challenges to) their approach supports the idea that shorter contact times (tc) require recruiting more expensive muscle fibers and hence greater metabolic costs. Therefore, I think that it is incumbent on the present authors to clarify that this study has still not tied up the metabolic energetics across speed problems and placed a bow atop the package.

      Fortunately, I am confident that the impressive collective brain power that comprises this author list can craft a paragraph or two that summarizes these ideas and points out how the group is now uniquely and enviably poised to explore the problem more using a dynamic SIMM model that incorporates muscle energetics (perhaps ala' Umberger et al.). Or perhaps they have other ideas about how they can really solve the problem.

      You have raised important points, thank you for this feedback. We will add a paragraph discussing the limitations of our study and ensure the revised manuscript makes it clear which mysteries remain. We intend to address muscle forces, contact time, and energetics in future work when we have implemented all hindlimb muscles within the musculoskeletal model.  

      I have a few issues with the other half of this study (i.e. animal size effects). I would enjoy reading a new paragraph by these authors in the Discussion that considers the evolutionary origins and implications of such small safety factors. Surely, it would need to be speculative, but that's OK.

      We will integrate this into the discussion.

      Reviewer #2 (Public Review):

      Summary

      This is a fascinating topic that has intrigued scientists for decades. I applaud the authors for trying to tackle this enigma. In this manuscript, the authors primarily measured hopping biomechanics data from kangaroos and performed inverse dynamics.

      While these biomechanical analyses were thorough and impressively incorporated collected anatomical data and an OpenSim model, I'm afraid that they did not satisfactorily address how kangaroos can hop faster and not consume more metabolic energy, unique from other animals.

      Noticeably, the authors did not collect metabolic data nor did they model metabolic rates using their modelling framework. Instead, they performed a somewhat traditional inverse dynamics analysis from multiple animals hopping at a self-selected speed.

      We aimed to provide a joint-level explanation, but we will address the limitations of not modelling the energy consumers themselves (the skeletal muscles) in the revised manuscript. We plan to expand upon muscle level energetics in the future with a more detailed MSK model.

      Within these analyses, the authors largely focused on ankle EMA, discussing its potential importance (because it affects tendon stress, which affects tendon strain energy, which affects muscle mechanics) on the metabolic cost of hopping. However, EMA was roughly estimated (CoP was fixed to the foot, not measured)…

      As noted in our methods, EMA was not calculated from a fixed centre of pressure (CoP). We did fix the medial-lateral position, owing to the fact that both feet contacted the force plate together, but the anteroposterior movement of the CoP was recorded by the force plate and thus allowed to move. We report the movement (or lack of movement) in our results. The anterior-posterior axis is the most relevant to lengthening or shortening the distance of the ‘out-lever’ R, and thereby EMA.

      It is necessary to assume fixed medial-lateral position because a single force trace and CoP is recorded when two feet land on the force plate. The medial-lateral forces on each foot cancel out so there is no overall medial-lateral movement if the forces are symmetrical (e.g. if the kangaroo is hopping in a straight path and one foot is not in front of the other). We only used symmetrical trials so that the anterior-posterior movement of the CoP would be reliable.

      and did not detectably associate with hopping speed (see results).

      Yet, the authors interpret their EMA findings as though it systematically related with speed to explain their theory on how metabolic cost is unique in kangaroos vs. other animals.

      Indeed, the relationship between R and speed (and therefore EMA and speed) was not significant. However, the significant change in ankle height with speed, combined with no systematic change in CoP at midstance, demonstrates that R would get longer at faster speeds. If we consider the nonsignificant relationship between R and speed to indicate that there is no change in R, then these two results conflict. We could not find a flaw in our methods, so instead concluded that the nonsignificant relationship between R and speed may be due to a small change in R being undetectable in our data. Taking both results into account, we think it is more likely that there is a non-detectable change in R, rather than no change in R with speed, but we presented both results for transparency.

      These speed vs. biomechanics relationships were limited by comparisons across different animals hopping at different speeds and could have been strengthened using repeated measures design.

      There is significant variation in speed within individuals, not just between individuals. The preferred speed of kangaroos is 2-4.5 m/s, but most individuals show a wide range within this. Eight of our 16 kangaroos had a maximum speed that was between 1-2 m/s faster than their slowest trial. Repeated measures of these eight individuals comprise 78 out of the 100 trials.

      It would be ideal to collect data across the full range of speeds for all individuals, but it is not feasible in this type of experimental setting. Interference such as chasing is dangerous to kangaroos as they are prone to strong adverse reactions to stress.

      There are also multiple inconsistencies between the authors' theory on how mechanics affect energetics and the cited literature, which leaves me somewhat confused and wanting more clarification and information on how mechanics and energetics relate.

      We will ensure that this is clearer in the revised manuscript.

      My apologies for the less-than-favorable review, I think that this is a neat biomechanics study - but am unsure if it adds much to the literature on the topic of kangaroo hopping energetics in its current form.

      Reviewer #3 (Public Review):

      Summary:

      The goal of this study is to understand how, unlike other mammals, kangaroos are able to increase hopping speed without a concomitant increase in metabolic cost. They use a biomechanical analysis of kangaroo hopping data across a range of speeds to investigate how posture, effective mechanical advantage, and tendon stress vary with speed and mass. The main finding is that a change in posture leads to increasing effective mechanical advantage with speed, which ultimately increases tendon elastic energy storage and returns via greater tendon strain. Thus kangaroos may be able to conserve energy with increasing speed by flexing more, which increases tendon strain.

      Strengths:

      The approach and effort invested into collecting this valuable dataset of kangaroo locomotion is impressive. The dataset alone is a valuable contribution.

      Thank you!

      Weaknesses:

      Despite these strengths, I have concerns regarding the strength of the results and the overall clarity of the paper and methods used (which likely influences how convincingly the main results come across).

      (1) The paper seems to hinge on the finding that EMA decreases with increasing speed and that this contributes significantly to greater tendon strain estimated with increasing speed. It is very difficult to be convinced by this result for a number of reasons:

      • It appears that kangaroos hopped at their preferred speed. Thus the variability observed is across individuals not within. Is this large enough of a range (either within or across subjects) to make conclusions about the effect of speed, without results being susceptible to differences between subjects?

      Apologies, this was not clear in the manuscript. Kangaroos hopping at their preferred speed means that, to comply with ethics and enclosure limitations, we did not chase or startle them into high speeds. Thus we did not record the full range of speeds that kangaroos are capable of (up to 12 m/s), but for the range we did measure (~2-4.5 m/s), there is variation in hopping speed within each individual kangaroo. Out of 16 individuals, eight had a difference of 1-2 m/s between their slowest and fastest trials, and these kangaroos accounted for 78 out of 100 trials. Of the remainder, six individuals had three or fewer trials each, and two individuals had highly repeatable speeds (3 out of 4, and 6 out of 7 trials were within 0.5 m/s). We will ensure this is clear in the revised manuscript.

      In the literature cited, what was the range of speeds measured, and was it within or between subjects?

      For other literature, to our knowledge the highest speed measured is ~9.5 m/s (see supplementary Fig. 1b) and there were multiple measures for several individuals (see methods of Kram & Dawson 1998).

      • Assuming that there is a compelling relationship between EMA and velocity, how reasonable is it to extrapolate to the conclusion that this increases tendon strain and ultimately saves metabolic cost?

      They correlate EMA with tendon strain, but this would still not suggest a causal relationship (incidentally the p-value for the correlation is not reported).

      We will add supporting literature on the relationship between metabolic cost and tendon stress (or strain), to elaborate on why the correlation between EMA and stress is important.

      Tendon strain could be increasing with ground reaction force, independent of EMA.

      Even if there is a correlation between strain and EMA, is it not a mathematical necessity in their model that all else being equal, tendon stress will increase as EMA decreases? I may be missing something, but nonetheless, it would be helpful for the authors to clarify the strength of the evidence supporting their conclusions.

      Yes, GRF also contributes to the increase in tendon stress in the mechanism we propose. We have illustrated this in Fig. 6; however, we will make this clearer in the revised discussion.
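
      The reviewer's point about mathematical necessity can be illustrated with a simplified static lever balance, in which EMA is the ratio of the muscle moment arm r to the ground reaction force moment arm R. The sketch below is not the authors' musculoskeletal model: it ignores inertial and segment-weight terms, and the function name and all numerical values are hypothetical.

```python
def tendon_stress(grf_n, ema, tendon_area_mm2):
    """Tendon stress (MPa) from a static moment balance about the ankle.

    grf_n           : ground reaction force magnitude (N), hypothetical input
    ema             : effective mechanical advantage, r / R (dimensionless)
    tendon_area_mm2 : tendon cross-sectional area (mm^2), hypothetical input
    """
    if ema <= 0 or tendon_area_mm2 <= 0:
        raise ValueError("ema and tendon_area_mm2 must be positive")
    # Moment balance: F_tendon * r = GRF * R, so F_tendon = GRF / EMA.
    tendon_force_n = grf_n / ema
    # N / mm^2 is numerically equal to MPa.
    return tendon_force_n / tendon_area_mm2


# Same GRF and tendon area, lower EMA: stress rises.
print(tendon_stress(1000.0, 0.5, 50.0))  # 40.0
print(tendon_stress(1000.0, 0.4, 50.0))  # 50.0
# Same EMA, higher GRF: stress also rises.
print(tendon_stress(1200.0, 0.5, 50.0))  # 48.0
```

      Under these assumptions, stress depends on both terms, which matches the response above that GRF and EMA each contribute to the increase in tendon stress.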

      • The statistical approach is not well-described. It is not clear what the form of the statistical model used was and whether the analysis treated each trial individually or grouped trials by the kangaroo. There is also no mention of how many trials per kangaroo, or the range of speeds (or masses) tested.

      The methods include the statistical model with the variables that we used, as well as the kangaroo masses (13.7 to 26.6 kg, mean: 20.9 ± 3.4 kg). We will move the range of speeds from the supplementary material to the results or figure captions. We will add information on the number of trials per kangaroo to the methods.

      We did not group the data in the analysis, e.g. by using an average speed per individual for all their trials, or by comparing fast to slow groups (grouping was used only for display purposes in our figures, which we will make clearer in the methods).

      Related to this, there is no mention of how different speeds were obtained. It seems that kangaroos hopped at a self-selected pace, thus it appears that not much variation was observed. I appreciate the difficulty of conducting these experiments in a controlled manner, but this doesn't exempt the authors from providing the details of their approach.

      • Some figures (Figure 2 for example) present means for one of three speeds, yet the speeds are not reported (except in the legend) nor how these bins were determined, nor how many trials or kangaroos fit in each bin. A similar comment applies to the mass categories. It would be more convincing if the authors plotted the main metrics vs. speed to illustrate the significant trends they are reporting.

      Thank you for this comment. The bins are used only for display purposes and not within the analysis. In the revised manuscript, we will ensure this is clear.

      (2) The significance of the effects of mass is not clear. The introduction and abstract suggest that the paper is focused on the effect of speed, yet the effects of mass are reported throughout as well, without a clear understanding of the significance. This weakness is further exaggerated by the fact that the details of the subject masses are not reported.

      Indeed, the primary aim of our study was to explore the influence of speed, given the uncoupling of energy from hopping speed in kangaroos. We included mass to ensure that the effects of speed were not driven by body mass (i.e., that larger kangaroos hopped faster).

      (3) The paper needs to be significantly re-written to better incorporate the methods into the results section. Since the results come before the methods, some of the methods must necessarily be described such that the study can be understood at some level without turning to the dedicated methods section. As written, it is very difficult to understand the basis of the approach, analysis, and metrics without turning to the methods.

      We agree, and in the revised manuscript will incorporate some of the methodological details within the results.

      Author response image 1.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Pradhan et al investigated the potential gustatory mechanisms that allow flies to detect cholesterol. They found that flies are indifferent to low cholesterol and avoid high cholesterol. They further showed that the ionotropic receptors Ir7g, Ir51b, and Ir56d are important for the cholesterol sensitivity in bitter neurons. The figures are clear and the behavior result is interesting. However, I have several major comments, especially on the discrepancy of the expression of these Irs with other lab published results, and the confusing finding that the same receptors (Ir7g, Ir51b) have been implicated in the detection of various seemingly unrelated compounds.

      Strengths:

      The results are very well presented, the figures are clear and well-made, text is easy to follow.

      Weaknesses:

      (1) Regarding the expression of Ir56d. The reported Ir56d expression pattern contradicts multiple previous studies (Brown et al., 2021 eLife, Figure 6a-c; Sanchez-Alcaniz et al., 2017 Nature Communications, Figure 4e-h; Koh et al., 2014 Neuron, Figure 3b). These studies, using three different driver lines, consistently showed Ir56d expression in sweet-sensing neurons and taste peg neurons. Importantly, Sanchez-Alcaniz et al. demonstrated that Ir56d is not expressed in Gr66a-expressing (bitter) neurons. This discrepancy is critical since Ir56d is identified as the key subunit for cholesterol detection in bitter neurons, and misexpression of Ir7g and Ir51b together is insufficient to confer cholesterol sensitivity (Fig.4b,d). Which Ir56d-GAL4 (and Gr66a-I-GFP) line was used in this study? Is there additional evidence (scRNA sequencing, in-situ hybridization, or immunostaining) supporting Ir56d expression in bitter neurons?

      We agree that the expression pattern of Ir56d diverges from two prior reports. The studies by Brown et al. and Koh et al. employed the same Ir56d-GAL4 driver line, which exhibited expression in sweet-sensing gustatory receptor neurons (GRNs) and taste peg neurons, but not bitter GRNs (the Sanchez-Alcaniz et al. paper did not use an Ir56d-GAL4).

      In our study, we used an Ir56d-GAL4 driver line (KDRC:2307) and the Gr66a-I-GFP reporter line (Weiss et al., 2011 Neuron). This is a crucial distinction, as differences in the regulatory regions used to generate different driver lines are well known to underlie differences in expression patterns. Our double-labeling experiments revealed co-expression of Ir56d with Gr66a-positive bitter GRNs specifically within the S6 and S7 sensilla, types previously shown to exhibit strong electrophysiological responses to cholesterol (Figure 2—figure supplement 1F).

      We believe this observation is biologically significant and consistent with our functional data. Specifically, targeted expression of Ir56d in bitter neurons using the Gr33a-GAL4 was sufficient to rescue cholesterol avoidance behavior in Ir56d<sup>1</sup> mutants (Figure 3G). These results demonstrate that Ir56d plays a functional role in bitter GRNs for cholesterol detection. The convergence of genetic, behavioral, and electrophysiological data presented in our study provides compelling support for this previously unappreciated expression pattern and function of Ir56d.

      (2) Ir51b has previously been implicated in detecting nitrogenous waste (Dhakal 2021), lactic acid (Pradhan 2024), and amino acids (Aryal 2022), all by the same lab. Additionally, both Ir7g and Ir51b have been implicated in detecting cantharidin, an insect-secreted compound that flies may or may not encounter in the wild, by the same lab. Is Ir51b proposed to be a specific receptor for these chemically distinct compounds or a general multimodal receptor for aversive stimuli? Unlike other multimodal bitter receptors, the expression level of Ir51b is rather low and it's unclear which subset of GRNs express this receptor. The chemical diversity among nitrogenous waste, amino acids, lactic acid, cantharidin, and cholesterol raises questions about the specificity of these receptors and warrants further investigation and at a minimum discussion in this paper. Given the wide and seemingly unrelated sensitivity of Ir51b and Ir7g to these compounds I'm leaning towards the hypothesis that at least some of these is non-specific and ecologically irrelevant without further supporting evidence from the authors.

      While it is true that IR51b and IR7g are responsive to a range of compounds, these compounds share chemical features such as nitrogen-containing groups, hydrophobicity, or amphipathic structures, suggesting that recognition of these chemicals may be mediated by the same or overlapping domains within the receptor complexes. These features could facilitate binding to a structurally diverse yet chemically related group of aversive ligands.

      In the case of cholesterol, while its sterol ring system is distinct from the other compounds, it shares hydrophobic and amphipathic properties that may enable interaction with these receptors via similar structural motifs. Importantly, our data demonstrate that Ir51b and Ir7g are necessary but not sufficient on their own to confer cholesterol sensitivity, indicating that additional co-factors or receptor subunits are required for full functionality (Figure 4B, D). Furthermore, our dose-response analysis (Figure 3F) shows that Ir7g is particularly important at higher cholesterol concentrations, supporting the idea of graded sensitivity rather than indiscriminate activation. This suggests that these receptors may have evolved to recognize cholesterol and its analogs (e.g., phytosterols such as stigmasterol, yet to be tested), which are naturally found in the fly’s diet (e.g., yeast and plant-derived matter), as ecologically relevant cues signaling microbial contamination, lipid imbalance, or dietary overconsumption.

      We acknowledge the reviewer’s concern regarding the relatively low expression levels of Ir51b and Ir7g. However, we note that low transcript abundance does not necessarily equate to diminished physiological relevance. Finally, we agree that the chemical diversity of ligands associated with Ir51b and Ir7g warrants deeper investigation, particularly through structure-function studies aimed at identifying ligand-binding domains and receptor-ligand interactions at atomic resolution.

      (3) The Benton lab Ir7g-GAL4 reporter shows no expression in adults. Additionally, two independent labellar RNA sequencing studies (Dweck, 2021 eLife; Bontonou et al., 2024 Nature Communications) failed to detect Ir7g expression in the labellum. This contradicts the authors' previous RT-PCR results (Pradhan 2024 Fig. S4, Journal of Hazardous Materials) showing Ir7g expression in the labellum. Additionally the Benton and Carlson lab Ir51b-GAL4 reporters show no expression in adults as well. Please address these inconsistencies.

      With respect to Ir7g, we acknowledge that the Ir7g-GAL4 reporter line from the Benton lab does not exhibit detectable expression in adult labella. Furthermore, two independent transcriptomic studies, Dweck et al., 2021 (eLife) and Bontonou et al., 2024 (Nature Communications), also did not detect Ir7g transcripts in bulk RNA-seq datasets derived from adult labella. However, our previously published RT-PCR data (Pradhan et al., 2024, Journal of Hazardous Materials, Fig. S4) revealed Ir7g expression in labellar tissue, albeit at low levels. Our RT-PCR included an internal control (tubulin) amplified in the same reaction tube, and the Ir7g mutant served as a negative control. Therefore, we stand behind the finding that Ir7g is expressed in the labellum.

      We would like to point out that RT-PCR is more sensitive and better-suited to detect low-abundance transcripts than bulk RNA-seq, which may fail to capture transcripts due to limitations in depth of coverage. Moreover, immunohistochemistry can have limitations in detecting very low expression levels. Costa et al. 2013 (Translational lung cancer research) states that “RNA-Seq technique will not likely replace current RT-PCR methods, but will be complementary depending on the needs and the resources as the results of the RNA-Seq will identify those genes that need to then be examined using RT-PCR methods”.

      Similarly, regarding Ir51b, while the GAL4 reporter lines from the Benton and Carlson labs do not show robust adult expression, our RT-PCR and functional data strongly support a role for Ir51b in labellar bitter GRNs. Specifically, Ir51b<sup>1</sup> mutants display electrophysiological deficits in response to cholesterol (Figure 2A–B), and these defects are rescued by expressing Ir51b in Gr33a-positive bitter neurons (Figure 3G), providing functional validation of the RT-PCR expression.

      (4) The premise that high cholesterol intake is harmful to flies, which makes sensory mechanisms for cholesterol avoidance necessary, is interesting but underdeveloped. Animal sensory systems typically evolve to detect ecologically relevant stimuli with dynamic ranges matching environmental conditions. Given that Drosophila primarily consume fruits and plant matter (which contain minimal cholesterol) rather than animal-derived foods (which contain higher cholesterol), the ecological relevance of cholesterol detection requires more thorough discussion. Furthermore, at high concentrations, chemicals often activate multiple receptors beyond those specifically evolved for their detection. If the cholesterol concentrations used in this study substantially exceed those encountered in the fly's natural diet, the observed responses may represent an epiphenomenon rather than an ecologically and ethologically relevant sensory mechanism. What is the cholesterol content in flies' diet and how does that compare to the concentrations used in this paper?

      Drosophila melanogaster cannot synthesize sterols de novo, and must acquire them from its diet. In natural environments, flies acquire sterols from fermenting fruit, decaying plant matter, and yeast, which contain trace amounts of phytosterols (e.g., stigmasterol, β-sitosterol) and ergosterol. While the exact sterol concentrations in these sources remain uncharacterized, our behavioral assays used concentrations (0.001–0.01% by weight) that align with the low levels expected in such nutrient-limited ecological niches.

      In our study, the cholesterol concentrations tested ranged from 0.001% to 0.1%, thereby spanning both the physiologically relevant and slightly elevated range. Importantly, avoidance behaviors and receptor activation were most prominent at 0.1% cholesterol. While it is true that high chemical concentrations may elicit off-target effects via broad receptor activation, our genetic and electrophysiological data indicate that the observed responses are mediated by specific ionotropic receptors (Ir51b, Ir7g, Ir56d) and not merely generalized chemical stress.

      Ecologically, elevated sterol levels may also signal conditions unsuitable for egg-laying or larval development. For example, high levels of cholesterol or other sterols may occur in substrates colonized by pathogenic microbes, decaying animal tissue, or in cases of abnormal microbial fermentation, which could represent a nutritional or microbial hazard. Cholesterol avoidance may therefore help flies avoid consuming decaying animal tissue. In this context, sensory detection of excessive cholesterol might serve a protective function.

      Reviewer #2 (Public review):

      Summary:

      In Cholesterol Taste Avoidance in Drosophila melanogaster, Pradhan et al. used behavioral and electrophysiological assays to demonstrate that flies can: (1) detect cholesterol through a subset of bitter-sensing gustatory receptor neurons (GRNs) and (2) avoid consuming food with high cholesterol levels. Mechanistically, they identified five members of the IR family as necessary for cholesterol detection in GRNs and for the corresponding avoidance behavior. Ectopic expression experiments further suggested that Ir7g + Ir56d or Ir51b + Ir56d may function as tuning receptors for cholesterol detection, together with the Ir25a and Ir76b co-receptors.

      Strengths:

      The experimental design of this study was logical and straightforward. Leveraging their expertise in the Drosophila taste system, the research team identified the molecular and cellular basis of a previously unrecognized taste category, expanding our understanding of gustation. A key strength of the study was its combination of electrophysiological recordings with behavioral genetic experiments.

      Weaknesses:

      My primary concern with this study is the lack of a systematic survey of the IRs of interest in the labellum GRNs. Consequently, there is no direct evidence linking the expression of putative cholesterol IRs to the B GRNs in the S6 and S7 sensilla.

      Specifically, the authors need to demonstrate that the IR expression pattern explains cholesterol sensitivity in the B GRNs of S6 and S7 sensilla, but not in other sensilla. Instead of providing direct IR expression data for all candidate IRs (as shown for Ir56d in Figure 2-figure supplement 1F), the authors rely on citations from several studies (Lee, Poudel et al. 2018; Dhakal, Sang et al. 2021; Pradhan, Shrestha et al. 2024) to support their claim that Ir7g, Ir25a, Ir51b, and Ir76b are expressed in B GRNs (Lines 192-194). However, none of these studies provide GAL4 expression or in situ hybridization data to substantiate this claim.

      Without a comprehensive IR expression profile for GRNs across all taste sensilla, it is difficult to interpret the ectopic expression results observed in the B GRN of the I9 sensillum or the A GRN of the L-sensillum (Figure 4). It remains equally plausible that other tuning IRs, beyond the co-receptors Ir25a and Ir76b, could interact with the ectopically expressed IRs to confer cholesterol sensitivity, rather than the proposed Ir7g + Ir56d or Ir51b + Ir56d combinations.

      We provide electrophysiological data demonstrating that the S6 and S7 sensilla respond to cholesterol (Figure 1D). This finding is consistent with the hypothesis that these sensilla harbor the complete receptor complexes necessary for cholesterol detection. In our electrophysiological recordings, only those bitter GRNs that co-express Ir56d along with either Ir7g or Ir51b generate action potentials in response to cholesterol. Other S-type sensilla lacking one or more of these subunits remain unresponsive, reinforcing the idea that these components are necessary for receptor function and sensory coding of cholesterol. Moreover, in the cholesterol-insensitive I9 sensillum (based on our mapping results using electrophysiology), co-expression of either Ir7g + Ir56d or Ir51b + Ir56d conferred de novo cholesterol sensitivity (Figure 4B). Importantly, no cholesterol response was observed when any of these IRs was expressed alone or when Ir7g + Ir51b were co-expressed without Ir56d. These findings strongly argue against the possibility that endogenous tuning IRs in I9 sensilla (e.g., Ir25a, Ir76b) are sufficient to generate cholesterol responsiveness.

      Furthermore, based on the literature, Ir25a and Ir76b are endogenously expressed in I- and L-type sensilla. Thus, their presence alone is insufficient for cholesterol responsiveness. These data support the model that cholesterol sensitivity depends on a specific, multi-subunit receptor complex (e.g., Ir7g + Ir25a + Ir56d + Ir76b or Ir51b + Ir25a + Ir56d + Ir76b).

      In conclusion, while we acknowledge that our data do not provide a full anatomical map of IR expression across all sensilla, our results strongly support the idea that cholesterol sensitivity in S6 and S7 sensilla arises from specific combinations of IRs expressed in the B GRNs.

      Reviewer #3 (Public review):

      Summary:

      Whether and how animals can taste cholesterol is not well understood. The study provides evidence that 1) cholesterol activates a subset of bitter-sensing gustatory receptor neurons (GRNs) in the fly labellum, but not other types of GRNs, 2) flies show aversion to high concentrations of cholesterol, and this is mediated by bitter GRNs, and 3) cholesterol avoidance depends on a specific set of ionotropic receptor (IR) subunits acting in bitter GRNs. The claims of the study are supported by electrophysiological recordings, genetic manipulations, and behavioral readouts.

      Strengths:

      Cholesterol taste has not been well studied, and the paper provides new insight into this question. The authors took a comprehensive and rigorous approach in several different parts of the paper, including screening the responses of all 31 labellar sensilla, screening a large panel of receptor mutants, and performing misexpression experiments with nearly every combination of the 5 IRs identified. The effects of the genetic manipulations are very clear and the results of electrophysiological and behavioral studies match nicely, for the most part. The appropriate controls are performed for all genetic manipulations.

      Weaknesses:

      The weaknesses of the study, described below, are relatively minor and do not detract from the main conclusions of the paper.

      (1) The paper does not state what concentrations of cholesterol are present in Drosophila's natural food sources. Are the authors testing concentrations that are ethologically relevant?

      Drosophila melanogaster primarily feeds on fermenting fruits and associated microbial communities, especially yeast, which serve as major sources of dietary sterols. These natural food sources are known to contain phytosterols such as stigmasterol and β-sitosterol. One study quantified phytosterols (e.g., stigmasterol, sitosterol) in fruits, reporting concentrations between 1.6 and 32.6 mg/100 g edible portion (~0.0016–0.0326% wet weight) (Han et al 2008). The concentrations we tested fall within this range. Additionally, ergosterol, the principal sterol in yeast and a structural analog of cholesterol, is present at levels of about 0.005% to 0.02% in yeast-rich environments.

      To ensure physiological relevance, we designed our behavioral assays to include a broad concentration range of cholesterol, from 10<sup>-5</sup>% to 10<sup>-1</sup>%. This spans both physiological levels (0.001–0.01%), which are comparable to those found in the natural diet, and supra-physiological levels (e.g., 0.1%), which exceed natural exposure but help define the threshold for aversive behavior.

      Our results demonstrate that flies begin to avoid cholesterol at concentrations of 10<sup>-3</sup>% or more (Figure 3A), which falls within the upper physiological range and may reflect the threshold beyond which cholesterol or related sterols become deleterious. At these higher concentrations, excess sterols may disrupt membrane fluidity, interfere with hormone signaling, or promote microbial overgrowth, all of which could compromise fly health.
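
      The unit conversion behind the quoted fruit phytosterol range can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the authors' analysis; the function name `mg_per_100g_to_percent` is hypothetical and introduced only for illustration.

```python
def mg_per_100g_to_percent(mg_per_100g: float) -> float:
    """Convert a content given in mg per 100 g of food to percent by weight.

    1 g per 100 g is 1%, and 1 mg is 1/1000 g, so the value in percent
    is simply the mg-per-100-g figure divided by 1000.
    """
    return mg_per_100g / 1000.0


if __name__ == "__main__":
    # Lower and upper bounds of the fruit phytosterol range quoted in the response.
    for content in (1.6, 32.6):
        percent = mg_per_100g_to_percent(content)
        print(f"{content} mg/100 g = {percent:.4f}% by weight")
```

      Under this conversion, 1.6 and 32.6 mg/100 g correspond to 0.0016% and 0.0326%, matching the percentages given in the response.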

      (2) The paper does not state or show whether the expression of IR7g, IR51b, and IR56d is confined to bitter GRNs. Bitter-specific expression of at least some of these receptors would be necessary to explain why bitter GRNs but not sugar GRNs (or other GRN types) normally show cholesterol responses.

      We show that Ir56d-GAL4 is co-expressed with Gr66a-GFP in S6/S7 sensilla, indicating that it is expressed in bitter GRNs (Figure 2-figure supplement 1F). In the case of Ir7g and Ir51b, there are no reporters or antibodies available to address expression. However, they have previously been shown to be expressed in bitter GRNs using RT-PCR (Dhakal et al. 2021, Communications Biology; Pradhan et al. 2024, Journal of Hazardous Materials). In addition, we provide functional evidence that bitter GRNs are required for the cholesterol response, since silencing bitter GRNs abolishes cholesterol-induced action potentials (Figure 1E–F). Moreover, we showed that we could rescue the Ir7g<sup>1</sup>, Ir51b<sup>1</sup> and Ir56d<sup>1</sup> mutant phenotypes only when we expressed the cognate transgenes in bitter GRNs using Gr33a-GAL4 (Figure 3G). Thus, while Ir7g/Ir51b are not exclusive to bitter GRNs, their functional role in cholesterol detection is bitter-GRN-specific.

      (3) The authors only investigated the responses of GRNs in the labellum, but GRN responses in the leg may also contribute to the avoidance of cholesterol feeding. Alternatively, leg GRNs might contribute to cholesterol attraction that is unmasked when bitter GRNs are silenced. In support of this possibility, Ahn et al. (2017) showed that Ir56d functions in sugar GRNs of the leg to promote appetitive responses to fatty acids.

      This is an interesting idea. Indeed, when bitter GRNs are hyperpolarized, the flies exhibit a strong attraction to cholesterol. Nevertheless, the cellular basis for cholesterol attraction, and whether it is mediated by GRNs in the legs, will require future investigation.

      (4) The authors might consider using proboscis extension as an additional readout of taste attraction or aversion, which would help them more directly link the labellar GRN responses to a behavioral readout. Using food ingestion as a readout can conflate the contribution of taste with post-ingestive effects, and the regulation of food ingestion also may involve contributions from GRNs on multiple organs, whereas organ-specific contributions can be dissociated using proboscis extension. For example, does presenting cholesterol on the proboscis lead to aversive responses in the proboscis extension assay (e.g., suppression of responses to sugar)? Does this aversion switch to attraction when bitter GRNs are silenced, as with the feeding assay?

      We thank the reviewer for the suggestion regarding the use of the proboscis extension reflex (PER) assay to strengthen the link between labellar GRN activity and behavioral responses to cholesterol.

      Author response image 1.

      Our PER assay results shown above indicate that cholesterol presentation on the labellum or forelegs leads to an aversive response, as evidenced by a significant reduction in proboscis extension when compared to control stimuli (Author response image 1A; 2% sucrose or 2% sucrose with 10<sup>-1</sup>% cholesterol was applied to the labellum or forelegs and the percent PER was recorded. n=6. Data were compared using single-factor ANOVA coupled with Scheffe’s post-hoc test. Statistical significance was assessed relative to the control. Means ± SEMs. **p<0.01). This finding supports the idea that cholesterol is detected by labellar and leg GRNs and elicits behavioral avoidance. In contrast, sucrose stimulation robustly induces proboscis extension, as expected for an appetitive stimulus. We confirmed the defects caused by each Ir mutation by presenting the stimuli to the labellum (Author response image 1B). Together, these PER results provide a more direct behavioral correlate of labellar and leg GRN activation and reinforce our conclusion that cholesterol is sensed as an aversive tastant through the labellar bitter GRNs.

      (5) The authors claim that the cholesterol receptor is composed of IR25a, IR76b, IR56d, and either IR7g or IR51b. While the authors have shown that IR25a and IR76b are each required for cholesterol sensing, they did not show that both are required components of the same receptor complex. If the authors are relying on previous studies to make this assumption, they should state this more clearly. Otherwise, I think further misexpression experiments may be needed where only IR25a or IR76b, but not both, are expressed in GRNs.

      In our study, we relied on prior work demonstrating that Ir25a and Ir76b function as broadly required co-receptors in most IR-dependent chemosensory pathways (Ganguly et al., 2017; Lee et al., 2018). These studies showed that Ir25a and Ir76b are co-expressed in many GRNs across multiple taste modalities. Functional IR complexes often fail to form or signal properly in the absence of these co-receptors. Thus, it is widely accepted in the field that Ir25a and Ir76b function together as a core heteromeric scaffold for diverse IR complexes, akin to co-receptors in other ionotropic glutamate receptor families. We state that while Ir25a and Ir76b are presumed co-receptors in the cholesterol receptor complex based on their conserved roles, their direct physical interaction with Ir7g, Ir51b, and Ir56d remains to be demonstrated.

      In support of this model, we note that in our ectopic expression experiments using I9 sensilla, which endogenously express Ir25a and Ir76b, introduction of either Ir7g + Ir56d or Ir51b + Ir56d was sufficient to confer cholesterol sensitivity (Figure 4B). We obtained a similar result in L6 sensilla (Figure 4D), which also endogenously express Ir25a and Ir76b. These findings imply that both co-receptors are already present in these sensilla and are likely part of the functional complex. However, we agree that we have not directly tested the requirement for both co-receptors in a minimal reconstitution context, such as expressing only Ir25a or Ir76b alongside tuning IRs in an otherwise null background. Such an experiment would indeed provide more direct evidence of their joint requirement in the receptor complex. Future studies, including heterologous expression experiments, will be necessary to define the cholesterol-receptor complexes.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors introduce a computational model that simulates the dendrites of developing neurons in a 2D plane, subject to constraints inspired by known biological mechanisms such as diffusing trophic factors, trafficked resources, and an activity-dependent pruning rule. The resulting arbors are analyzed in terms of their structure, dynamics, and responses to certain manipulations. The authors conclude that 1) their model recapitulates a stereotyped timecourse of neuronal development: outgrowth, overshoot, and pruning 2) Neurons achieve near-optimal wiring lengths, and Such models can be useful to test proposed biological mechanisms- for example, to ask whether a given set of growth rules can explain a given observed phenomenon - as developmental neuroscientists are working to understand the factors that give rise to the intricate structures and functions of the many cell types of our nervous system.

      Overall, my reaction to this work is that this is just one instantiation of many models that the author could have built, given their stated goals. Would other models behave similarly? This question is not well explored, and as a result, claims about interpreting these models and using them to make experimental predictions should be taken warily. I give more detailed and specific comments below.

      We thank the reviewer for the summary of the work. We find that the criticism “that this is one instantiation of many models [we] could have built” can apply to any model. George Box’s maxim, “all models are wrong, but some models are useful”, was the motto that drove our modeling approach. In principle, there are infinitely many possible models. We chose one of the most minimalistic models which implements known biological mechanisms, including activity-independent and -dependent phases of dendritic growth, and constrained parameters based on experimental data. We compare the proposed model to other alternatives in the Discussion section, especially to the models of Hermann Cuntz, which propose very different strategies for growth.

      However, the reviewer is right that within the type of model we chose, we could have more extensively explored the sensitivity to parameters. In the revised manuscript we will investigate the sensitivity of model output to variations of specific parameters, as explained below.

      Point 1.1. Line 109. After reading the rest of the manuscript, I worry about the conclusion voiced here, which implies that the model will extrapolate well to manipulations of all the model components. How were the values of model parameters selected? The text implies that these were selected to be biologically plausible, but many seem far off. The density of potential synapses, for example, seems very low in the simulations compared to the density of axons/boutons in the cortex; what constitutes a potential synapse? The perfect correlation between synapses in the activity groups is flawed, even for synapses belonging to the same presynaptic cell. The density of postsynaptic cells is also orders of magnitude off, etc. Ideally, every claim made about the model's output should be supported by a parameter sensitivity study. The authors performed few explorations of parameter sensitivity and many of the choices made seem ad hoc.

      It is indeed important to clarify how the model parameters were selected. Here we provide a short justification for some of these parameters, which will be included in the revised manuscript.

      1) Potential synapse density: We modelled 1,500 potential synapses in a cortical sheet of 185 × 185 μm. We used 1 pixel per μm to capture approximately 1 μm thick dendrites. Therefore, we started with an initial density of 0.044 potential synapses per μm^2. From Author Response Image 1 (this figure will be included in the revised submission) we can see that at the end of our simulation time ~1,000 potential synapses remain. So in fact, the density of potential synapses is sufficient, since not many potential synapses end up connected. The rapid slowing down of growth in our model is not due to a depletion of potential synaptic partners, as the number of potential synapses remains high. Nonetheless, we will explore this in the revised manuscript.

      2) Stabilized synapse density: Since ~1,000 of the potential synapses in the modeled cortical sheet remain available, ~500 become connected to the dendrites of the 9 somas in the modeled cortical sheet. This means that the density of stable connected synapses is approximately 0.015 synapses per μm^2. This is also the number that is shown in Figure 3b, which is about 60 synapses stabilized per cell. This density is much easier to compare to experimental data, and below we provide some numbers from literature we already cited in the manuscript as well as a recent preprint.

      In the developing cortex:

      • Leighton, Cheyne and Lohmann 2023 https://doi.org/10.1101/2023.03.02.530772 find up to 0.4 synapses per μm in pyramidal neurons in vivo in the developing mouse visual cortex at P8 to P13. This is almost identical to our value of 0.4 synapses per μm.

      • Ultanir et al., 2007 https://doi.org/10.1073/pnas.0704031104 find 0.7 to 1.7 spines per μm in pyramidal neurons in vivo in L2/3 of the developing mouse cortex, at P10 to P20.

      • Glynn et al., 2011 https://doi.org/10.1038/nn.2764 find 0.1 to 0.7 spines per μm^2 in pyramidal neurons in vivo and in vitro in L2/3 of the developing mouse cortex, at P8 to P60.

      In the developing hippocampus:

      Although these values vary somewhat across experiments, in most cases they are in agreement with our chosen values, especially when taking into account that we are modeling development (rather than adulthood).

      3) Soma/neuron density: Indeed, we did not explicitly mention this number anywhere in the paper. But from the figures we can infer 9 somas growing dendrites on an area of ~34,000 μm^2. Thus, the neuron density would be roughly 300 neurons per mm^2. This number seems a bit low after a short search through the literature. For example, Keller et al., 2018 https://www.frontiersin.org/articles/10.3389/fnana.2018.00083/full report about 90,000 neurons per mm^3, albeit in adulthood.
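
      The densities quoted in points 1) to 3) follow from simple arithmetic on the numbers given above. The sketch below only reproduces that arithmetic; the variable names are hypothetical, and the count of stabilized synapses is taken as the difference between the initial and remaining potential synapses, as described in point 2).

```python
# Inputs taken from the response; all names are illustrative only.
SHEET_SIDE_UM = 185        # side length of the modeled cortical sheet (micrometres)
N_POTENTIAL_START = 1500   # potential synapses at the start of the simulation
N_POTENTIAL_END = 1000     # approximate potential synapses remaining at the end
N_SOMAS = 9                # somas growing dendrites on the sheet

area_um2 = SHEET_SIDE_UM ** 2      # sheet area in square micrometres
area_mm2 = area_um2 / 1e6          # same area in square millimetres

potential_density = N_POTENTIAL_START / area_um2       # potential synapses per um^2
n_stabilized = N_POTENTIAL_START - N_POTENTIAL_END     # synapses that became connected
stabilized_density = n_stabilized / area_um2           # stabilized synapses per um^2
stabilized_per_cell = n_stabilized / N_SOMAS           # stabilized synapses per soma
soma_density = N_SOMAS / area_mm2                      # somas per mm^2

if __name__ == "__main__":
    print(f"sheet area: {area_um2} um^2")
    print(f"potential synapse density: {potential_density:.3f} per um^2")
    print(f"stabilized synapse density: {stabilized_density:.3f} per um^2")
    print(f"stabilized synapses per cell: {stabilized_per_cell:.1f}")
    print(f"soma density: {soma_density:.0f} per mm^2")
```

      With these inputs the sheet area is 34,225 μm^2, the two synapse densities round to 0.044 and 0.015 per μm^2, and the soma density comes out at a few hundred neurons per mm^2, on the order of the 300 quoted above.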

      We are also performing a sensitivity analysis where some of these parameters are varied and will include this in the revised manuscript. In particular:

      (1) We will vary the nature of the input correlations. In the current model, the synapses in each correlated group receive spike trains with a perfect correlation and there are no correlations across the groups. We will reduce the correlations within group and add non-zero correlations across the groups.

      (2) We will vary the density of the neuronal somas. We expect that higher densities of somas will either yield smaller dendritic areas because the different neurons compete more or result in a state where nearby neurons have to complement each other regarding their activity preferences.

      (3) We will introduce dynamics in the potential synapses to model the dynamics of axons. We plan to explore several scenarios. We could introduce a gradual increase in the density of potential synapses and implement a cap on the number of synapses that can be alive at the same time, and vary that cap. We could also introduce a lifetime of each synapse (following for example a lognormal distribution). A potential synapse can disappear if it does not form a stable synapse in its lifetime, in which case it could move to a different location.
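
      One possible reading of scenario (3) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function name, the parameter values, and the choice to keep the number of potential synapses fixed (which acts as the cap) are all assumptions. Each unstabilized potential synapse draws a lognormally distributed lifetime and is moved to a new random location when that lifetime expires; stabilization by a dendrite is not modeled here.

```python
import random


def simulate_turnover(n_sites=100, sheet_side=185.0, n_steps=50,
                      mu=2.0, sigma=0.5, seed=0):
    """Toy turnover of potential synapses with lognormal lifetimes.

    All parameters are illustrative assumptions. The number of sites stays
    constant, so n_sites plays the role of a cap on simultaneously alive
    potential synapses.
    """
    rng = random.Random(seed)

    def new_site():
        # A fresh potential synapse at a uniformly random position,
        # with a lifetime (in time steps) drawn from a lognormal distribution.
        return {
            "x": rng.uniform(0.0, sheet_side),
            "y": rng.uniform(0.0, sheet_side),
            "remaining": rng.lognormvariate(mu, sigma),
        }

    sites = [new_site() for _ in range(n_sites)]
    relocations = 0
    for _ in range(n_steps):
        for i, site in enumerate(sites):
            site["remaining"] -= 1.0
            if site["remaining"] <= 0.0:
                # Lifetime expired without stabilization: relocate the site.
                sites[i] = new_site()
                relocations += 1
    return sites, relocations


if __name__ == "__main__":
    final_sites, n_relocated = simulate_turnover()
    print(f"{len(final_sites)} potential synapses, {n_relocated} relocations")
```

      In a fuller model, sites contacted by a growing dendrite would be exempt from relocation, and varying the cap or the lifetime distribution would correspond to the different scenarios listed above.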

      Point 1.2. Many potentially important phenomena seem to be excluded. I realize that no model can be complete, but the choice of which phenomena to include or exclude from this model could bias studies that make use of it and is worth serious discussion. The development of axons is concurrent with dendrite outgrowth, is highly dynamic, and perhaps better understood mechanistically. In this model, the inputs are essentially static. Growing dendrites acquire and lose growth cones that are associated with rapid extension, but these do not seem to be modeled. Postsynaptic firing does not appear to be modeled, which may be critical to activity-dependent plasticity. For example, changes in firing are a potential explanation for the global changes in dendritic pruning that occur following the outgrowth phase.

      As the reviewer concludes, no model can be complete. In agreement with this, we would like to quote a paragraph from a very nice paper by Larry Abbott (“Theoretical Neuroscience Rising”, Neuron 2008, https://www.sciencedirect.com/science/article/pii/S0896627308008921), which, although published more than 10 years ago, still applies today:

      “Identifying the minimum set of features needed to account for a particular phenomenon and describing these accurately enough to do the job is a key component of model building. Anything more than this minimum set makes the model harder to understand and more difficult to evaluate. The term ‘realistic’ model is a sociological rather than a scientific term. The truly realistic model is as impossible and useless a concept as Borges’ ‘map of the empire that was of the same scale as the empire and that coincided with it point for point’ (Borges, 1975). […] The art of modeling lies in deciding what this subset should be and how it should be described.”

      We have clearly stated in the Introduction (e.g. lines 37-75) which phenomena we include in the model and why. The Discussion also compares our model to others (lines 315-373), pointing out that most models focus on either activity-independent or activity-dependent phases. We include both, combining literature on molecular gradients and growth factors with activity-dependent connectivity refinements instructed by spontaneous activity. We could not think of a more tractable, more minimalist model that would include both activity-independent and activity-dependent aspects. Therefore, we feel that the current manuscript provides sufficient motivation as well as a discussion of the limitations of the current model.

      Regarding including the concurrent development of axons, we agree this is very interesting and currently not addressed in the model. As noted at the bottom of our reply to point 1.1, bullet (3) we are now revising the manuscript to include a simplified form of axonal dynamics by allowing changes in the lifetime and location of potential synapses, which come from axons of presynaptic partners.

      Regarding postsynaptic firing, this is indeed super relevant and an important point to consider. In one of our recent publications (Kirchner and Gjorgjieva, 2021 https://www.nature.com/articles/s41467-021-23557-3), we studied only an activity-dependent model for the organization of synaptic inputs on non-growing dendrites which have a fixed length. There, we considered the effect of postsynaptic firing and demonstrated that it plays an important role in establishing a global organization of synapses on the entire dendritic tree of the neuron, and not just local dendritic branches. For example, we showed that it could lead to the emergence of retinotopic maps, which have been found experimentally (Iacaruso et al., 2017 https://www.nature.com/articles/nature23019). Since we use the same activity-dependent plasticity model in this paper, we expect that somatic firing will have the same effect on establishing synaptic distributions on the entire dendritic tree. We will make a note of this in the Discussion in the revised paper.

      Point 1.3. Line 167. There are many ways to include activity -independent and -dependent components into a model and not every such model shows stability. A key feature seems to be that larger arbors result in reduced growth and/or increased retraction, but this could be achieved in many ways (whether activity dependent or not). It's not clear that this result is due to the combination of activity-dependent and independent components in the model, or conceptually why that should be the case.

      We never argued for model uniqueness. There are always going to be many different models (at different spatial and temporal scales, at different levels of abstraction). We can never study all of them and like any modeling study in systems neuroscience we have chosen one model approach and investigated this approach. We do compare the current model to others in the Discussion. If the reviewers have a specific implementation that we should compare our model to as an alternative, we could try, but not if this means doing a completely separate project.

      Point 1.4. Line 183. The explanation of overshoot in terms of the different timescales of synaptic additions versus activity-dependent retractions was not something I had previously encountered and is an interesting proposal. Have these timescales been measured experimentally? To what extent is this a result of fine-tuning of simulation parameters?

      We found that varying the amount of BDNF controls the timescale of the activity-dependent plasticity (see our Figure 5c). Hence, changing the balance between synaptic additions vs. retractions is already explored in Figure 5e and f. Here we show that the overshoot and retraction do not have to be fine-tuned but may be abolished if there is too much activity-dependent plasticity.

      Regarding the relative timescales of synaptic additions vs. retractions: since the first is mainly due to activity-independent factors, and the second due to activity-dependent plasticity, the question is really about the timescales of the latter two. As we write in the Introduction (lines 60-62), manipulating activity-dependent synaptic transmission has been found to not affect morphology but rather the density and specificity of synaptic connections (Ultanir et al. 2007 https://doi.org/10.1073/pnas.0704031104), supporting the sequential model we have (although we do not impose the sequence, as both activity-independent and activity-dependent mechanisms are always “on”; but note that activity-dependent plasticity can only operate on synapses that have already formed).

      Point 1.5. Line 203. This result seems at odds with results that show only a very weak bias in the tuning distribution of inputs to strongly tuned cortical neurons (e.g. work by Arthur Konnerth's group). This discrepancy should be discussed.

      First, we note that the correlated activity experienced by our modeled synapses (and resulting synaptic organization) does not necessarily correspond to visual orientation, or any stimulus feature, for that matter.

      Nonetheless, this is a very interesting question and there is some variability in what the experimental data show. Many studies have shown that synapses on dendrites are organized into functional synaptic clusters: across brain regions, developmental ages and diverse species from rodent to primate (Kleindienst et al. 2011; Takahashi et al. 2012; Winnubst et al. 2015; Gökçe et al., 2016; Wilson et al. 2016; Iacaruso et al., 2017; Scholl et al., 2017; Niculescu et al. 2018; Kerlin et al. 2019; Ju et al. 2020). Interestingly, some in vivo studies have reported lack of fine-scale synaptic organization (Varga et al. 2011; X. Chen et al. 2011; T.-W. Chen et al. 2013; Jia et al. 2010; Jia et al. 2014), while others reported clustering for different stimulus features in different species. For example, dendritic branches in the ferret visual cortex exhibit local clustering of orientation selectivity but do not exhibit global organization of inputs according to spatial location and receptive field properties (Wilson et al. 2016; Scholl et al., 2017). In contrast, synaptic inputs in mouse visual cortex do not cluster locally by orientation, but only by receptive field overlap, and exhibit a global retinotopic organization along the proximal-distal axis (Iacaruso et al., 2017). We proposed a theoretical framework to reconcile these data: combining activity-dependent plasticity similar to the BDNF-proBDNF model that we used in the current work, and a receptive field model for the different species (Kirchner and Gjorgjieva, 2021 https://www.nature.com/articles/s41467-021-23557-3). We can mention this aspect in the revised manuscript.

      Point 1.6. Line 268. How does the large variability in the size of the simulated arbors relate to the relatively consistent size of arbors of cortical cells of a given cell type? This variability suggests to me that these simulations could be sensitive to small changes in parameters (e.g. to the density or layout of presynapses).

      As noted at the bottom of our reply to point 1.1, bullet (3) we are now revising the manuscript to include changes in the lifetime and location of potential synapses.

      Point 1.7. The modeling of dendrites as two-dimensional will likely limit the usefulness of this model. Many phenomena- such as diffusion, random walks, topological properties, etc - fundamentally differ between two and three dimensions.

      The reviewer is right about there being differences between two and three dimensions. But a simpler model is not a useless model, even if it is not completely realistic. We have ongoing work that extends the current model to 3D, but this is beyond the scope of the current paper. In systems neuroscience, very interesting results have come from making such simplified geometric assumptions about networks. For instance, the one-dimensional ring model has been used to uncover fundamental insights about computations even though it is highly simplified and abstracted.

      Point 1.8. The description of wiring lengths as 'approximately optimal' in this text is problematic. The plotted data show that the wiring lengths are several deviations away from optimal, and the random model is not a valid instantiation of the 2D non-overlapping constraints the authors imposed. A more appropriate null should be considered.

      Our use of the term “optimal” was not in line with previous literature. We wrongly referred to the minimal wiring length as the optimal wiring length, but neurons can optimize their wiring not only by minimizing their dendritic length (e.g. work of Hermann Cuntz). In the revised manuscript, we will replace the term “optimal wiring” with “minimal wiring”. Then we will compare the wiring length in the model with the theoretically minimal wiring length, the random wiring length and the actual data.
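
      As a rough illustration of the comparison described above, the sketch below computes two reference lengths for a set of synapse sites: the length of a Euclidean minimum spanning tree and the length of a randomly connected tree. This is not the code used in the manuscript. The spanning tree is only an assumed stand-in for the "theoretically minimal wiring length" (a tree allowed to add extra branch points could be shorter), and the site layout, the function names and the random null are all hypothetical.

```python
import math
import random

def mst_length(points):
    """Total edge length of the Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # pick the point outside the tree that is cheapest to attach
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total

def random_tree_length(points, rng):
    """Length of a tree in which each point attaches to a randomly chosen earlier point."""
    total = 0.0
    for i in range(1, len(points)):
        j = rng.randrange(i)
        total += math.dist(points[i], points[j])
    return total

rng = random.Random(0)
# Hypothetical layout: soma at the origin plus 50 synapse sites in a unit square
sites = [(0.0, 0.0)] + [(rng.random(), rng.random()) for _ in range(50)]
print(mst_length(sites), random_tree_length(sites, rng))
```

      Any tree that connects all sites is at least as long as the spanning-tree value, so a simulated dendrite can be placed between these two references.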

      Point 1.9. It's not clear to me what the authors are trying to convey by repeatedly labeling this model as 'mechanistic'. The mechanisms implemented in the model are inspired by biological phenomena, but the implementations have little resemblance to the underlying biophysical mechanisms. Overall my impression is that this is a phenomenological model intended to show under what conditions particular patterns are possible. Line 363, describing another model as computational but not mechanistic, was especially unclear to me in this context.

      What we mean by mechanistic is that we implement equations that model specific mechanisms, i.e. we have a set of equations that implement the activity-independent attraction to potential synapses (with parameters such as the density of synapses, their spatial influence, etc) and the activity-dependent refinement of synapses (with parameters such as the ratio of BDNF and proBDNF to induce potentiation vs depression, the activity-dependent conversion of one factor to the other, etc). This is a bottom-up approach where we combine multiple elements together to get to neuronal growth and synaptic organization. This approach is in stark contrast to the so-called top-down or normative approaches where the method would involve defining an objective function (e.g. minimal dendritic length) which depends on a set of parameters and then applying a gradient descent or other mathematical optimization technique to get at the parameters that optimize the objective function. This latter approach we would not call mechanistic because it involves an abstract objective function (who could say what a neuron or a circuit should be trying to optimize) and a mathematical technique for how to optimize the function (we don’t know if neurons can compute gradients of abstract objective functions).

      Hence our model is mechanistic, but it does operate at a particular level of abstraction/simplification. We don’t model individual ion channels, or biophysics of synaptic plasticity (opening and closing of NMDA channels, accumulation of proteins at synapses, protein synthesis). We do, however, provide a biophysical implementation of the plasticity mechanism through the BDNF/proBDNF model, which is more than most models of plasticity achieve, because they typically model a phenomenological STDP or Hebbian rule that just uses activity patterns to potentiate or depress synaptic weights, disregarding how it could be implemented.

      Reviewer #2 (Public Review):

      This work combines a model of two-dimensional dendritic growth with attraction and stabilisation by synaptic activity. The authors find that constraining growth models with competition for synaptic inputs produces artificial dendrites that match some key features of real neurons both over development and in terms of final structure. In particular, incorporating distance-dependent competition between synapses of the same dendrite naturally produces distinct phases of dendritic growth (overshoot, pruning, and stabilisation) that are observed biologically and leads to local synaptic organisation with functional relevance. The approach is elegant and well-explained, but makes some significant modelling assumptions that might impact the biological relevance of the results.

      Strengths:

      The main strength of the work is the general concept of combining morphological models of growth with synaptic plasticity and stabilisation. This is an interesting way to bridge two distinct areas of neuroscience in a manner that leads to findings that could be significant for both. The modelling of both dendritic growth and distance-dependent synaptic competition is carefully done, constrained by reasonable biological mechanisms, and well-described in the text. The paper also links its findings, for example in terms of phases of dendritic growth or final morphological structure, to known data well.

      Weaknesses:

      The major weaknesses of the paper are the simplifying modelling assumptions that are likely to have an impact on the results. These assumptions are not discussed in enough detail in the current version of the paper.

      1) Axonal dynamics.

      A major, and lightly acknowledged, assumption of this paper is that potential synapses, which must come from axons, are fixed in space. This is not realistic for many neural systems, as multiple undifferentiated neurites typically grow from the soma before an axon is specified (Polleux & Snider, 2010). Further, axons are also dynamic structures in early development and, at least in some systems, undergo activity-dependent morphological changes too (O'Leary, 1987; Hall 2000). This paper does not consider the implications of joint pre- and post-synaptic growth and stabilisation.

      We thank the reviewer for the summary of the strengths and weaknesses of the work. While we feel that including a full model of axonal dynamics is beyond the scope of the current manuscript, some aspects of axonal dynamics can be included. In a revised model, we will introduce a gradual increase in the density of potential synapses and implement a cap on the number of synapses that can be alive at the same time, and vary that cap. We plan to also introduce a lifetime of each synapse (following for example a lognormal distribution). A potential synapse can disappear if it does not form a stable synapse in its lifetime, in which case it could move to a different location. See also our reply to reviewer comment 1.1, bullet (3).
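
      To make the planned changes concrete, the following toy sketch shows one way the three ingredients could fit together: a gradual increase in the number of potential synapses, a cap on how many exist at once, and a lognormally distributed lifetime after which a site moves to a new location. It is an illustrative assumption, not the revised model. The function name, the parameter values and the one-site-per-step arrival rule are hypothetical, and the formation of stable synapses (which would exempt a site from relocation) is left out for brevity.

```python
import random

def simulate_pool(steps, cap, arrival_prob, mu, sigma, seed=0):
    """Toy turnover of potential synapses with a cap and lognormal lifetimes.

    Returns the number of live potential synapses after each time step.
    """
    rng = random.Random(seed)
    alive = []      # each entry: [x, y, remaining_lifetime]
    counts = []
    for _ in range(steps):
        # age every potential synapse; one whose lifetime has run out relocates
        for site in alive:
            site[2] -= 1.0
            if site[2] <= 0.0:
                site[0], site[1] = rng.random(), rng.random()
                site[2] = rng.lognormvariate(mu, sigma)
        # gradual increase in density: add at most one new site per step, up to the cap
        if len(alive) < cap and rng.random() < arrival_prob:
            alive.append([rng.random(), rng.random(), rng.lognormvariate(mu, sigma)])
        counts.append(len(alive))
    return counts

counts = simulate_pool(steps=500, cap=40, arrival_prob=0.3, mu=3.0, sigma=0.5)
print(counts[0], counts[-1], max(counts))
```

      Varying `cap` in such a sketch corresponds to the planned exploration of different limits on the number of simultaneously available potential synapses.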

      2) Activity correlations

      On a related note, the synapses in the manuscript display correlated activity, but there is no relationship between the distance between synapses and their correlation. In reality, nearby synapses are far more likely to share the same axon and so display correlated activity. If the input activity is spatially correlated and synaptic plasticity displays distance-dependent competition in the dendrites, there is likely to be a non-trivial interaction between these two features with a major impact on the organisation of synaptic contacts onto each neuron.

      We are exploring the amount of correlation (between and within correlated groups) to include in the revised manuscript (see also our reply to reviewer comment 1.1, bullet (1)).

      However, previous experimental work (Kleindienst et al., 2011 https://doi.org/10.1016/j.neuron.2011.10.015) has provided anatomical and functional analyses indicating that the functional synaptic clustering on dendritic branches is unlikely to be the result of individual axons making more than one synapse (see pg. 1019).

      3) BDNF dynamics

      The models are quite sensitive to the ratio of BDNF to proBDNF (eg Figure 5c). This ratio is also activity-dependent as synaptic activation converts proBDNF into BDNF. The models assume a fixed ratio that is not affected by synaptic activity. There should at least be more justification for this assumption, as there is likely to be a positive feedback relationship between levels of BDNF and synaptic activation.

      The reviewer is correct. We used the BDNF-proBDNF model for synaptic plasticity based on our previous work: Kirchner and Gjorgjieva, 2021 https://www.nature.com/articles/s41467-021-23557-3.

      There, we explored only the emergence of functionally clustered synapses on static dendrites which do not grow. In the Methods section (Parameters and data fitting) we justify the choice of the ratio of BDNF to proBDNF from published experimental work. We also performed sensitivity analysis (Supplementary Fig. 1) and perturbation simulations (Supplementary Fig. 3), which showed that the ratio is crucial in regulating the overall amount of potentiation and depression of synaptic efficacy, and therefore has a strong impact on the emergence and maintenance of synaptic organization. Since we already performed all this analysis, we do not expect there will be any differences in the current model which includes dendritic growth, as the activity-dependent mechanism has such a different timescale.

      A further weakness is in the discussion of how the final morphologies conform to principles of optimal wiring, which is quite imprecise. 'Optimal wiring' in the sense of dendrites and axons (Cajal, 1895; Chklovskii, 2004; Cuntz et al, 2007, Budd et al, 2010) is not usually synonymous with 'shortest wiring' as implied here. Instead, there is assumed to be a balance between minimising total dendritic length and minimising the tree distance (ie Figure 4c here) between synapses and the site of input integration, typically the soma. The level of this balance gives the deviation from the theoretical minimum length as direct paths to synapses typically require longer dendrites. In the model this is generated by the guidance of dendritic growth directly towards the synaptic targets. The interpretation of the deviation in this results section discussing optimal wiring, with hampered diffusion of signalling molecules, does not seem to be correct.

      We agree with this comment. We had wrongly used the term “optimal wiring”, as neurons can optimize their wiring not only by minimizing their dendritic length but also with respect to other factors, as noted by the reviewer. In the revised manuscript we will replace the term “optimal wiring” with “minimal wiring” and discuss these differences to previous work.
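
      The trade-off the reviewer describes, between total dendritic length and the path length from each synapse to the soma, can be illustrated with a small greedy construction in the spirit of the optimal-wiring literature cited in the comment. This is a sketch under stated assumptions and not part of our model: the cost function, the weighting parameter `bf` and the function name are hypothetical choices made for illustration.

```python
import math
import random

def grow_tree(points, bf):
    """Greedily connect points to a tree rooted at points[0] (the soma).

    Each step adds the (unconnected point, tree node) pair with the lowest
    cost = edge length + bf * (path length from the root to the new point).
    Returns (total wiring length, mean path length from the root).
    """
    path = {0: 0.0}                      # root-to-node path length for nodes in the tree
    remaining = set(range(1, len(points)))
    total = 0.0
    while remaining:
        best = None
        for p in remaining:
            for q, path_q in path.items():
                d = math.dist(points[p], points[q])
                cost = d + bf * (path_q + d)
                if best is None or cost < best[0]:
                    best = (cost, p, q, d)
        _, p, q, d = best
        path[p] = path[q] + d
        total += d
        remaining.remove(p)
    mean_path = sum(path.values()) / (len(points) - 1) if len(points) > 1 else 0.0
    return total, mean_path

rng = random.Random(0)
# Hypothetical layout: soma at the origin plus 30 synapse sites in a unit square
sites = [(0.0, 0.0)] + [(rng.random(), rng.random()) for _ in range(30)]
for bf in (0.0, 0.5, 1.0):
    print(bf, grow_tree(sites, bf))
```

      With `bf = 0` the construction reduces to a minimum spanning tree, i.e. "minimal wiring" in the sense used above. Larger values favour more direct paths to the soma at the expense of additional total length.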

      Reviewer #3 (Public Review):

      The authors propose a mechanistic model of how the interplay between activity-independent growth and an activity-dependent synaptic strengthening/weakening model influences the dendrite shape, complexity and distribution of synapses. The authors focus on a model for stellate cells, which have multiple dendrites emerging from a soma. The activity-independent component is provided by a random pool of presynaptic sites that represent potential synapses and that release a diffusible signal that promotes dendritic growth. Then a spontaneous activity pattern with some correlation structure is imposed at those presynaptic sites. The strength of these synapses follows a learning rule previously proposed by the lab: synapses strengthen when there is correlated firing across multiple sites, and synapses weaken if there is uncorrelated firing with the relative strength of these processes controlled by available levels of BDNF/proBDNF. Once a synapse is weakened below a threshold, the dendrite branch at that site retracts and loses its sensitivity to the growth signal.

      The authors run the simulation and map out how dendrites and synapses evolve and stabilize. They show that dendritic trees grow rapidly and then stabilize by balancing growth and retraction (Figure 2). They also show that there is an initial bout of synaptogenesis followed by loss of synapses, reflecting the longer amount of time it takes to weaken a synapse (Figure 3). They analyze how this evolution of dendrites and synapses depends on the correlated firing of synapses (i.e. defined as being in the same "activity group"). They show that in the stabilized phase, synapses that remain connected to a given dendritic branch are likely to be from the same activity group (Figure 4). The authors systematically alter the learning rule by changing the available concentration of BDNF, which alters the relative amount of synaptic strengthening, which in turn affects stabilization, density of synapses and, interestingly, how selective for an activity group one dendrite is (Figure 5). In addition the authors look at how altering the activity-independent factors influences outgrowth (Figure 6). Finally, one of the interesting outcomes is that the resulting dendritic trees represent "optimal wiring" solutions in the sense that dendrites use the shortest distance given the distribution of synapses. They compare this distribution to published data to see how the model compares to what has been observed experimentally.

      There are many strengths to this study. The consequence of adding the activity-dependent contribution to models of synapto- and dendritogenesis is novel. There is some exploration of parameter space with the motivation of keeping the parameters as well as the generated outcomes close to anatomical data of real dendrites. The paper is also scholarly in its comparison of this approach to previous generative models. This work represents an important advance in our understanding of how learning rules can contribute to dendrite morphogenesis.

      We thank the reviewer for the positive evaluation of the work and the suggestions below.

    1. Author response:

      Reviewer #1 (Evidence, reproducibility and clarity):

      The authors have provided a mechanism by which the presence of truncated P53 can inactivate the function of full-length P53 protein. The authors propose that this happens by sequestration of full-length P53 by truncated P53.

      In the study, performed experiments are well described.

      My area of expertise is molecular biology/gene expression, and I have tried to provide suggestions within my area of expertise. The study has been done mainly with an overexpression system, and I have included a few comments which I think can be helpful for understanding the effect of truncated P53 on endogenous wild-type full-length protein. In this reviewer's opinion, performing experiments along these lines will add value to the observations.

      Major comments:

      (1) It is not clear what happens to endogenous wild type full length P53 in the context of mutant/truncated isoforms. Using a P53 antibody which can detect endogenous wild type P53, can the authors check whether endogenous full length P53 protein is also aggregated? It is hard to differentiate if aggregation of full length P53 happens only in overexpression scenario, where much more of both proteins is expressed. In normal physiological condition P53 expression is usually low and tightly controlled, and its expression gets induced in altered cellular conditions such as during DNA damage. So, it is important to understand the physiological relevance of such aggregation, which could be possible if the authors could investigate the effect on endogenous full length P53 following overexpression of mutant isoforms.

      Thank you very much for your insightful comments.

      (1) To address “what happens to endogenous wild-type full-length P53 in the context of mutant/truncated isoforms," we employed a human A549 cell line expressing endogenous wild-type p53 under DNA damage conditions such as an etoposide treatment(1). We chose the A549 cell line since, similar to H1299, it is a lung cancer cell line (www.atcc.org). For comparison, we also transfected the cells with 2 μg of V5-tagged plasmids encoding FLp53 and its isoforms Δ133p53 and Δ160p53. As shown in Author response image 1A, lanes 1 and 2, endogenous p53 expression remained undetectable in A549 cells despite etoposide treatment, which limits our ability to assess the effects of the isoforms on the endogenous wild-type FLp53. We could, however, detect the V5-tagged FLp53 expressed from the plasmid using anti-V5 (rabbit) as well as with anti-DO-1 (mouse) antibody (Author response image 1). The latter detects both endogenous wild-type p53 and the V5-tagged FLp53 since the antibody epitope is within the N-terminus (aa 20-25). This result supports the reviewer’s comment regarding the low level of expression of endogenous p53 that is insufficient for detection in our experiments.

      In summary, in line with the reviewer’s comment that ‘under normal physiological conditions p53 expression is usually low,’ we could not detect p53 with an anti-DO-1 antibody. Thus, we proceeded with V5/FLAG-tagged p53 for detection of the effects of the isoforms on p53 stability and function. We also found that protein expression in H1299 cells was more easily detectable than in A549 cells (Compare Author response image 1A and B). Thus, we decided to continue with the H1299 cells (p53-null), which would serve as a more suitable model system for this study.  

      (2) We agree with the reviewer that ‘It is hard to differentiate if aggregation of full-length p53 happens only in overexpression scenario’. However, it is not impossible to imagine that such aggregation of FLp53 happens under conditions when p53 and its isoforms are over-expressed in the cell. Although the exact physiological context is not known and beyond the scope of the current work, our results indicate that at higher expression, p53 isoforms drive aggregation of FLp53. Given the challenges of detecting endogenous FLp53, we had to rely on the results obtained with plasmid mediated expression of p53 and its isoforms in p53-null cells.

      Author response image 1.

      Comparative analysis of protein expression in A549 and H1299 cells. (A) A549 cells (p53 wild-type) were treated with etoposide to induce endogenous wild-type p53 expression. To assess the effects of FLp53 and its isoforms Δ133p53 and Δ160p53 on endogenous wild-type p53 aggregation, A549 cells were transfected with 2 μg of V5-tagged p53 expression plasmids, with or without etoposide (20 μM for 8 h) treatment. Western blot analysis was done with the anti-V5 (rabbit) to detect V5-tagged proteins and anti-DO-1 (mouse); the latter detects both endogenous wild-type p53 and V5-tagged FLp53. The merged image corresponds to the overlay between the V5 and DO-1 antibody signals. (B) H1299 cells (p53-null) were transfected with 2 μg V5-tagged p53 expression plasmids or the empty vector control pcDNA3.1. Western blot analysis was done with the anti-V5 (mouse) antibody.

      (2) Can the presence of mutant P53 isoforms cause functional impairment of wild-type full-length endogenous P53? That could be tested as well using a ChIP assay similar to the one the authors have performed, but instead of using an antibody against the tagged protein, the authors could check endogenous P53 enrichment at a gene promoter such as P21 following overexpression of mutant isoforms. Introducing a condition such as DNA damage in such an experiment might help, since endogenous P53 is then induced and more prone to bind to P53 targets such as P21.

      Thank you very much for your valuable comments and suggestions. To investigate the potential functional impairment of endogenous wild-type p53 by p53 isoforms, we initially utilized A549 cells (p53 wild-type), aiming to monitor endogenous wild-type p53 expression following DNA damage. However, as mentioned and demonstrated in Author response image 1, endogenous p53 expression was too low to be detected under these conditions, making the ChIP assay for analyzing endogenous p53 activity unfeasible. Thus, we decided to utilize plasmid-based expression of FLp53 and focus on the potential functional impairment induced by the isoforms.

      (3) On similar lines, authors described:

      "To test this hypothesis, we escalated the ratio of FLp53 to isoforms to 1:10. As expected, the activity of all four promoters decreased significantly at this ratio (Figure 4A-D). Notably, Δ160p53 showed a more potent inhibitory effect than Δ133p53 at the 1:5 ratio on all promoters except for the p21 promoter, where their impacts were similar (Figure 4E-H). However, at the 1:10 ratio, Δ133p53 and Δ160p53 had similar effects on all transactivation except for the MDM2 promoter (Figure 4E-H)."

      Again, in this assay the authors used ratios of 1:5 to 1:10 of full-length vs. mutant. How do the authors justify this result in the (more relevant) context where one allele is wild type (functional P53) and the other allele is mutated (truncated, can induce aggregation)? In this case one would expect a 1:1 ratio of full-length vs. mutant protein, unless other regulation is going on which induces expression of mutant isoforms more than wild-type full-length protein. Discussing along these lines might provide more physiological relevance to the observed data.

      Thank you for raising this point regarding the physiological relevance of the ratios used in our study.

      (1) In the revised manuscript (lines 193-195), we added in this direction that “The elevated Δ133p53 protein modulates p53 target genes such as miR‑34a and p21, facilitating cancer development(2, 3). To mimic conditions where isoforms are upregulated relative to FLp53, we increased the ratios to 1:5 and 1:10.” This approach aims to simulate scenarios where isoforms accumulate at higher levels than FLp53, which may be relevant in specific contexts, as also elaborated above.

      (2) Regarding the issue of protein expression, where one allele is wild-type and the other an isoform, this assumption is not valid in most contexts. First, human cells have two copies of the TP53 gene (one from each parent). Second, the TP53 gene has two distinct promoters: the proximal promoter (P1) primarily regulates FLp53 and ∆40p53, whereas the second promoter (P2) regulates ∆133p53 and ∆160p53(4, 5). Additionally, ∆133TP53 is a p53 target gene(6, 7) and the expression of Δ133p53 and FLp53 is dynamic in response to various stimuli. Third, the expression of p53 isoforms is regulated at multiple levels, including transcriptional, post-transcriptional, translational, and post-translational processing(8). Moreover, different degradation mechanisms modify the protein level of p53 isoforms and FLp53(8). These differential regulation mechanisms are regulated by various stimuli, and therefore, the 1:1 ratio of FLp53 to ∆133p53 or ∆160p53 may be valid only under certain physiological conditions. In line with this, varied expression levels of FLp53 and its isoforms, including ∆133p53 and ∆160p53, have been reported in several studies(3, 4, 9, 10).

      (3) In our study, using the pcDNA 3.1 vector under the human cytomegalovirus (CMV) promoter, we observed moderately higher expression levels of ∆133p53 and ∆160p53 relative to FLp53 (Author response image 1B). This overexpression scenario provides a model for studying conditions where isoform accumulation might surpass physiological levels, impacting FLp53 function. By employing elevated ratios of these isoforms to FLp53, we aim to investigate the potential effects of isoform accumulation on FLp53.

      (4) Finally does this altered function of full length P53 (preferably endogenous one) in presence of truncated P53 has any phenotypic consequence on the cells (if authors choose a cell type which is having wild type functional P53). Doing assay such as apoptosis/cell cycle could help us to get this visualization.

      Thank you for your insightful comments. In the experiment with A549 cells (p53 wild-type), endogenous p53 levels were too low to be detected, even after DNA damage induction. The evaluation of the function of endogenous p53 in the presence of isoforms is hindered, as mentioned above. In the revised manuscript, we utilized H1299 cells with overexpressed proteins for apoptosis studies using the Caspase-Glo® 3/7 assay (Figure 7). This has been shown in the Results section (lines 254-269). “The Δ133p53 and Δ160p53 proteins block pro-apoptotic function of FLp53.

      One of the physiological read-outs of FLp53 is its ability to induce apoptotic cell death(11). To investigate the effects of p53 isoforms Δ133p53 and Δ160p53 on FLp53-induced apoptosis, we measured caspase-3 and -7 activities in H1299 cells expressing different p53 isoforms (Figure 7). Caspase activation is a key biochemical event in apoptosis, with the activation of effector caspases (caspase-3 and -7) ultimately leading to apoptosis(12). The caspase-3 and -7 activities induced by FLp53 expression were approximately 2.5 times higher than those of the control vector (Figure 7). Co-expression of FLp53 and the isoforms Δ133p53 or Δ160p53 at a ratio of 1:5 significantly diminished the apoptotic activity of FLp53 (Figure 7). This result aligns well with our reporter gene assay, which demonstrated that elevated expression of Δ133p53 and Δ160p53 impaired the expression of apoptosis-inducing genes BAX and PUMA (Figure 4G and H). Moreover, a reduction in the apoptotic activity of FLp53 was observed irrespective of whether Δ133p53 or Δ160p53 protein was expressed with or without a FLAG tag (Figure 7). This result, therefore, also suggests that the FLAG tag does not affect the apoptotic activity or other physiological functions of FLp53 and its isoforms. Overall, the overexpression of p53 isoforms Δ133p53 and Δ160p53 significantly attenuates FLp53-induced apoptosis, independent of the protein tagging with the FLAG antibody epitope.”

      Referees cross-commenting

      I think the comments from the other reviewers are very much reasonable and logical.

      In particular, all 3 reviewers have indicated that a better way to visualize the aggregation of full-length wild type P53 by truncated P53 would add more value to the observation (looking at endogenous P53, as suggested by reviewer 1; using a fluorescent tag, as suggested by reviewer 2; and the concern about the FLAG tag raised by reviewer 3).

      Thank you for these comments. The endogenous p53 protein was undetectable in A549 cells induced by etoposide (Figure R1A). Therefore, we conducted experiments using FLAG/V5-tagged FLp53. To avoid any potential side effects of the FLAG tag on p53 aggregation, we introduced untagged p53 isoforms in the H1299 cells and performed subcellular fractionation. Our revised results, consistent with our previous findings with FLAG-tagged p53 isoforms, demonstrate that co-expression of untagged isoforms with FLAG-tagged FLp53 significantly induced the aggregation of FLAG-FLp53, while no aggregation was observed when FLAG-tagged FLp53 was expressed alone (Supplementary Figure 6). These results clearly indicate that the FLAG tag itself does not contribute to protein aggregation.

      Additionally, we utilized the A11 antibody to detect protein aggregation, providing additional validation (Figure 8 from Jean-Christophe Bourdon et al. Genes Dev. 2005;19:2122-2137). Given that the fluorescent proteins (~30 kDa) are substantially bigger than the tags used here (~1 kDa) and may influence oligomerization (especially GFP), stability, localization, and function of p53 and its isoforms, we avoided conducting these vital experiments with such artificial large fusions. 

      Reviewer #1 (Significance):

      The work is significant, since it provides more mechanistic insight into how wild-type full-length P53 could be inactivated in the presence of truncated isoforms; this might offer new opportunities to recover P53 function as a treatment strategy against cancer.

      Thank you for your insightful comments. We appreciate your recognition of the significance of our work in providing mechanistic insights into how wild-type FLp53 can be inactivated by truncated isoforms. We agree that these findings have potential for exploring new strategies to restore p53 function as a therapeutic approach against cancer. 

      Reviewer #2 (Evidence, reproducibility and clarity):

      The manuscript by Zhao and colleagues presents a novel and compelling study on the p53 isoforms, Δ133p53 and Δ160p53, which are associated with aggressive cancer types. The main objective of the study was to understand how these isoforms exert a dominant negative effect on full-length p53 (FLp53). The authors discovered that the Δ133p53 and Δ160p53 proteins exhibit impaired binding to p53-regulated promoters. The data suggest that the predominant mechanism driving the dominant-negative effect is the coaggregation of FLp53 with Δ133p53 and Δ160p53.

      This study is innovative, well-executed, and supported by thorough data analysis. However, the authors should address the following points:

      (1) Introduction on Aggregation and Co-aggregation: Given that the focus of the study is on the aggregation and co-aggregation of the isoforms, the introduction should include a dedicated paragraph discussing this issue. There are several original research articles and reviews that could be cited to provide context.

      Thank you very much for the valuable comments. We have added the following paragraph in the revised manuscript (lines 74-82): “Protein aggregation has become a central focus of modern biology research and has documented implications in various diseases, including cancer(13, 14, 15). Protein aggregates can be of different types ranging from amorphous aggregates to highly structured amyloid or fibrillar aggregates, each with different physiological implications. In the case of p53, whether protein aggregation, and in particular, co-aggregation with large N-terminal deletion isoforms, plays a mechanistic role in its inactivation is as yet underexplored. Interestingly, the Δ133p53β isoform has been shown to aggregate in several human cancer cell lines(16). Additionally, the Δ40p53α isoform exhibits a high aggregation tendency in endometrial cancer cells(17). Although no direct evidence exists for Δ160p53 yet, these findings imply that p53 isoform aggregation may play a major role in their mechanisms of action.”

      (2) Antibody Use for Aggregation: To strengthen the evidence for aggregation, the authors should consider using antibodies that specifically bind to aggregates.

      Thank you for your insightful suggestion. We addressed protein aggregation using the A11 antibody, which specifically recognizes amyloid-like protein aggregates. We analyzed insoluble nuclear pellet samples prepared under conditions identical to those described in Figure 6B. To confirm the presence of p53 proteins, we employed the anti-p53 M19 antibody (Santa Cruz, Cat No. sc-1312) to detect bands corresponding to FLp53 and its isoforms Δ133p53 and Δ160p53. Monomeric FLp53 was not detected (Figure 8, lower panel, Jean-Christophe Bourdon et al. Genes Dev. 2005;19:2122-2137), which may be attributed to the lower binding affinity of the anti-p53 M19 antibody for it. These samples were also immunoprecipitated using the A11 antibody (Thermo Fisher Scientific, Cat No. AHB0052) to detect aggregated proteins. Interestingly, FLp53 and its isoforms, Δ133p53 and Δ160p53, were clearly visible with the A11 antibody when co-expressed at a 1:5 ratio, suggesting that they underwent co-aggregation. However, no FLp53 aggregates were observed when it was expressed alone (Author response image 2). These results support the conclusion in our manuscript that Δ133p53 and Δ160p53 drive FLp53 aggregation.

      Author response image 2.

      Induction of FLp53 Aggregation by p53 Isoforms Δ133p53 and Δ160p53. H1299 cells were transfected with the FLAG-tagged FLp53 and V5-tagged Δ133p53 or Δ160p53 at a 1:5 ratio. The cells were subjected to subcellular fractionation, and the resulting insoluble nuclear pellet was resuspended in RIPA buffer. The samples were heated at 95°C until the pellet was completely dissolved, and then analyzed by Western blotting. Immunoprecipitation was performed using the A11 antibody, which specifically recognizes amyloid protein aggregates, and the anti-p53 M19 antibody, which detects FLp53 as well as its isoforms Δ133p53 and Δ160p53.

      (3) Fluorescence Microscopy: Live-cell fluorescence microscopy could be employed to enhance visualization by labeling FLp53 and the isoforms with different fluorescent markers (e.g., EGFP and mCherry tags).

      We appreciate the suggestion to use live-cell fluorescence microscopy with EGFP and mCherry tags for the visualization of FLp53 and its isoforms. While we understand the advantages of live-cell imaging with EGFP/mCherry tags, we refrained from making such fusions because GFP and corresponding protein tags are very large (~30 kDa) relative to the p53 isoform variants (~30 kDa). Other studies have shown that EGFP and mCherry fusions can alter protein oligomerization, solubility and aggregation(18, 19). Moreover, most fluorescent proteins are prone to dimerization (e.g. EGFP) or form obligate tetramers (DsRed)(20, 21, 22), potentially interfering with the oligomerization and aggregation properties of p53 isoforms, particularly Δ133p53 and Δ160p53.

      Instead, we utilized FLAG- or V5-tag-based immunofluorescence microscopy, a well-established and widely accepted method for visualizing p53 proteins. This method provided precise localization and reliable quantitative data, and we believe it is both appropriate and sufficient for addressing the research question.

      Reviewer #2 (Significance):

      The manuscript by Zhao and colleagues presents a novel and compelling study on the p53 isoforms, Δ133p53 and Δ160p53, which are associated with aggressive cancer types. The main objective of the study was to understand how these isoforms exert a dominant negative effect on full-length p53 (FLp53). The authors discovered that the Δ133p53 and Δ160p53 proteins exhibit impaired binding to p53-regulated promoters. The data suggest that the predominant mechanism driving the dominant-negative effect is the coaggregation of FLp53 with Δ133p53 and Δ160p53.

      We sincerely thank the reviewer for the thoughtful and positive comments on our manuscript and for highlighting the significance of our findings on the p53 isoforms, Δ133p53 and Δ160p53. 

      Reviewer #3 (Evidence, reproducibility and clarity):

      In this manuscript entitled "Δ133p53 and Δ160p53 isoforms of the tumor suppressor protein p53 exert dominant-negative effect primarily by coaggregation", the authors suggest that the Δ133p53 and Δ160p53 isoforms have high aggregation propensity and that by co-aggregating with canonical p53 (FLp53), they sequester it away from DNA, thus exerting a dominant-negative effect over it.

      First, the authors should make it clear throughout the manuscript, including the title, that they are investigating Δ133p53α and Δ160p53α since there are 3 Δ133p53 isoforms (α, β, γ), and 3 Δ160p53 isoforms (α, β, γ).

      Thank you for your suggestion. We understand the importance of clearly specifying the isoforms under study. Following your suggestion, we have added α in the title, abstract, and introduction and added the following statement in the Introduction (lines 57-59): “For convenience and simplicity, we have written Δ133p53 and Δ160p53 to represent the α isoforms (Δ133p53α and Δ160p53α) throughout this manuscript.” 

      One concern is that the authors only consider and explore Δ133p53α and Δ160p53α isoforms as exclusively oncogenic and FLp53 dominant-negative while not discussing evidence of different activities. Indeed, other manuscripts have also shown that Δ133p53α is non-oncogenic and non-mutagenic, does not antagonize every FLp53 function, and is sometimes associated with good prognosis. To cite a few examples:

      (1) Hofstetter G. et al. D133p53 is an independent prognostic marker in p53 mutant advanced serous ovarian cancer. Br. J. Cancer 2011, 105, 1593-1599.

      (2) Bischof, K. et al. Influence of p53 Isoform Expression on Survival in High-Grade Serous Ovarian Cancers. Sci. Rep. 2019, 9, 5244.

      (3) Knezović F. et al. The role of p53 isoforms' expression and p53 mutation status in renal cell cancer prognosis. Urol. Oncol. 2019, 37, 578.e1-578.e10.

      (4) Gong, L. et al. p53 isoform D113p53/D133p53 promotes DNA double-strand break repair to protect cell from death and senescence in response to DNA damage. Cell Res. 2015, 25, 351-369.

      (5) Gong, L. et al. p53 isoform D133p53 promotes efficiency of induced pluripotent stem cells and ensures genomic integrity during reprogramming. Sci. Rep. 2016, 6, 37281.

      (6) Horikawa, I. et al. D133p53 represses p53-inducible senescence genes and enhances the generation of human induced pluripotent stem cells. Cell Death Differ. 2017, 24, 1017-1028.

      (7) Gong, L. p53 coordinates with D133p53 isoform to promote cell survival under low-level oxidative stress. J. Mol. Cell Biol. 2016, 8, 88-90.

      Thank you very much for your comment and for highlighting these important studies. 

      We agree that Δ133p53 isoforms exhibit complex biological functions, with both oncogenic and non-oncogenic potential. However, our primary aim here was to reveal the molecular mechanism of the dominant-negative effects exerted by the Δ133p53α and Δ160p53α isoforms on FLp53, for which these isoforms are suitable model systems. Exploring the oncogenic potential of the isoforms is beyond the scope of the current study, and we do not claim to report on it. We have carefully revised the manuscript and replaced the respective terms, e.g. ‘prooncogenic activity’ with ‘dominant-negative effect’, in relevant places (e.g. line 90). We have now also added a paragraph with suitable references that introduces the oncogenic and non-oncogenic roles of the p53 isoforms.

      After reviewing the papers you cited, we are not sure that they reflect on the oncogenic/non-oncogenic role of the Δ133p53α isoform in different cancers. Although our study is not about the oncogenic potential of the isoforms, we have summarized the key findings below:

      (1) Hofstetter et al., 2011: Demonstrated that Δ133p53α expression improved recurrence-free and overall survival in p53-mutant advanced serous ovarian cancer, suggesting a potential protective role in this context.

      (2) Bischof et al., 2019: Found that Δ133p53 mRNA can improve overall survival in high-grade serous ovarian cancers. However, out of 31 patients, only 5 belong to the TP53 wild-type group, while the others carry TP53 mutations.

      (3) Knezović et al., 2019: Reported downregulation of Δ133p53 in renal cell carcinoma tissues with wild-type p53 compared to normal adjacent tissue, indicating a potential non-oncogenic role, but not conclusively demonstrating it.

      (4) Gong et al., 2015: Showed that Δ133p53 antagonizes p53-mediated apoptosis and promotes DNA double-strand break repair by upregulating RAD51, LIG4, and RAD52 independently of FLp53.

      (5) Gong et al., 2016: Demonstrated that overexpression of Δ133p53 promotes the efficiency of cell reprogramming through its anti-apoptotic function and by promoting DNA DSB repair. The authors hypothesize that this mechanism involves increased RAD51 foci formation and decreased γH2AX foci formation and chromosome aberrations in induced pluripotent stem (iPS) cells, independent of FLp53.

      (6) Horikawa et al., 2017: Indicated that induced pluripotent stem cells derived from fibroblasts that overexpress Δ133p53 formed noncancerous tumors in mice compared to induced pluripotent stem cells derived from fibroblasts with complete p53 inhibition. Thus, Δ133p53 overexpression is "non- or less oncogenic and mutagenic" compared to complete p53 inhibition, but it still compromises certain p53-mediated tumor-suppressing pathways. “Overexpressed Δ133p53 prevented FL-p53 from binding to the regulatory regions of p21WAF1 and miR-34a promoters, providing a mechanistic basis for its dominant-negative inhibition of a subset of p53 target genes.”

      (7) Gong, 2016: Suggested that Δ133p53 promotes cell survival under low-level oxidative stress, but its role under different stress conditions remains uncertain.

      We have revised the Introduction to provide a more balanced discussion of Δ133p53’s dual role (lines 62-73):

      “The Δ133p53 isoform exhibits complex biological functions, with both oncogenic and non-oncogenic potentials. Recent studies demonstrate the non-oncogenic yet context-dependent role of the Δ133p53 isoform in cancer development. Δ133p53 expression has been reported to correlate with improved survival in patients with TP53 mutations(23, 24), where it promotes cell survival in a non-oncogenic manner(25, 26), especially under low oxidative stress(27). Alternatively, other recent evidence emphasizes the notable oncogenic functions of Δ133p53, as it can inhibit p53-dependent apoptosis by directly interacting with FLp53(4, 6). The oncogenic function of the newly identified Δ160p53 isoform is less known, although it is associated with p53 mutation-driven tumorigenesis(28) and with melanoma cells’ aggressiveness(10). Whether or not the Δ160p53 isoform also impedes FLp53 function in a similar way as Δ133p53 is an open question. However, these p53 isoforms can certainly compromise p53-mediated tumor suppression by interfering with FLp53 binding to target genes such as p21 and miR-34a(2, 29) through a dominant-negative effect, although the exact mechanism is not known.”

      On the figures presented in this manuscript, I have three major concerns:

      (1) Most results in the manuscript rely on the overexpression of the FLAG-tagged or V5-tagged isoforms. The validation of these constructs entirely depends on Supplementary figure 3, which the authors claim "rules out the possibility that the FLAG epitope might contribute to this aggregation". However, I am not entirely convinced by that conclusion. Indeed, the ratio between the "regular" isoform and the aggregates is much higher in the FLAG-tagged constructs than in the V5-tagged constructs. We can visualize the aggregates easily in the FLAG-tagged experiment, but the imaging clearly had to be overexposed (given the white coloring demonstrating saturation of the main bands) to visualize them in the V5-tagged experiments. Therefore, I am not convinced that an effect of the FLAG-tag can be ruled out and more convincing data should be added.

      Thank you for raising this important concern. We have carefully considered your comments and have made several revisions to clarify and strengthen our conclusions.

      First, to address the potential influence of the FLAG and V5 tags on p53 isoform aggregation, we have revised Figure 2 and removed the previous Supplementary Figure 3, where non-specific antibody bindings and higher molecular weight aggregates were not clearly interpretable. In the revised Figure 2, we have removed these potential aggregates, improving the clarity and accuracy of the data.

      To further rule out any tag-related artifacts, we conducted a co-immunoprecipitation assay with FLAG-tagged FLp53 and untagged Δ133p53 and Δ160p53 isoforms. The results (now shown in the new Supplementary Figure 3) completely agree with our previous result with FLAG-tagged and V5-tagged Δ133p53 and Δ160p53 isoforms and show interaction between the partners. This indicates that the FLAG/V5 tags do not influence or interfere with the interaction between FLp53 and the isoforms. We have still used FLAG-tagged FLp53, as the endogenous p53 was undetectable and the FLAG-tagged FLp53 did not aggregate alone.

      In the revised paper, we added the following sentences (lines 146-152): “To rule out the possibility that the observed interactions between FLp53 and its isoforms Δ133p53 and Δ160p53 were artifacts caused by the FLAG and V5 antibody epitope tags, we co-expressed FLAG-tagged FLp53 with untagged Δ133p53 and Δ160p53. Immunoprecipitation assays demonstrated that FLAG-tagged FLp53 could indeed interact with the untagged Δ133p53 and Δ160p53 isoforms (Supplementary Figure 3, lanes 3 and 4), confirming formation of hetero-oligomers between FLp53 and its isoforms. These findings demonstrate that Δ133p53 and Δ160p53 can oligomerize with FLp53 and with each other.”

      Additionally, we performed subcellular fractionation experiments to compare the aggregation and localization of FLAG-tagged FLp53 when co-expressed either with V5-tagged or untagged Δ133p53/Δ160p53. In these experiments, the untagged isoforms also induced FLp53 aggregation, mirroring our previous results with the tagged isoforms (Supplementary Figure 5). We’ve added this result in the revised manuscript (lines 236-245): “To exclude the possibility that FLAG or V5 tags contribute to protein aggregation, we also conducted subcellular fractionation of H1299 cells expressing FLAG-tagged FLp53 along with untagged Δ133p53 or Δ160p53 at a 1:5 ratio. The results showed (Supplementary Figure 6) a similar distribution of FLp53 across cytoplasmic, nuclear, and insoluble nuclear fractions as in the case of tagged Δ133p53 or Δ160p53 (Figure 6A to D). Notably, the aggregation of untagged Δ133p53 or Δ160p53 markedly promoted the aggregation of FLAG-tagged FLp53 (Supplementary Figure 6B and D), demonstrating that the antibody epitope tags themselves do not contribute to protein aggregation.” 

      We’ve also discussed this in the Discussion section (lines 349-356): “In our study, we primarily utilized an overexpression strategy involving FLAG/V5-tagged proteins to investigate the effects of p53 isoforms Δ133p53 and Δ160p53 on the function of FLp53. To address concerns regarding potential overexpression artifacts, we performed the co-immunoprecipitation (Supplementary Figure 6) and caspase-3 and -7 activity (Figure 7) experiments with untagged Δ133p53 and Δ160p53. In both experimental systems, the untagged proteins behaved very similarly to the FLAG/V5 antibody epitope-containing proteins (Figures 6 and 7 and Supplementary Figure 6). Hence, the C-terminal tagging of FLp53 or its isoforms does not alter the biochemical and physiological functions of these proteins.”

      In summary, the revised data set and newly added experiments provide strong evidence that neither the FLAG nor the V5 tag contributes to the observed p53 isoform aggregation.

      (2) The authors demonstrate that to visualize the dominant-negative effect, Δ133p53α and Δ160p53α must be "present in a higher proportion than FLp53 in the tetramer" and that they need at least a 1:5 transfection ratio, since the 1:1 ratio shows no effect. However, in almost every single cell type, FLp53 is far more expressed than the isoforms, which makes it very unlikely to reach such stoichiometry in physiological conditions and makes me wonder if this mechanism naturally occurs at the endogenous level. This limitation should be at least discussed.

      Thank you for your insightful comment. Evidence suggests that the expression levels of these isoforms, such as Δ133p53, can be significantly elevated relative to FLp53 in certain physiological conditions(3, 4, 9). For example, in some breast tumors, Δ133p53 mRNA is expressed at much higher levels than FLp53, suggesting a distinct expression profile of p53 isoforms compared to normal breast tissue(4). Similarly, in non-small cell lung cancer and the A549 lung cancer cell line, the expression level of the Δ133p53 transcript is significantly elevated compared to non-cancerous cells(3). Moreover, in specific cholangiocarcinoma cell lines, the Δ133p53/TAp53 expression ratio has been reported to increase to as high as 3:1(9). These observations indicate that the dominant-negative effect of isoform Δ133p53 on FLp53 can occur under certain pathological conditions where the relative amounts of FLp53 and the isoforms vary widely. Since data on the Δ160p53 isoform are scarce, we infer that the long N-terminal truncated isoforms may share a similar mechanism.

      (3) Figure 5C: I am concerned by the subcellular location of the Δ133p53α and Δ160p53α as they are commonly considered nuclear and not cytoplasmic as shown here, particularly since they retain the 3 nuclear localization sequences like the FLp53 (Bourdon JC et al. 2005; Mondal A et al. 2018; Horikawa I et al, 2017; Joruiz S. et al, 2024). However, Δ133p53α can form cytoplasmic speckles (Horikawa I et al, 2017) when it colocalizes with autophagy markers for its degradation.

      The authors should discuss this issue. Could this discrepancy be due to the high overexpression level of these isoforms? A co-staining with autophagy markers (p62, LC3B) would rule out (or confirm) activation of autophagy due to the overwhelming expression of the isoform.

      Thank you for your thoughtful comments. We have thoroughly reviewed all the papers you recommended (Bourdon JC et al., 2005; Mondal A et al., 2018; Horikawa I et al., 2017; Joruiz S. et al., 2024)(4, 29, 30, 31). Among these, only the study by Bourdon JC et al. (2005) provided data regarding the localization of Δ133p53(4). Interestingly, their findings align with our observations, indicating that the protein does not exhibit predominantly nuclear localization in Figure 8 of Jean-Christophe Bourdon et al. (Genes Dev. 2005;19:2122-2137). The discrepancy may be caused by a potentially confusing statement in that paper(4).

      The localization of p53 is governed by multiple factors, including its nuclear import and export(32). The isoforms Δ133p53 and Δ160p53 contain three nuclear localization sequences (NLS)(4). However, these isoforms may be trapped in the cytoplasm by aggregation that masks the NLS. This mechanism would prevent nuclear import.

      Further, we acknowledge that Δ133p53 co-aggregates with the autophagy substrate p62/SQSTM1 and the autophagosome component LC3B in the cytoplasm for autophagic degradation during replicative senescence(33). We agree that high overexpression of these aggregation-prone proteins may induce endoplasmic reticulum (ER) stress and activate autophagy(34). This could explain the cytoplasmic localization in our experiments. However, it is also critical to consider that we observed aggregates in both the cytoplasm and the nucleus (Figures 6B and E and Supplementary Figure 6B). While cytoplasmic localization may involve autophagy-related mechanisms, the nuclear aggregates likely arise from intrinsic isoform properties, such as altered protein folding, independent of autophagy. These dual localizations reflect the complex behavior of Δ133p53 and Δ160p53 isoforms under our experimental conditions.

      In the revised manuscript, we discuss this in the Discussion (lines 328-335): “Moreover, the observed cytoplasmic isoform aggregates may reflect autophagy-related degradation, as suggested by the co-localization of Δ133p53 with autophagy substrate p62/SQSTM1 and autophagosome component LC3B(33). High overexpression of these aggregation-prone proteins could induce endoplasmic reticulum stress and activate autophagy(34). Interestingly, we also observed nuclear aggregation of these isoforms (Figure 6B and E and Supplementary Figure 6B), suggesting that distinct mechanisms, such as intrinsic properties of the isoforms, may govern their localization and behavior within the nucleus. This dual localization underscores the complexity of Δ133p53 and Δ160p53 behavior in cellular systems.”

      Minor concerns:

      -  Figure 1A: the initiation of the "Δ140p53" is shown instead of "Δ40p53"

      Thank you! Figure 1A has been corrected in the revised paper.

      -  Figure 2A: I would like to see the images cropped a bit higher, so the cut does not happen just above the aggregate bands

      Thank you for this suggestion. We’ve changed the image, and the new Figure 2 is shown in the revised paper.

      -  Figure 3C: what ratio of FLp53/Delta isoform was used?

      We have added the ratio in the figure legend of Figure 3C (lines 845-846): “Relative DNA-binding of the FLp53-FLAG protein to the p53-target gene promoters in the presence of the V5-tagged protein Δ133p53 or Δ160p53 at a 1:1 ratio.”

      -  Figure 3C suggests that the "dominant-negative" effect is mostly senescence-specific as it does not affect apoptosis target genes, which is consistent with Horikawa et al, 2017 and Gong et al, 2016 cited above. Furthermore, since these two references and the others from Gong et al. show that Δ133p53α increases DNA repair genes, it would be interesting to look at RAD51, RAD52 or Lig4, and maybe also induce stress.

      Thank you for your thoughtful comments and suggestions. In Figure 3C, the presence of Δ133p53 or Δ160p53 significantly reduced the binding of FLp53 only to the p21 promoter. However, isoforms Δ133p53 and Δ160p53 demonstrated a significant loss of DNA-binding activity at all four promoters: p21, MDM2, and apoptosis target genes BAX and PUMA (Figure 3B). This result suggests that Δ133p53 and Δ160p53 have the potential to influence FLp53 function due to their ability to form hetero-oligomers with FLp53 or their intrinsic tendency to aggregate. To further investigate this, we increased the isoform to FLp53 ratio in Figure 4, which demonstrates that the isoforms Δ133p53 and Δ160p53 exert dominant-negative effects on the function of FLp53.

      These results demonstrate that the isoforms can compromise p53-mediated pathways, consistent with Horikawa et al. (2017), which showed that Δ133p53α overexpression is "non- or less oncogenic and mutagenic" compared to complete p53 inhibition, but still affects specific tumor-suppressing pathways. Furthermore, as noted by Gong et al. (2016), Δ133p53’s anti-apoptotic function under certain conditions is independent of FLp53 and unrelated to its dominant-negative effects.

      We appreciate your suggestion to investigate DNA repair genes such as RAD51, RAD52, or Lig4, especially under stress conditions. While these targets are intriguing and relevant, we believe that our current investigation of p53 targets in this manuscript sufficiently supports our conclusions regarding the dominant-negative effect. Further exploration of additional p53 target genes, including those involved in DNA repair, will be an important focus of our future studies.

      - Figure 5A and B: directly comparing the level of FLp53 expressed in cytoplasm or nucleus to the level of Δ133p53α and Δ160p53α expressed in cytoplasm or nucleus does not mean much since these are overexpressed proteins and therefore depend on the level of expression. The authors should rather compare the ratio of cytoplasmic/nuclear FLp53 to the ratio of cytoplasmic/nuclear Δ133p53α and Δ160p53α.

      Thank you very much for this valuable suggestion. In the revised paper, Figure 5B has been recreated. Changes have been made in lines 214-215: “The cytoplasm-to-nucleus ratio of Δ133p53 and Δ160p53 was approximately 1.5-fold higher than that of FLp53 (Figure 5B).”
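
      The ratio-of-ratios comparison requested by the reviewer can be illustrated with a minimal Python sketch. The band intensities below are hypothetical placeholders chosen for illustration, not values measured in this study, and the function names are assumptions introduced here.

```python
def cyto_nuc_ratio(cytoplasmic: float, nuclear: float) -> float:
    """Return the cytoplasm-to-nucleus ratio for one protein."""
    if nuclear <= 0:
        raise ValueError("nuclear intensity must be positive")
    return cytoplasmic / nuclear


def fold_vs_reference(sample_ratio: float, reference_ratio: float) -> float:
    """Return how many times higher the sample ratio is than the reference ratio."""
    if reference_ratio <= 0:
        raise ValueError("reference ratio must be positive")
    return sample_ratio / reference_ratio


# Hypothetical densitometry values (arbitrary units); NOT data from the study.
flp53_ratio = cyto_nuc_ratio(cytoplasmic=40.0, nuclear=80.0)
isoform_ratio = cyto_nuc_ratio(cytoplasmic=60.0, nuclear=80.0)

# Comparing ratios removes the dependence on each construct's overall expression level.
fold = fold_vs_reference(isoform_ratio, flp53_ratio)
print(f"FLp53 C/N = {flp53_ratio:.2f}, isoform C/N = {isoform_ratio:.2f}, fold = {fold:.2f}")
```

      Normalizing each protein to its own nuclear signal before comparing means that a construct expressed at a higher overall level does not, by itself, shift the result.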

      Referees cross-commenting

      I agree that the system needs to be improved to be more physiological.

      Just to be precise, the D133 and D160 isoforms are not truncated mutants; they are naturally occurring isoforms expressed in almost every normal human cell type from an internal promoter within the TP53 gene.

      Using overexpression always raises concerns, but in this case, I am even more careful because the isoforms are almost always less expressed than the FLp53, and here they have to push them to be 5 to 10 times more expressed than the FLp53 to see the effect, which makes me fear an artifact due to the overwhelming overexpression (which even seems to change the normal localization of the protein).

      To visualize the endogenous proteins, they will have to change cell line as the H1299 they used are p53 null.

      Thank you for these comments. We’ve addressed the motivation of overexpression in the above responses. We needed to use the plasmid constructs in the p53-null cells to detect the proteins but the expression level was certainly not ‘overwhelmingly high’. 

      First, we tried the A549 cells (p53 wild-type) under DNA damage conditions, but the endogenous p53 protein was undetectable. Second, several studies reported increased Δ133p53 levels compared to wild-type p53, with implications for tumor development(2, 3, 4, 9). Third, the apoptotic activity of H1299 cells overexpressing p53 proteins was analyzed in the revised manuscript (Figure 7). The apoptotic activity induced by FLp53 expression was approximately 2.5 times higher than that of the control vector under identical plasmid DNA transfection conditions (Figure 7). These results rule out the possibility that the plasmid-based expression of p53 and its isoforms introduced artifacts in the results. We’ve discussed this in the Results section (lines 254-269).

      Reviewer #3 (Significance):

      Overall, the paper is interesting particularly considering the range of techniques used which is the main strength.

      The main limitation to me is the lack of contradictory discussion as all argumentation presents Δ133p53α and Δ160p53α exclusively as oncogenic and strictly FLp53 dominant-negative when, particularly for Δ133p53α, a quite extensive literature suggests a not so clear-cut activity.

      The aggregation mechanism is reported for the first time for Δ133p53α and Δ160p53α, although it was already published for Δ40p53α, Δ133p53β or in mutant p53.

      This manuscript would be a good basic research addition to the p53 field to provide insight in the mechanism for some activities of some p53 isoforms.

      My field of expertise is the p53 isoforms which I have been working on for 11 years in cancer and neuro-degenerative diseases

      Thank you very much for your positive and critical comments. We’ve included a fair discussion on the oncogenic and non-oncogenic function of Δ133p53 in the Introduction following your suggestion (lines 62-73). 

      References

      (1) Pitolli C, Wang Y, Candi E, Shi Y, Melino G, Amelio I. p53-Mediated Tumor Suppression: DNA-Damage Response and Alternative Mechanisms. Cancers 11 (2019).

      (2) Fujita K, et al. p53 isoforms Delta133p53 and p53beta are endogenous regulators of replicative cellular senescence. Nature cell biology 11, 1135-1142 (2009).

      (3) Fragou A, et al. Increased Δ133p53 mRNA in lung carcinoma corresponds with reduction of p21 expression. Molecular medicine reports 15, 1455-1460 (2017).

      (4) Bourdon JC, et al. p53 isoforms can regulate p53 transcriptional activity. Genes & development 19, 2122-2137 (2005).

      (5) Ghosh A, Stewart D, Matlashewski G. Regulation of human p53 activity and cell localization by alternative splicing. Molecular and cellular biology 24, 7987-7997 (2004).

      (6) Aoubala M, et al. p53 directly transactivates Δ133p53α, regulating cell fate outcome in response to DNA damage. Cell death and differentiation 18, 248-258 (2011).

      (7) Marcel V, et al. p53 regulates the transcription of its Delta133p53 isoform through specific response elements contained within the TP53 P2 internal promoter. Oncogene 29, 2691-2700 (2010).

      (8) Zhao L, Sanyal S. p53 Isoforms as Cancer Biomarkers and Therapeutic Targets. Cancers 14 (2022).

      (9) Nutthasirikul N, Limpaiboon T, Leelayuwat C, Patrakitkomjorn S, Jearanaikoon P. Ratio disruption of the ∆133p53 and TAp53 isoform equilibrium correlates with poor clinical outcome in intrahepatic cholangiocarcinoma. International journal of oncology 42, 1181-1188 (2013).

      (10) Tadijan A, et al. Altered Expression of Shorter p53 Family Isoforms Can Impact Melanoma Aggressiveness. Cancers 13,  (2021).

      (11) Aubrey BJ, Kelly GL, Janic A, Herold MJ, Strasser A. How does p53 induce apoptosis and how does this relate to p53-mediated tumour suppression? Cell death and differentiation 25, 104-113 (2018).

      (12) Ghorbani N, Yaghubi R, Davoodi J, Pahlavan S. How does caspases regulation play role in cell decisions? apoptosis and beyond. Molecular and cellular biochemistry 479, 1599-1613 (2024).

      (13) Petronilho EC, et al. Oncogenic p53 triggers amyloid aggregation of p63 and p73 liquid droplets. Communications chemistry 7, 207 (2024).

      (14) Forget KJ, Tremblay G, Roucou X. p53 Aggregates penetrate cells and induce the coaggregation of intracellular p53. PloS one 8, e69242 (2013).

      (15) Farmer KM, Ghag G, Puangmalai N, Montalbano M, Bhatt N, Kayed R. P53 aggregation, interactions with tau, and impaired DNA damage response in Alzheimer's disease. Acta neuropathologica communications 8, 132 (2020).

      (16) Arsic N, et al. Δ133p53β isoform pro-invasive activity is regulated through an aggregation-dependent mechanism in cancer cells. Nature communications 12, 5463 (2021).

      (17) Melo Dos Santos N, et al. Loss of the p53 transactivation domain results in high amyloid aggregation of the Δ40p53 isoform in endometrial carcinoma cells. The Journal of biological chemistry 294, 9430-9439 (2019).

      (18) Mestrom L, et al. Artificial Fusion of mCherry Enhances Trehalose Transferase Solubility and Stability. Applied and environmental microbiology 85,  (2019).

      (19) Kaba SA, Nene V, Musoke AJ, Vlak JM, van Oers MM. Fusion to green fluorescent protein improves expression levels of Theileria parva sporozoite surface antigen p67 in insect cells. Parasitology 125, 497-505 (2002).

      (20) Snapp EL, et al. Formation of stacked ER cisternae by low affinity protein interactions. The Journal of cell biology 163, 257-269 (2003).

      (21) Jain RK, Joyce PB, Molinete M, Halban PA, Gorr SU. Oligomerization of green fluorescent protein in the secretory pathway of endocrine cells. The Biochemical journal 360, 645-649 (2001).

      (22) Campbell RE, et al. A monomeric red fluorescent protein. Proceedings of the National Academy of Sciences of the United States of America 99, 7877-7882 (2002).

      (23) Hofstetter G, et al. Δ133p53 is an independent prognostic marker in p53 mutant advanced serous ovarian cancer. British journal of cancer 105, 1593-1599 (2011).

      (24) Bischof K, et al. Influence of p53 Isoform Expression on Survival in High-Grade Serous Ovarian Cancers. Scientific reports 9, 5244 (2019).

      (25) Gong L, et al. p53 isoform Δ113p53/Δ133p53 promotes DNA double-strand break repair to protect cell from death and senescence in response to DNA damage. Cell research 25, 351-369 (2015).

      (26) Gong L, et al. p53 isoform Δ133p53 promotes efficiency of induced pluripotent stem cells and ensures genomic integrity during reprogramming. Scientific reports 6, 37281 (2016).

      (27) Gong L, Pan X, Yuan ZM, Peng J, Chen J. p53 coordinates with Δ133p53 isoform to promote cell survival under low-level oxidative stress. Journal of molecular cell biology 8, 88-90 (2016).

      (28) Candeias MM, Hagiwara M, Matsuda M. Cancer-specific mutations in p53 induce the translation of Δ160p53 promoting tumorigenesis. EMBO reports 17, 1542-1551 (2016).

      (29) Horikawa I, et al. Δ133p53 represses p53-inducible senescence genes and enhances the generation of human induced pluripotent stem cells. Cell death and differentiation 24, 1017-1028 (2017).

      (30) Mondal AM, et al. Δ133p53α, a natural p53 isoform, contributes to conditional reprogramming and long-term proliferation of primary epithelial cells. Cell death & disease 9, 750 (2018).

      (31) Joruiz SM, Von Muhlinen N, Horikawa I, Gilbert MR, Harris CC. Distinct functions of wild-type and R273H mutant Δ133p53α differentially regulate glioblastoma aggressiveness and therapy-induced senescence. Cell death & disease 15, 454 (2024).

      (32) O'Brate A, Giannakakou P. The importance of p53 location: nuclear or cytoplasmic zip code? Drug resistance updates : reviews and commentaries in antimicrobial and anticancer chemotherapy 6, 313-322 (2003).

      (33) Horikawa I, et al. Autophagic degradation of the inhibitory p53 isoform Δ133p53α as a regulatory mechanism for p53-mediated senescence. Nature communications 5, 4706 (2014).

      (34) Lee H, et al. IRE1 plays an essential role in ER stress-mediated aggregation of mutant huntingtin via the inhibition of autophagy flux. Human molecular genetics 21, 101-114 (2012).

    1. Author Response

      Reviewer #1 (Public Review):

      Weaknesses:

      1) The authors should better review what we know of fungal Drosophila microbiota species as well as the ecology of rotting fruit. Are the microbiota species described in this article specific to their location/setting? It would have been interesting to know if similar species can be retrieved in other locations using other decaying fruits. The term 'core' in the title suggests that these species are generally found associated with Drosophila but this is not demonstrated. The paper is written in a way that implies the microbiota members they have found are universal. What is the evidence for this? Have the fungal species described in this paper been found in other studies? Even if this is not the case, the paper is interesting, but there should be a discussion of how generalizable the findings are.

      The reviewer inquires as to whether the microbial species described in this article are ubiquitously associated with Drosophila or not. Indeed, most of the microbes described in this manuscript are generally recognized as species associated with Drosophila spp. For example, species such as Hanseniaspora uvarum, Pichia kluyveri, and Starmerella bacillaris have been detected in or isolated from Drosophila spp. collected in European countries as well as the United States and Oceania (Chandler et al., 2012; Solomon et al., 2019). As for the bacteria, species belonging to the genera Pantoea, Lactobacillus, Leuconostoc, and Acetobacter have also previously been detected in wild Drosophila spp. (Chandler et al., 2011). These elucidations will be incorporated into our revised manuscript.

      Nevertheless, the term “core” in the manuscript title may lead to misunderstanding, as the generality does not ensure the ubiquitous presence of these microbial species in every individual fly. Considering this point, we will replace the term with an expression more appropriate to our context.

      2) Can the authors clearly demonstrate that the microbiota species that develop in the banana trap are derived from flies? Are these species found in flies in the wild? Did the authors check that the flies belong to the D. melanogaster species and not to the sister group D. simulans?

      Can the authors clearly demonstrate that the microbiota species that develop in the banana trap are derived from flies? Are these species found in flies in the wild?

      The reviewer asked whether the microbial species identified in the fermented banana samples were derived from flies. To address this question, additional experiments under more controlled conditions, such as the inoculation of specific species of wild flies onto fresh bananas, would be needed. Nevertheless, the microbes may potentially originate from wild flies, as supported by the literature cited in our response to Weakness 1).

      Alternative sources for microbial provenance also merit consideration. For example, microbial entities may be inherently present in unfermented bananas through the infiltration of peel injuries (lines 1141-1142 of the original manuscript). In addition, they could be introduced by insects other than flies, given that both rove beetles (Staphylinidae) and sap beetles (Nitidulidae) were observed in some of the traps. These possibilities will be incorporated into the 'MATERIALS AND METHODS' and 'DISCUSSION' sections of our revised manuscript.

      Did the authors check that the flies belong to the D. melanogaster species and not to the sister group D. simulans?

      Our sampling strategy was designed to target not only D. melanogaster but also other domestic Drosophila species, such as D. simulans, that inhabit human residential areas. After adult flies were caught in each trap, we identified the species as shown in Table S1, thereby showing the presence of either or both D. melanogaster and D. simulans. We will provide these descriptions in MATERIALS AND METHODS and DISCUSSION.

      3) Did the microarrays highlight a change in immune genes (ex. antibacterial peptide genes)? Whatever the answer, this would be worth mentioning. The authors described their microarray data in terms of fed/starved in relation to the Finke article. They should clarify if they observed significant differences between species (differences between species within bacteria or fungi, and more generally differences between bacteria versus fungi).

      Did the microarrays highlight a change in immune genes (ex. antibacterial peptide genes)? Whatever the answer, this would be worth mentioning.

      Regarding the antimicrobial peptide genes, statistical comparisons of our RNA-seq data across different conditions were impracticable because most of them showed low expression levels (refer to Author response table 1, which exhibits the RNA-seq data of the yeast-fed larvae; similar expression profiles were observed in the bacteria-fed larvae). While a subset of genes exhibited significantly elevated expression in the non-supportive conditions relative to the supportive ones, this can be due to intra-sample variability rather than due to distinct nutritional environments. Therefore, it would be difficult to discuss a change in immune genes in the paper. Additionally, the previous study that conducted larval microarray analysis (Zinke et al., 2002) did not explicitly focus on immune genes.

      Author response table 1.

      Antimicrobial peptide genes are not up-regulated by any of the microbes. Antimicrobial peptide gene expression profiles of whole bodies of first-instar larvae fed on yeasts. TPM values of all samples and comparison results of gene expression levels in the larvae fed on supportive and non-supportive yeasts are shown. Antibacterial peptide genes mentioned in Hanson and Lemaitre, 2020 are listed. NA or na, not available.

      They should clarify if they observed significant differences between species (differences between species within bacteria or fungi, and more generally differences between bacteria versus fungi).

      We did not observe significant differences between species within bacteria or fungi, or between bacteria and fungi. For example, the gene expression profiles of larvae fed on the various supporting microbes showed striking similarities to each other, as evidenced by the heat map showing the expression of all genes detected in larvae fed either yeast or bacteria (Author response image 1). Similarities were also observed among larvae fed on distinct non-supporting microbes.

      Author response image 1.

      Gene expression profiles of larvae fed on the various supporting microbes show striking similarities to each other. Heat map showing the gene expression of the first-instar larvae that fed on yeasts or bacteria. Freshly hatched germ-free larvae were placed on banana agar inoculated with each microbe and collected after 15 h feeding to examine gene expression of the whole body. Note that data presented in Figures 3A and 4C in the original manuscript, which were obtained independently, are combined to generate this heat map. The labels under the heat map indicate the microbial species fed to the larvae, with three samples analyzed for each condition. The lactic acid bacteria (“LAB”) include Lactiplantibacillus plantarum and Leuconostoc mesenteroides, while the acetic acid bacterium (“AAB”) represents Acetobacter orientalis. “LAB + AAB” signifies mixtures of the AAB and either one of the LAB species. The asterisk in the label highlights a sample in a “LAB” condition (Leuconostoc mesenteroides), which clustered separately from the other “LAB” samples. Brown abbreviations of scientific names are for the yeast-fed conditions. H. uva, Hanseniaspora uvarum; K. hum, Kazachstania humilis; M. asi, Martiniozyma asiatica; S. cra, Saccharomycopsis crataegensis; P. klu, Pichia kluyveri; S. bac, Starmerella bacillaris; S. cer, Saccharomyces cerevisiae BY4741 strain.

      Only a handful of genes showed different expression patterns between larvae fed on yeast and those fed on bacteria, without any enrichment for specialized gene functions. Thus, it is challenging to discuss the potential differential impacts, if any, of yeast and bacteria on larval growth.

      4) The whole paper - and this is one of its merits - points to a role of the Drosophila larval microbiota in processing the fly food. Are these bacterial and fungal species found in the gut of larvae/adults? Are these species capable of establishing a niche in the cardia of adults as shown recently in the Ludington lab (Dodge et al.,)? Previous studies have suggested that microbiota members stimulate the Imd pathway leading to an increase in digestive proteases (Erkosar/Leulier). Are the microbiota species studied here affecting gut signaling pathways beyond providing branched amino acids?

      The whole paper - and this is one of its merits - points to a role of the Drosophila larval microbiota in processing the fly food. Are these bacterial and fungal species found in the gut of larvae/adults? Are these species capable of establishing a niche in the cardia of adults as shown recently in the Ludington lab (Dodge et al.,)?

      Although we did not investigate the microbiota in the gut of either larvae or adults, we did compare the microbiota within surface-sterilized larvae or adults with those in food samples. We found that adult flies and early-stage food sources, as well as larvae and late-stage food sources, harbor similar microbial species (Figure 1F). Additionally, previous examinations of the gut microbiota in wild adult flies have identified microbial species or taxa congruent with those we isolated from our foods (Chandler et al., 2011; Chandler et al., 2012). We have elaborated on this in our response to Weakness 1).

      While we did not investigate whether these species are capable of establishing a niche in the cardia of adults, we will cite the study by Dodge et al., 2023 in our revised manuscript and discuss the possibility that predominant microbes in adult flies may show a propensity for colonization.

      Previous studies have suggested that microbiota members stimulate the Imd pathway leading to an increase in digestive proteases (Erkosar/Leulier). Are the microbiota species studied here affecting gut signaling pathways beyond providing branched amino acids?

      The reviewer inquires whether the supportive microbes in our study stimulate the gut Imd signaling pathway and induce the expression of digestive protease genes, as demonstrated in a previous study (Erkosar et al., 2015). According to our RNA-seq data, it seems unlikely that the supportive microbes stimulate the signaling pathway. The panels in Author response image 2 provide statistical comparisons of expression levels for seven protease genes between the supportive and the non-supportive conditions. These genes did not exhibit a consistent upregulation in the presence of the supportive microbes (H. uva or K. hum in Author response image 2A; Le. mes + A. ori in Author response image 2B). Rather, they exhibited a tendency to be upregulated under the non-supportive microbes (St. bac or Pi. klu in Author response image 2A; La. pla in Author response image 2B).

      Author response image 2.

      Most of the peptidase genes reported by Erkosar et al., 2015 are more highly expressed under the non-supportive conditions than the supportive conditions. Comparison of the expression levels of seven peptidase genes derived from the RNA-seq analysis of yeast-fed (A) or bacteria-fed (B) first-instar larvae. A previous report demonstrated that the expression of these genes is upregulated upon association with a strain of Lactiplantibacillus plantarum, and that the PGRP-LE/Imd/Relish signaling pathway, at least partially, mediates the induction (Erkosar et al., 2015). H. uva, Hanseniaspora uvarum; K. hum, Kazachstania humilis; P. klu, Pichia kluyveri; S. bac, Starmerella bacillaris; La. pla, Lactiplantibacillus plantarum; Le. mes, Leuconostoc mesenteroides; A. ori, Acetobacter orientalis; ns, not significant.

      Reviewer #2 (Public Review):

      Weaknesses:

      The experimental setting that, the authors think, reflects host-microbe interactions in nature is one of the key points. However, it is not explicitly mentioned whether isolated microbes are indeed colonized in wild larvae of Drosophila melanogaster who eat bananas. Another matter is that this work is rather descriptive and few mechanistic insights are presented. The evidence for the nutritional role of BCAAs is incomplete, and a molecular-level explanation is missing for the "interspecies interactions" between lactic acid bacteria (or yeast) and acetic acid bacteria that assure their inhabitation. Apart from these matters, the future directions or significance of this work could be discussed more in the manuscript.

      The experimental setting that, the authors think, reflects host-microbe interactions in nature is one of the key points. However, it is not explicitly mentioned whether isolated microbes are indeed colonized in wild larvae of Drosophila melanogaster who eat bananas.

      The reviewer asks whether the isolated microbes colonize the larval gut. Previous studies on microbial colonization associated with Drosophila have predominantly focused on adults (Pais et al., PLOS Biology, 2018) rather than on larval stages. Developing larvae continually consume substrates that have already undergone microbial fermentation and are abundant in live microbes until the end of the feeding larval stage. Therefore, we consider it difficult to discuss microbial colonization in the larval gut. We will add this point to the DISCUSSION of the revised manuscript.

      Another matter is that this work is rather descriptive and few mechanistic insights are presented. The evidence for the nutritional role of BCAAs is incomplete, and a molecular-level explanation is missing for the "interspecies interactions" between lactic acid bacteria (or yeast) and acetic acid bacteria that assure their inhabitation.

      While we recognize the importance of comprehensive mechanistic analysis, this study includes all the data that were experimentally feasible to obtain. Elucidation of more detailed molecular mechanisms lies beyond the scope of this study and will be the subject of future research.

      Regarding the nutritional role of BCAAs, the incorporation of BCAAs enabled larvae fed with the non-supportive yeast to grow to the second instar. This observation suggests that consumption of BCAAs upregulates diverse genes involved in cellular growth processes in larvae. We have discussed the hypothetical interaction between lactic acid bacteria (LAB) and acetic acid bacteria (AAB) in the manuscript (lines 402-405): LAB may facilitate lactate provision to AAB, consequently enhancing the biosynthesis of essential nutrients such as amino acids. To test this hypothesis, future experiments will include the supplementation of AAB culture plates with lactic acid and the co-inoculation of AAB with LAB mutant strains defective in lactate production, to assess both larval growth and continuous larval association with AAB. With respect to AAB-yeast interactions, metabolites released from yeast cells might benefit AAB growth, and this possibility will be investigated through the supplementation of AAB culture plates with candidate metabolites identified in the cell suspension supernatants of the late-stage yeasts.

      Apart from these matters, the future directions or significance of this work could be discussed more in the manuscript.

      We appreciate the reviewer's recommendations and will include additional descriptions regarding these aspects in the DISCUSSION section.

      Reviewer #3 (Public Review):

      Weaknesses:

      Despite describing important findings, I believe that a more thorough explanation of the experimental setup and the steps expected to occur in the exposed diet over time, starting with natural "inoculation" could help the reader, in particular the non-specialist, grasp the rationale and main findings of the manuscript. When exactly was the decision to collect early-stage samples made? Was it when embryos were detected in some of the samples? What are the implications of bacterial presence in the no-fly traps? These samples also harbored complex microbial communities, as revealed by sequencing. Were these samples colonized by microbes deposited with air currents? Were they the result of flies that touched the material but did not lay eggs? Could the traps have been visited by other insects? Another interesting observation that could be better discussed is the fact that adult flies showed a microbiome that more closely resembles that of the early-stage diet, whereas larvae have a more late-stage-like microbiome. It is easy to understand why the microbiome of the larvae would resemble that of the late-stage foods, but what about the adult microbiome? Authors should discuss or at least acknowledge the fact that there must be a microbiome shift once adults leave their food source. Lastly, the authors should provide more details about the metabolomics experiments. For instance, how were peaks assigned to leucine/isoleucine (as well as other compounds)? Were both retention times and MS2 spectra always used? Were standard curves produced? Were internal, deuterated controls used?

      When exactly was the decision to collect early-stage samples made? Was it when embryos were detected in some of the samples?

      We collected traps and early-stage samples 2.5 days after setting up the traps. This time frame was determined by pilot experiments. A shorter collection time resulted in a greater likelihood of obtaining no-fly traps, whereas a longer collection time caused larval overcrowding, as well as adults’ deaths from drowning in the liquid seeping out of fruits. These procedural details will be delineated in the MATERIALS AND METHODS section of the revised manuscript.

      What are the implications of bacterial presence in the no-fly traps? These samples also harbored complex microbial communities, as revealed by sequencing. Were these samples colonized by microbes deposited with air currents? Were they the result of flies that touched the material but did not lay eggs? Could the traps have been visited by other insects?

      We assume that the origins of the microbes detected in the no-fly trap foods vary depending on the species. For instance, Colletotrichum musae, the fungus that causes banana anthracnose, may have been present in fresh bananas before trap placement. The filamentous fungi could have originated from airborne spores, but they could also have been introduced by insects that feed on these fungi. We will include these possibilities in the DISCUSSION section of the revised manuscript.

      Another interesting observation that could be better discussed is the fact that adult flies showed a microbiome that more closely resembles that of the early-stage diet, whereas larvae have a more late-stage-like microbiome. It is easy to understand why the microbiome of the larvae would resemble that of the late-stage foods, but what about the adult microbiome? Authors should discuss or at least acknowledge the fact that there must be a microbiome shift once adults leave their food source.

      We are grateful for the reviewer's insightful suggestions regarding shifts in the adult microbiome. We plan to include in the DISCUSSION section of the revised manuscript the possibility that the microbial composition may change substantially during pupal stages and that microbes obtained after eclosion could potentially form the adult gut microbiota.

      Lastly, the authors should provide more details about the metabolomics experiments. For instance, how were peaks assigned to leucine/isoleucine (as well as other compounds)? Were both retention times and MS2 spectra always used? Were standard curves produced? Were internal, deuterated controls used?

      We appreciate the reviewer's advice. Detailed methods of the metabolomic experiments will be included in our revised manuscript.

    1. Author response:

      Reviewer #1 (Public review):

      (1) Some details are not described for experimental procedures. For example, what were the pharmacological drugs dissolved in, and what vehicle control was used in experiments? How long were pharmacological drugs added to cells?

      We apologise for the oversight. These details have now been added to the methods section of the manuscript as well as to the relevant figure legends.

      Briefly, latrunculin was used at a final concentration of 250 nM and Y27632 at a final concentration of 50 μM. Both drugs were dissolved in DMSO. Vehicle controls were performed with DMSO at the higher of the two final DMSO concentrations used for the drugs.

      The details of the drug treatments and their duration were added to the methods and to figures 6, S10, and S12.

      (2) Details are missing from the Methods section and Figure captions about the number of biological and technical replicates performed for experiments. Figure 1C states the data are from 12 beads on 7 cells. Are those same 12 beads used in Figure 2C? If so, that information is missing from the Figure 2C caption. Similarly, this information should be provided in every figure caption so the reader can assess the rigor of the experiments. Furthermore, how heterogenous would the bead displacements be across different cells? The low number of beads and cells assessed makes this information difficult to determine.

      We apologise for the oversight. We have now added this data to the relevant figure panels.

      To gain a further understanding of the heterogeneity of bead displacements across cells, we have replotted the relevant graphs using different colours to indicate different cells. This reveals that different cells appear to behave similarly and that the behaviour appears controlled by distance to the indentation or the pipette tip rather than cell identity.

      We agree with the reviewer that the number of cells examined is low. This is due to the challenging nature of the experiments, which means that many attempts are necessary to obtain a successful measurement.

      The experiments in Fig 1C are a verification of a behaviour documented in a previous publication [1]. Here, we just confirm the same behaviour and therefore we decided that only a small number of cells was needed.

      The experiments in Fig 2C (that allow for a direct estimation of the cytoplasm’s hydraulic permeability) require formation of a tight seal between the glass micropipette and the cell, something known as a gigaseal in electrophysiology. The success rate of this first step is 10-30% of attempts for an experienced experimenter. The second step is forming a whole cell configuration, in which a hydraulic link is formed between the cell and the micropipette. This step has a success rate of ~ 50%. Whole cell links are very sensitive to any disturbance. After reaching the whole cell configuration, we applied relatively high pressures that occasionally resulted in loss of link between the cell and the micropipette. In summary, for the 12 successful measurements, hundreds of unsuccessful attempts were carried out.
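
      The success rates quoted above can be combined into a rough estimate of the experimental effort. The following is a minimal sketch, assuming the two steps succeed independently and ignoring the later loss of the whole-cell link under high pressure, so it gives a lower bound; the function name `expected_attempts` is hypothetical and only the percentages and the count of 12 come from the text.

```python
import math

def expected_attempts(successes, p_seal, p_whole_cell):
    """Expected number of attempts needed to obtain `successes` measurements
    when each attempt must pass two independent steps in sequence."""
    p_overall = p_seal * p_whole_cell  # probability that a single attempt succeeds
    return successes / p_overall

# Figures quoted in the response: 12 successful measurements,
# gigaseal success 10-30%, whole-cell success ~50%.
low = expected_attempts(12, 0.30, 0.50)   # optimistic end of the range
high = expected_attempts(12, 0.10, 0.50)  # pessimistic end of the range
print(f"Expected attempts: roughly {low:.0f} to {high:.0f}")
```

      Under these assumptions the overall per-attempt success rate is 5-15%, which corresponds to roughly 80 to 240 attempts before any losses during pressure application are counted.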

      (3) The full equation for displacement vs. time for a poroelastic material is not provided. Scaling laws are shown, but the full equation derived from the stress response of an elastic solid and viscous fluid is not shown or described.

      We thank the reviewer for this comment. Based on our experiments, we found that the cytoplasm behaves as a poroelastic material. However, to understand the displacements of the cell surface in response to localised indentation, we show that we also need to take the tension of the submembranous cortex into account. In summary, the interplay between cell surface tension generated by the cortex and the poroelastic cytoplasm controls the cell behaviour. To our knowledge, no simple analytical solutions to this type of problem exist.

      In Fig 1, we show that the response of the cell to local indentation is biphasic with a short time-scale displacement followed by a longer time-scale one. In Figs 2 and 3, we directly characterise the kinetics of cell surface displacement in response to microinjection of fluid. These kinetics are consistent with the long time-scale displacement but not the short time-scale one. Scaling considerations led us to propose that tension in the cortex may play a role in mediating the short time-scale displacement. To verify this hypothesis, we have now added new data showing that the length-scale of an indentation created by an AFM probe depends on tension in the cortex (Fig S5).

      In a previous publication [2], we derived the temporal dynamics of cell surface displacement for a homogeneous poroelastic material in response to a change in osmolarity. In the current manuscript, the composite nature of the cell (membrane, cortex, cytoplasm) needs to be taken into account as well as a realistic cell shape. Therefore, we did not attempt to provide an analytical solution for the displacement of the cell surface versus time in the current work. Instead, we turned to finite element modelling to show that our observations are qualitatively consistent with a cell that comprises a tensed submembranous actin cortex and a poroelastic cytoplasm (Fig 4). We have now added text to make this clearer for the reader.

      Reviewer #2 (Public review):

      Comments & Questions:

      The authors state, "Next, we sought to quantitatively understand how the global cellular response to local indentation might arise from cellular poroelasticity." However, the evidence presented in the following paragraph appears more qualitative than strictly quantitative. For instance, the length scale estimate of ~7 μm is only qualitatively consistent with the observed ~10 μm, and the timescale 𝜏𝑧 ≈ 500 ms is similarly described as "qualitatively consistent" with experimental observations. Strengthening this point would benefit from more direct evidence linking the short timescale to cell surface tension. Have you tried perturbing surface tension and examining its impact on this short-timescale relaxation by modulating acto-myosin contractility with Y-27632, depolymerizing actin with Latrunculin, or applying hypo/hyperosmotic shocks?

      Upon rereading our manuscript, we agree with the reviewer that some of our statements are too strong. We have now moderated these and clarified the goal of that section of the text.

      The reviewer asks if we have examined the effect of various perturbations on the short time-scale displacements. In our experimental conditions, we cannot precisely measure the time-scale of the fast relaxation because its duration is comparable to the frame rate of our image acquisition. However, we examined the amplitude of the displacement of the first phase in response to sucrose treatment and we have carried out new experiments in which we treat cells with 250nM Latrunculin to partially depolymerise cellular F-actin. Neither of these treatments had an impact on the amplitude of vertical displacements (Author response image 1).

      The absence of change in response to Latrunculin may be because the treatment decreases both the elasticity of the cytoplasm E and the cortical tension γ. As the length-scale l of the surface deformation increases with γ and decreases with E, the two effects of Latrunculin treatment may compensate one another and result in only small changes in l. We have now added this data to supplementary information and comment on this in the text.

      Author response image 1:

      Amplitude of the short time-scale displacements of beads in response to AFM indentation at δx=0µm for control cells, sucrose treated cells, and cells treated with Latrunculin B. n indicates the number of cells examined and N the number of beads.

      The reviewer’s comment also made us want to determine how cortical tension affects the length-scale of the cell surface deformation created by localised micro indentation. To isolate the role of the cortex from that of cell shape, we decided to examine rounded mitotic cells. In our experiments, we indented a mitotic cell expressing a membrane targeted GFP with a sharp AFM tip (Author response image 2).

      We adjusted the force to generate a 2μm deep indentation and imaged the cell profile with confocal microscopy before and during indentation. Segmentation of this data allowed us to determine the cell surface displacement resulting from indentation and to measure a length scale of deformation. In control conditions, the length scale of the deformation is on the order of 1.2μm. When we inhibited myosin contractility with blebbistatin, the length-scale of deformation decreased significantly to 0.8μm, as expected if we decrease the surface tension γ without affecting the cytoplasmic elasticity. We have now added this data to our manuscript.

      Author response image 2.

      (a) Overlay of the zx profiles of a mitotic cell before (green) and during indentation (red). The cell membrane is labelled with CellMask DeepRed. The arrowhead indicates the position of the AFM tip. Scale bar 10µm. (b) Position of the membrane along the top half of the cell before (green) and during (red) indentation. The membrane position is derived from segmentation of the data in (a). Deformation is highly localised and membrane profiles overlap at the edges. The tip position is marked by an *. (c) The difference in membrane height between pre-indentation and indentation profiles plotted in (b) with the tip located at x=0. (d) Schematic of the cell surface profile during indentation and the corresponding length scale of the deformation induced by indentation. (e) Measured length scale for an indentation ~2µm for DMSO control l=1.2±0.2µm (n=8 cells) and with blebbistatin treatment (100µM) l=0.8±0.4µm (n=9 cells) (p=0.016).

      The authors demonstrate that the second relaxation timescale increases (Figure 1, Panel D) following a hyperosmotic shock, consistent with cytoplasmic matrix shrinkage, increased friction, and consequently a longer relaxation timescale. While this result aligns with expectations, is a seven-fold increase in the relaxation timescale realistic based on quantitative estimates given the extent of volume loss?

      We thank the reviewer for this interesting question. Upon re-examining our data, we realised that the numerical values in the text related to the average rather than the median of our measurements. The median of the poroelastic time constant increases from ~0.4s in control conditions to 1.4s in sucrose, representing approximately a 3.5-fold increase.

      Previous work showed that HeLa cell volume decreases by ~40% in response to hyperosmotic shock [3]. The fluid volume fraction in cells is ~65-75%. If we assume that the water is contained in N pores of equal volume, we can express the cell volume as the sum of the total pore volume and V_s, the volume of the solid fraction. We can rewrite this expression in terms of a parameter ϕ = 0.42-0.6. As V_s does not change in response to osmotic shock, we can rewrite the volume change to obtain the change in pore size.

      The poroelastic diffusion constant scales with the square of the hydraulic pore size and the poroelastic timescale scales inversely with the diffusion constant. Therefore, the measured change in volume leads to a predicted increase in poroelastic diffusion time of 1.7-1.9-fold, smaller than observed in our experiments. This suggests that some intuition can be gained in a straightforward manner assuming that the cytoplasm is a homogenous porous material.
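
      The scaling estimate above can be sketched numerically. This is a minimal illustration under assumptions that are not spelled out in the response: the solid volume and the number of pores stay constant, all of the lost volume is fluid, the pore volume scales as the cube of the pore size, the poroelastic diffusion constant scales as the square of the pore size, and the poroelastic time scales inversely with the diffusion constant at a fixed length scale. The function name and its parameters are hypothetical.

```python
def poroelastic_time_fold_change(fluid_fraction, volume_loss):
    """Fold change in poroelastic relaxation time after a loss of cell volume.

    Assumptions (illustrative, not taken from the authors' equations):
      * the solid volume and the number of pores stay constant,
      * all lost volume is fluid, so pore volume scales as pore_size**3,
      * D_p scales as pore_size**2 and t_p as 1 / D_p at a fixed length scale.

    fluid_fraction: initial fluid volume / initial cell volume (e.g. 0.65).
    volume_loss:    lost volume / initial cell volume (e.g. 0.40).
    """
    remaining_fluid = fluid_fraction - volume_loss
    if remaining_fluid <= 0:
        raise ValueError("volume loss exceeds the available fluid volume")
    # Ratio of pore size after the shock to pore size before the shock.
    pore_size_ratio = (remaining_fluid / fluid_fraction) ** (1.0 / 3.0)
    # D_p is assumed proportional to the square of the pore size.
    diffusion_ratio = pore_size_ratio ** 2
    # t_p is assumed inversely proportional to D_p.
    return 1.0 / diffusion_ratio


if __name__ == "__main__":
    for fluid_fraction in (0.65, 0.75):
        fold = poroelastic_time_fold_change(fluid_fraction, volume_loss=0.40)
        print(f"fluid fraction {fluid_fraction:.2f}: t_p increases {fold:.2f}-fold")
```

      With a 40% volume loss and fluid fractions of 0.65 and 0.75, this sketch gives increases of roughly 1.9-fold and 1.7-fold respectively, in line with the range quoted above and below the ~3.5-fold change in the measured median.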

      However, the reality is more complex and the hydraulic pore size is distinct from the entanglement length of the cytoskeleton mesh, as we discussed in a previous publication [4]. When the fluid fraction becomes sufficiently small, macromolecular crowding will impact diffusion further and non-linearities will arise. We have now added some of these considerations to the discussion.

      If the authors' hypothesis is correct, an essential physiological parameter for the cytoplasm could be the permeability k and how it is modulated by perturbations, such as volume loss or gain. Have you explored whether the data supports the expected square dependency of permeability on hydraulic pore size, as predicted by simple homogeneity assumptions?

      We thank the reviewer for this comment. As discussed above, we have explored such considerations in a previous publication (see discussion in [4]). Briefly, we find that the entanglement length of the F-actin cytoskeleton does play a role in controlling the hydraulic pore size but is distinct from it. Membrane bounded organelles could also contribute to setting the pore size. In our previous publication, we derived a scaling relationship that indicates that four different length-scales contribute to setting cellular rheology: the average filament bundle length, the size distribution of particles in the cytosol, the entanglement length of the cytoskeleton, and the hydraulic pore size. Many of these length-scales can be dynamically controlled by the cell, which gives rise to complex rheology. We have now added these considerations to our discussion.

      Additionally, do you think that the observed decrease in k in mitotic cells compared to interphase cells is significant? I would have expected the opposite naively as mitotic cells tend to swell by 10-20 percent due to the mitotic overshoot at mitotic entry (see Son Journal of Cell Biology 2015 or Zlotek Journal of Cell Biology 2015).

      We thank the reviewer for this interesting question. Based on the same scaling arguments as above, we would expect that a 10-20% increase in cell volume would give rise to 10-20% increase in diffusion constant. However, we also note that metaphase leads to a dramatic reorganisation of the cell interior and in particular membrane-bounded organelles. In summary, we do not know why such a decrease could take place. We now highlight this as an interesting question for further research.

      Based on your results, can you estimate the pore size of the poroelastic cytoplasmic matrix? Is this estimate realistic? I wonder whether this pore size might define a threshold above which the diffusion of freely diffusing species is significantly reduced. Is your estimate consistent with nanobead diffusion experiments reported in the literature? Do you have any insights into the polymer structures that define this pore size? For example, have you investigated whether depolymerizing actin or other cytoskeletal components significantly alters the relaxation timescale?

      We thank the reviewer for this comment. We cannot directly estimate the hydraulic pore size from the measurements performed in the manuscript. Indeed, while we understand the general scaling laws, the pre-factors of such relationships are unknown.

      We carried out experiments aiming at estimating the hydraulic pore size in previous publications [3,4] and others have shown spatial heterogeneity of the cytoplasmic pore size [5]. In our previous experiments, we examined the diffusion of PEGylated quantum dots (14nm in hydrodynamic radius). In isosmotic conditions, these diffused freely through the cell but when the cell volume was decreased by a hyperosmotic shock, they no longer moved [3,4]. This gave an estimate of the pore radius of ~15nm.

      Previous work has suggested that F-actin plays a role in dictating this pore size but microtubules and intermediate filaments do not [4].

      There are no quantifications in Figure 6, nor is there a direct comparison with the model. Based on your model, would you expect the velocity of bleb growth to vary depending on the distance of the bleb from the pipette due to the local depressurization? Specifically, do blebs closer to the pipette grow more slowly?

      We apologise for the oversight. The quantifications are presented in Fig S10 and Fig S12. We have now modified the figure legends accordingly.

      Blebs are very heterogenous in size and growth velocity within a cell and across cells in the population in normal conditions [6]. Other work has shown that bleb size is controlled by a competition between pressure driving growth and actin polymerisation arresting it [7]. Therefore, we did not attempt to determine the impact of depressurisation on bleb growth velocity or size.

      In experiments in which we suddenly increased pressure in blebbing cells, we did notice a change in the rate of growth of blebs that occurred after we increased pressure (Author response image 3). However, the experiments are technically challenging and we decided not to perform more.

      Author response image 3:

      A. A hydraulic link is established between a blebbing cell and a pipette. At time t>0, a step increase in pressure is applied. B. Kymograph of bleb growth in a control cell (top) and in a cell subjected to a pressure increase at t=0s (bottom). Top: In control blebs, the rate of growth is slow and approximately constant over time. The black arrow shows the start of blebbing. Bottom: The black arrow shows the start of blebbing. The dashed line shows the timing of pressure application and the red arrow shows the increase in growth rate of the bleb when the pressure increase reaches the bleb. This occurs with a delay δt.

      I find it interesting that during depressurization of the interphase cells, there is no observed volume change, whereas in pressurization of metaphase cells, there is a volume increase. I assume this might be a matter of timescale, as the microinjection experiments occur on short timescales, not allowing sufficient time for water to escape the cell. Do you observe the radius of the metaphase cells decreasing later on? This relaxation could potentially be used to characterize the permeability of the cell surface.

      We thank the reviewer for this comment.

      First, we would like to clarify that both metaphase and interphase cells increase their volume in response to microinjection. The effect is easier to quantify in metaphase cells because we assume spherical symmetry and just monitor the evolution of the radius (Fig 3). However, the displacement of the beads in interphase cells (Fig 2) clearly shows that the cell volume increases in response to microinjection. For both interphase and metaphase cells, when the injection is prolonged, the membrane eventually detaches from the cortex and large blebs form until cell lysis. In contrast to the reviewer’s intuition, we never observe a relaxation in cell volume, probably because we inject fluid faster than the cell can compensate volume change through regulatory mechanisms involving ion channels.

      When we depressurise metaphase cells, we do not observe any change in volume (Fig S10). This contrasts with the increase that we observe upon pressurisation. The main difference between these two experiments is the pressure differential. During depressurisation experiments, this is the hydraulic pressure within the cell ~500Pa (Fig 6A); whereas during pressurisation experiments, this is the pressure in the micropipette, ranging from 1.4-10 kPa (Fig 3). We note in particular that, when we used the lowest pressures in our experiments, the increase in volume was very slow (see Fig 3C). Therefore, we agree with the reviewer that it is likely the magnitude of the pressure differential that explains these differences.

      I am curious about the saturation of the time lag at 30 microns from the pipette in Figure 4, Panel E for the model's prediction. A saturation which is not clearly observed in the experimental data. Could you comment on the origin of this saturation and the observed discrepancy with the experiments (Figure E panel 2)? Naively, I would have expected the time lag to scale quadratically with the distance from the pipette, as predicted by a poroelastic model and the diffusion of displacement. It seems weird to me that the beads start to move together at some distance from the pipette or else I would expect that they just stop moving. What model parameters influence this saturation? Does membrane permeability contribute to this saturation?

      We thank the reviewer for pointing this out. In our opinion, the saturation occurring at 30 microns arises from the geometry of the model. At the largest distance away from the micropipette, the cortex becomes dominant in the mechanical response of the cell because it represents an increasing proportion of the cellular material.

      To test this hypothesis, we will rerun our finite element models with a range of cell sizes. This will be added to the manuscript at a later date.

      Reviewer #3 (Public review):

      Weaknesses: I have two broad critical comments:

      (1) I sense that the authors are correct that the best explanation of their results is the passive poroelastic model. Yet, to be thorough, they have to try to explain the experiments with other models and show why their explanation is parsimonious. For example, one potential explanation could be some mechanosensitive mechanism that does not involve cytoplasmic flow; another could be viscoelastic cytoskeletal mesh, again not involving poroelasticity. I can imagine more possibilities. Basically, be more thorough in the critical evaluation of your results. Besides, discuss the potential effect of significant heterogeneity of the cell.

      We thank the reviewer for these comments and we agree with their general premise.

      Some observations could qualitatively be explained in other ways. For example, if we considered the cell as a viscoelastic material, we could define a time constant τ = η/E with η the viscosity and E the elasticity of the material. The increase in relaxation time with sucrose treatment could then be explained by an increase in viscosity. However, work by others has previously shown that, in the exact same conditions as our experiment, viscoelasticity cannot account for the observations [1]. In its discussion, this study proposed poroelasticity as an alternative mechanism but did not investigate that possibility. This was consistent with our work that showed that the cytoplasm behaves as a poroelastic material and not as a viscoelastic material [4]. Therefore, we decided not to consider viscoelasticity as a possibility. We now explain this reasoning better and have added a sentence about a potential role for mechanotransductory processes in the discussion.

      (2) The study is rich in biophysics but a bit light on chemical/genetic perturbations. It could be good to use low levels of chemical inhibitors for, for example, Arp2/3, PI3K, myosin etc, and see the effect and try to interpret it. Another interesting question - how adhesive strength affects the results. A different interesting avenue - one can perturb aquaporins. Etc. At least one perturbation experiment would be good.

      We agree with the reviewer. In our previous studies, we already examined what biological structures affect the poroelastic properties of cells [2,4]. Therefore, the most interesting aspect to examine in our current work would be perturbations to the phenomenon described in Fig 6G and, in particular, to investigate what volume regulation mechanisms enable sustained intracellular pressure gradients. However, these experiments are particularly challenging and with very low throughput. Therefore, we feel that these are out of the scope of the present report and we mention these as promising future directions.

      References:

      (1) Rosenbluth, M. J., Crow, A., Shaevitz, J. W. & Fletcher, D. A. Slow stress propagation in adherent cells. Biophys J 95, 6052-6059 (2008). https://doi.org/10.1529/biophysj.108.139139

      (2) Esteki, M. H. et al. Poroelastic osmoregulation of living cell volume. iScience 24, 103482 (2021). https://doi.org/10.1016/j.isci.2021.103482

      (3) Charras, G. T., Mitchison, T. J. & Mahadevan, L. Animal cell hydraulics. J Cell Sci 122, 3233-3241 (2009). https://doi.org/10.1242/jcs.049262

      (4) Moeendarbary, E. et al. The cytoplasm of living cells behaves as a poroelastic material. Nat Mater 12, 253-261 (2013). https://doi.org/10.1038/nmat3517

      (5) Luby-Phelps, K., Castle, P. E., Taylor, D. L. & Lanni, F. Hindered diffusion of inert tracer particles in the cytoplasm of mouse 3T3 cells. Proc Natl Acad Sci U S A 84, 4910-4913 (1987). https://doi.org/10.1073/pnas.84.14.4910

      (6) Charras, G. T., Coughlin, M., Mitchison, T. J. & Mahadevan, L. Life and times of a cellular bleb. Biophys J 94, 1836-1853 (2008). https://doi.org/10.1529/biophysj.107.113605

      (7) Tinevez, J. Y. et al. Role of cortical tension in bleb growth. Proc Natl Acad Sci U S A 106, 18581-18586 (2009). https://doi.org/10.1073/pnas.0903353106

    1. Author response:

      Reviewer #1 (Evidence, reproducibility and clarity (Required)): 

      Summary: 

      Laura Morano and colleagues have performed a screen to identify compounds that interfere with the formation of TopBP1 condensates. TopBP1 plays a crucial role in the DNA damage response, and specifically the activation of ATR. They found that the GSK-3β inhibitor AZD2858 reduced the formation of TopBP1 condensates and activation of ATR and its downstream target CHK1 in colorectal cancer cell lines treated with the clinically relevant irinotecan active metabolite SN-38. This inhibition of TopBP1 condensates by AZD2858 was independent from its effect on GSK-3β enzymatic activity. Mechanistically, they show that AZD2858 thus can interfere with intra-S-phase checkpoint signaling, resulting in enhanced cytostatic and cytotoxic effects of SN-38 (or SN-38+Fluorouracil aka FOLFIRI) in vitro in colorectal carcinoma cell lines.

      Major comments: 

      Overall the work is rigorous and the main conclusions are convincing. However, they only show the effects of their combination treatments on colorectal cancer cell lines. I'm worried that blocking the formation of TopBP1 condensates will also be detrimental in non-transformed cells. Furthermore it is somewhat disappointing that it remains unclear how AZD2858 blocks self-assembly of TopBP1 condensates, although I understand that unraveling this would be complex and somewhat out-of-reach for now.

      We appreciate your feedback and fully recognize the importance of understanding how AZD2858 blocks the assembly of TopBP1 condensates. While we understand your disappointment, addressing this question remains a key focus for us. Keeping in mind that unravelling such a mechanism in vitro or in vivo is rather challenging, we have consulted an expert who has made efforts to predict the potential docking sites of AZD2858 on TopBP1, which may provide valuable insights for future experimental investigations. Using an AlphaFold model (no crystal or cryo-EM structure is available) and looking for suitable pockets or cavities in which AZD2858 could bind, the analyses, though requiring cautious interpretation, suggested that AZD2858 may target the BRCT1 and BRCT8 domains of TopBP1 (as shown below, two pockets, n°1 and n°7, with sufficient volume and surrounded by β-sheet structures, as for other GSK3 inhibitors).

      However, these are preliminary results that require further exploration and experimental validation to confirm their significance and mechanistic implications.

      Author response image 1.

      Here are some specific points for improvement: 

      (1) The authors conclude that "These data supports [sic] the feasibility of targeting condensates formed in response to DNA damage to improve chemotherapy-based cancer treatments". To support this conclusion the authors need to show that proliferating non-transformed cells (e.g. primary cell cultures or organoids) can tolerate the combination of AZD2858 + SN-38 (or FOLFIRI) better than colorectal cancer cells. 

      We would like to thank the reviewer for this vital suggestion to test whether this combination is effective on tumor cells while having low toxicity on healthy cells. We therefore used a healthy colon cell line (CCD841) and tested the efficacy of each treatment alone (FOLFIRI and AZD2858) as well as the combination FOLFIRI+AZD2858. We compared the results obtained in the CCD841 cell line with those obtained in the HCT116 colorectal cancer cell line. The results presented below show not only that each treatment alone is much less effective on the CCD841 line, but also that the combination is not synergistic.

      Author response image 2.

      Page 19 "This suggests that the combination... arrests the cell cycle before mitosis in a DNA-PKcs-dependent manner." I find the remark that this arrest would be DNA-PKcs-dependent too speculative. I suppose that the authors base this claim on reference 55 but if they want to support this claim they need to prove this by adding DNA-PKcs inhibitors to their treated cells.

      Thank you for your thoughtful comment. We agree with the reviewer that claiming the G2/M arrest is DNA-PKcs-dependent without direct experimental evidence is speculative. While we initially based this hypothesis on reference 55, we acknowledge that further experiments, such as the use of DNA-PKcs inhibitors, would be necessary to robustly support this claim.

      Given that this observation was intended as a potential explanation for the G2/M arrest observed at 6 and 12 hours of treatment with AZD2858 + SN-38 (compared to SN-38 alone), and considering that exploring this pathway is not the primary focus of our study, we have decided to remove this hypothesis from both the figure and the text to avoid any ambiguity.

      We appreciate the reviewer’s input and will consider investigating this pathway in future studies.

      (2) When discussing Figure S5B the authors claim that SN-38 + AZD2858 progressively increases the fractions of BrdU positive cells, but this is not supported by statistical analysis.

      The fractions are still very small, so I would like to see statistics on these data. Alternatively, the authors could take out this conclusion. 

      Thank you for your valuable comment. In response, we have conducted a statistical analysis (Mann-Whitney test) on the data, and the results have been added to Figure S5C for the 6-hour time point and Figure S5D for the 12-hour time point, based on three independent biological replicates. We hope this provides the necessary clarification.

      Minor comments: 

      - Page 5 Materials and methods - Cell culture. Last sentence "Add in what medium you cultured them" looks like an internal review remark and should probably be removed? 

      We apologize for this oversight. The medium has now been specified, and the sentence has been removed.

      - The numbers in all the synergy matrices (in white font) are extremely small and virtually unreadable, and visually distracting. I recommend taking these out altogether. 

      We believe that the reduction in figure quality may be due to the PDF compression, which affected the resolution of the figures. We are happy to provide high-resolution versions of the figures separately for clarity. If the issue persists even with the higher resolution, we will consider removing the numbers, as suggested.

      - The legends of the synergy matrices (for example Fig 1D, 4E, 5, 6) are often extremely small, making it difficult to understand them intuitively. Please enlarge them and label them more clearly, and use larger fonts. In the legend of Figure 5D,E a green matrix indicating % live cells is mentioned but I don't see it. Do they mean the grey matrix? 

      We have enlarged the figure legends and will provide high-resolution versions of the figures to ensure all details are clearly readable. Regarding Figure 5D,E: we acknowledge that the color may appear differently (more green or gray) depending on the display or printer settings. To avoid any confusion, we have corrected the legend to specify that the color in question is khaki, rather than green. Moreover, following the suggestions of Reviewer #2, these figures have been moved to Figure S6B and S6C, respectively.

      - Figure S2. Perhaps I misunderstand the PML body experiment but the authors seem to use PML body formation to support their idea that AZD2858 blocks TopBP1 condensate formation and not just any condensate formation. However, if this is the case they would need a proper positive control, i.e. an additional experimental condition in which they do see PML bodies.

      Arsenic is a well-known positive control for experiments involving PML bodies due to its ability to induce specific responses in PML proteins and to modify the structure and function of PML nuclear bodies (NBs) (Jaffray et al., 2023, JCB; Zhu et al., 1997, PNAS). Thus, we used Arsenic as a positive control and observed a significant increase in PML NBs compared with the other conditions (Kruskal-Wallis test), as indicated below. We have therefore added these results to the corresponding Figure S2B and to the text.

      Author response image 3.

      PML condensates were tested after 2 h of incubation. AZD2858 : 100nM ; SN-38 : 300nM ; Arsenic : 6µM. ****: p<0.0001 (Kruskal-Wallis test).

      - The quantification of the flow cytometry data needs to be clarified. I find it strange that in the figures (for example Figure 3A and 3C) representative examples are shown of apparently 3 replicates, and that the percentages shown in these examples are then given in the text as the overall numbers; for example on page 18 "...BrdU incorporation increased from 16.11% (SN38 alone) to 41.83% (combination)...". This type of description is done in multiple places in the Results section and is confusing. It would be clearer if the authors show proper quantifications (mean +/- sem) of the percentages of (the relevant) gated populations. Besides, I don't think it makes a lot of sense to mention in the text the percentages with 2 decimals behind the comma. This suggests a level of precision that does not seem justified in flow cytometry data. Finally, all flow cytometry plots look visually very busy and all the text is crammed in with really small fonts. Cleaning them up and enlarging the fonts of the remaining text/numbers would really improve the readability of the figures.

      Thank you for your helpful comments. We understand your concern regarding the flow cytometry quantification. Indeed, the percentages presented in the figures are derived from representative replicates, and we acknowledge that this presentation could be confusing. To address this, we have included a table summarizing the data from all replicates to improve readability [Table S2 and S3 in the new version]. Second, we specified in the text that the data are representative biological replicates when needed. Third, we have performed statistical analyses on the three replicates when necessary, as shown in Supplementary Figure S5C-F in the new version. The text has been revised to reflect the correct statistical interpretation.

      Regarding the use of two decimals, we are unable to remove them from the plots due to limitations in the software (Kaluza) used for flow cytometry analysis. However, we agree that this level of precision may not be warranted, and we have revised the text where appropriate to reduce confusion.

      - In Figure 5G the authors show that FOLFIRI + AZD2858 are synergistic in two SN-38-resistant cell lines. They conclude that this combination may overcome drug resistance. But I tried to figure out the FOLFIRI concentrations used in these cell lines and they still seem far higher than in the SN-38-sensitive HCT116 cell lines, so I would like to see a bit more nuance in their interpretation. I think overcoming drug resistance is an overstatement, and perhaps alleviating would be a better term.

      Thank you for highlighting this important point; we have adjusted the text accordingly.

      - The legend in Table S2 refers to Figure 5A-B; this should be Figure 4A-B. 

      Thank you, this has been corrected and Table S2 is now moved to Table S4.

      Reviewer #1 (Significance (Required)): 

      The finding that AZD2858 blocks TopBP1 condensate formation via a pleiotropic effect of this compound is interesting and convincing. To my best knowledge it's a novel finding which is interesting to the potential target audience mentioned below. Their findings that inhibition of TopBP1 condensation and ATR signaling via AZD2858 may synergize with FOLFIRI therapy in colorectal cancer cells are still very preliminary, because the effects on non-cancerous cells are not tested.

      Researchers involved in early cancer drug discovery and cell biologists studying DNA damage responses in cancer cells seem to me typical audience interested and influenced by this paper. 

      I'm a cell biologist studying cell cycle fate decisions, and adaptation of cancer cells & stem cells to (drug-induced) stress. My expertise aligns well with the work presented throughout this paper. 

      Reviewer #2 (Evidence, reproducibility and clarity (Required)): 

      The authors have extended their previous research to develop TOPBP1 as a potential drug target for colorectal cancer by inhibiting its condensation. Utilizing an optogenetic approach, they identified the small molecule AZD2858, which inhibits TOPBP1 condensation and works synergistically with first-line chemotherapy to suppress colorectal cancer cell growth. The authors investigated the mechanism and discovered that disrupting TOPBP1 assembly inhibits the ATR/Chk1 signaling pathway, leading to increased DNA damage and apoptosis, even in drug-resistant colorectal cancer cell lines. Addressing the following concerns would enhance clarity and further in vivo work may improve significance: 

      (1) How does the optogenetic method for inducing condensates compare to the DNA damage induction mechanism? 

      Optogenetics provides a versatile and precise approach for controlling the condensation of scaffold proteins in both space and time. This method enables us to study the role of biomolecular condensates with minute-scale resolution, separating their formation from potentially confounding upstream events, such as DNA damage, and providing valuable insights into their specific function. Importantly, based on our previous publications on TopBP1 or SLX4 optogenetic condensates, we have substantial evidence indicating that light-induced condensates closely mimic those formed in response to DNA damage:

      - Functional similarity: Optogenetic condensates recapitulate endogenous condensates formed upon exposure of the cells to DNA damaging agents, and include most known partner proteins involved in the DNA damage response. This was shown for light-induced TopBP1 and SLX4 condensates (1-3).

      - Dynamic reversibility: Optogenetic condensates and DNA damage induced condensates are both dynamic and reversible. They dissolve within 15 minutes of light deactivation or after removal of the damaging agent (1,3).

      - Chromatin association: Both optogenetic and DNA damage-induced condensates are bound to chromatin or localized at sites of DNA damage (3).

      - Regulation: Both types of condensates are regulated similarly, with their formation triggered by the same signaling pathways. ATR basal activity drives the nucleation of opto-TopBP1 condensates and endogenous TopBP1 structures upon light exposure (1). Likewise, sumoylation modifications regulate the formation of opto-SLX4 condensates and endogenous SLX4 condensates (3).

      - Structural similarity: Using super-resolution imaging by stimulated emission depletion (STED) microscopy, we observed that endogenous SLX4 nanocondensates formed globular clusters that were indistinguishable from recombinant light-induced SLX4 condensates (1,3).

      (1) Frattini C, Promonet A, Alghoul E, Vidal-Eychenie S, Lamarque M, Blanchard MP, et al. TopBP1 assembles nuclear condensates to switch on ATR signaling. Molecular Cell. 18 March 2021;81(6):1231-1245.e8.

      (2) Alghoul E, Basbous J, Constantinou A. An optogenetic proximity labeling approach to probe the composition of inducible biomolecular condensates in cultured cells. STAR Protocols. 2021;2(3):100677. 

      (3) Alghoul E, Basbous J, Constantinou A. Compartmentalization of the DNA damage response: Mechanisms and functions. DNA Repair. August 2023;128:103524.

      (2) Why wasn't the initial screen conducted on the HCT116-SN50 resistant cell line? 

      Thank you for raising this important question, which we also considered at the outset of the project. After careful consideration, we decided to use the HCT116 WT cells in order to obtain initial data from an unmodified cell line. It is worth mentioning that HCT116-SN50 cells exhibit slower proliferation compared to WT cells, and they also express an efflux pump capable of pumping out SN38. We were concerned that these factors might interfere with the optogenetic assay, which is why we chose to perform the screen using the WT HCT116 cells.

      (3) The labels in Fig. 1D are difficult to recognize. 

      This issue was also raised by Reviewer #1. We suspect that the PDF conversion may have reduced the resolution of the figures, so we will provide them separately in high resolution. In addition, we have increased the size of some labels to improve their clarity.

      The selected cell image in Fig. 2A for SN-38 seems over-representative; unselected cells appear similar to other groups. Why does AZD2858 itself induce TopBP1 condensates in the plot, yet this is not evident in the images? 

      Thank you for your comment; we have updated the figure with a more representative image. We indeed observe that AZD2858 alone induces a slight increase in TopBP1 condensates. However, this increase did not lead to the activation of the ATR/Chk1 signaling pathway, as shown by the Western blot data presented in Fig. 2B. In addition, AZD2858 specifically prevents the formation of TopBP1 condensates induced by SN38 treatment, and the level of TopBP1 condensates does not return to the basal levels observed in untreated cells, but rather to those observed with AZD2858 treatment. During the 2-hour AZD2858 treatment, the progression of replication forks was unaffected (Fig. 3A and 3B). However, when AZD2858 was added alone to the Xenopus egg extracts, there was increased recruitment of TopBP1 to the chromatin (Fig. 2E). This result suggests that AZD2858 alone can induce the assembly of TopBP1 on chromatin to initiate DNA replication (a well-established role of TopBP1), but the number and concentration of TopBP1 molecules did not reach levels sufficient to activate the ATR/Chk1 pathway.

      (4) In Fig. 3A, despite the drastic change in the FACS plot shape, the quantifications appear quite similar. 

      Thank you for this insightful observation. The gates for the S phase were intentionally set wider to avoid biasing the results and inadvertently excluding the population that incorporates BrdU weakly (but still incorporates it) in the SN-38 only condition. As a result, the percentage of cells within this gate remains similar, even though the overall shape of the FACS plot changes, reflecting a shift in the distribution of BrdU incorporation. This point has now been clarified in the legend of Figure 3A.

      This effect can also be attributed to the relatively short treatment time (2 hours), which captures early changes in DNA synthesis. The effect becomes more pronounced at later time points, as shown in Figure 3C. For example, after 6 hours of treatment, the percentage of BrdU-positive cells increases from 15% with SN-38 alone to 41% with the AZD2858 combination, demonstrating a clearer impact on DNA synthesis. A graph summarizing the statistical analysis has been added to Figure S5C for the 6-hour time point and Figure S5D for the 12-hour time point, based on data from three independent biological replicates.
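      The gating argument above can be illustrated with a small numerical sketch. All intensities and the gate boundaries below are made-up values chosen for illustration, not data from the study, and `gate_summary` is a hypothetical helper. The sketch only shows how a wide gate can contain the same fraction of cells while the distribution inside the gate shifts.

```python
from statistics import median

# Hypothetical wide S-phase gate on BrdU intensity (arbitrary units).
GATE = (100, 5000)


def gate_summary(intensities, gate=GATE):
    """Return (percent of cells inside the gate, median intensity of gated cells)."""
    low, high = gate
    gated = [v for v in intensities if low <= v <= high]
    percent = 100.0 * len(gated) / len(intensities)
    return percent, median(gated)


# Made-up populations: four BrdU-negative cells and six BrdU-positive cells each.
untreated = [20, 30, 40, 50, 1500, 1800, 2000, 2200, 2500, 3000]
sn38_only = [20, 30, 40, 50, 120, 150, 180, 200, 250, 300]

untreated_percent, untreated_median = gate_summary(untreated)
sn38_percent, sn38_median = gate_summary(sn38_only)

print(untreated_percent, untreated_median)
print(sn38_percent, sn38_median)
```

      In this toy case both populations place the same percentage of cells in the gate, yet the median intensity of the gated cells differs by roughly an order of magnitude, which is why a summary of the distribution can be more informative than the gate percentage alone.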

      (5) The results section is imbalanced; Figs. 5 and 6 could be combined into one figure. 

      We have combined Figures 5 and 6 into a single figure to optimize the presentation of results. To avoid overloading the new figure, some of the data have been moved to supplementary figures, ensuring the main figure remains clear and focused.

      (6) An in vivo study is anticipated to assess the drug's efficacy. 

      Although AZD2858 was developed a few years ago, there is a limited amount of in vivo data available, which led us to consider potential issues related to the drug's biodistribution or its pharmacokinetics (PK). Despite these concerns, we proceeded with preliminary in vivo studies, testing various diluents and injection routes for AZD2858. However, we observed that the compound was not effective in vivo. Given the strong synergistic effects observed in vitro, we concluded that AZD2858 was likely not being distributed properly in the mice. As a result, we have decided to conduct a more detailed investigation into the pharmacokinetics (PK), pharmacodynamics (PD), and absorption, distribution, metabolism, and excretion (ADME) of AZD2858 to better understand its in vivo behavior and efficacy. Therefore, the in vivo evaluation of AZD2858 will be addressed in a separate study specifically focused on this aspect.

      Reviewer #2 (Significance (Required)): 

      Addressing the stated concerns would enhance clarity and further in vivo work may improve significance. 

      Reviewer #3 (Evidence, reproducibility and clarity (Required)): 

      Summary 

      In 2021 (PMID: 33503405) and 2024 (PMID: 38578830) Constantinou and colleagues published two elegant papers in which they demonstrated that the Topbp1 checkpoint adaptor protein could assemble into mesoscale phase-separated condensates that were essential to amplify activation of the PIKK, ATR, and its downstream effector kinase, Chk1, during DNA damage signalling. A key tool that made these studies possible was the use of a chimeric Topbp1 protein bearing a cryptochrome domain, Cry2, which triggered condensation of the chimeric Topbp1 protein, and thus activation of ATR and Chk1, in response to irradiation with blue light without the myriad complications associated with actually exposing cells to DNA damage. 

      In this current report Morano and co-workers utilise the same optogenetic Topbp1 system to investigate a different question, namely whether Topbp1 phase-condensation can be inhibited pharmacologically to manipulate downstream ATR-Chk1 signalling. This is of interest, as the therapeutic potential of the ATR-Chk1 pathway is an area of active investigation, albeit generally using more conventional kinase inhibitor approaches. 

      The starting point is a high throughput screen of 4730 existing or candidate small molecule anticancer drugs for compounds capable of inhibiting the condensation of the Topbp1-Cry2mCherry reporter molecule in vivo. A surprisingly large number of putative hits (>300) were recorded, from which 131 of the most potent were selected for secondary screening using activation of Chk1 in response to DNA damage induced by SN-38, a topoisomerase inhibitor, as a surrogate marker for Topbp1 condensation. From this the 10 most potent compounds were tested for interactions with a clinically used combination of SN-38 and 5-FU (FOLFIRI) in terms of cytotoxicity in HCT116 cells. The compound that synergised most potently with FOLFIRI, the GSK3-beta inhibitor drug AZD2858, was selected for all subsequent experiments. 

      AZD2858 is shown to suppress the formation of Topbp1 (endogenous) condensates in cells exposed to SN-38, and to inhibit activation of Chk1 without interfering with activation of ATM or other endpoints of damage signalling such as formation of gamma-H2AX or activation of Chk2 (generally considered to be downstream of ATM). AZD2858 therefore seems to selectively inhibit the Topbp1-ATR-Chk1 pathway without interfering with parallel branches of the DNA damage signalling system, consistent with Topbp1 condensation being the primary target. Importantly, neither siRNA depletion of GSK3-beta, or other GSK3-beta inhibitors were able to recapitulate this effect, suggesting it was a specific non-canonical effect of AZD2858 and not a consequence of GSK3-beta inhibition per se. 

      To understand the basis for synergism between AZD2858 and SN-38 in terms of cell killing, the effect of AZD2858 on the replication checkpoint was assessed. This is a response, mediated via ATR-Chk1, that modulates replication origin firing and fork progression in S-phase cell under conditions of DNA damage or when replication is impeded. SN-38 treatment of HCT116 cells markedly suppresses DNA replication, however this was partially reversed by co-treatment with AZD2858, consistent with the failure to activate ATR-Chk1 conferring a defect in replication checkpoint function. 

      Figures 4 and 5 demonstrate that AZD2858 can markedly enhance the cytotoxic and cytostatic effects of SN-38 and FOLFIRI through a combination of increased apoptosis and growth arrest according to dosage and treatment conditions. Figure 6 extends this analysis to cells cultured as spheroids, sometimes considered to better represent tumor responses compared to single cell cultures. 

      Major comments 

      Most of the data presented is of good technical quality and supports the conclusions drawn. There are however a small number of instances where this is not true; ie where the data are of insufficient technical quality, or where the description or interpretation of the results is at variance with the data which is presented. Some examples: 

      (1) Fig.2E - the claim that "we observed an increase in RPA, Topb1 and Pol-epsilon levels when CPT and AZD2858 were added together" do not seem to be justified by the data provided. It is also unclear what the purpose/ significance of this experiment is. 

      Thank you for pointing out the contradiction in Figure 2E. Upon review, we identified an error in the labeling of conditions (CPT and AZD2858 were inadvertently swapped). The corrected figure now clearly shows that, at the 60-minute timepoint after starting replication, the combination of CPT and AZD2858 results in a greater accumulation of TopBP1, Pol ε, and RPA on chromatin compared to CPT alone. We have revised the sentence to: "Our data demonstrate that combining CPT and AZD2858 earlier enhances the accumulation of replication-related factors (RPA, TopBP1, and Pol ε) on chromatin compared to CPT treatment alone, particularly visible at the 60-minute timepoint after starting replication."

      The significance of this experiment lies in its connection to the earlier observation that AZD2858 restores BrdU incorporation when combined with SN-38, as shown in flow cytometry data (Figure 3A). At a molecular level, this was further supported by DNA fiber assays, which revealed that replication tracks (CldU tracts) were longer in the combination treatment compared to SN-38 alone (Figure 3B).

      To strengthen and validate these findings, we chose to employ the Xenopus egg extract system for several reasons. This model provides a highly controlled environment where DNA replication occurs without confounding effects from transcription or translation. Moreover, replication is limited to a single round, offering a unique opportunity to specifically interrogate replication mechanisms. These attributes make the Xenopus model an ideal system to confirm that AZD2858 facilitates replication recovery in the presence of replication stress induced by agents like CPT. With longer treatment, this leads to the accumulation of DNA damage and apoptosis (Figure 3D-E and Figure 4A-D).

      (2) Figs. 3 A and C certainly show that the SN-38-mediated suppression of DNA synthesis is modified and partially alleviated by co-treatment with AZD2858. The statement however that "prolonged co-incubation with AZD2858 for 6 and 12 hours effectively abolished the SN-38 induced S-phase checkpoint" is clearly misleading. If this were true, then the BrdU incorporation profiles of the respective samples would be similar or identical to control, which clearly they are not. Clearly AZD2858 is affecting the imposition of the S-phase checkpoint in some way, but not "abolishing" it. 

      We appreciate the reviewer’s detailed observations regarding Figures 3A and 3C and the phrasing in our manuscript. We agree that the term "abolished" is not precise in describing the effects of AZD2858 on the SN-38-induced S-phase checkpoint.

      To clarify: our data indicate that co-treatment with AZD2858 modifies and partially alleviates the SN-38-induced suppression of DNA synthesis, as demonstrated by increased BrdU incorporation relative to SN-38 treatment alone. However, as the reviewer correctly points out, the BrdU incorporation profiles of the co-treated samples do not fully return to the levels of untreated control cells. This suggests that while AZD2858 significantly mitigates the S-phase checkpoint, it does not completely abolish it.

      We have revised the statement in the manuscript to better reflect these findings, as follows: "Prolonged co-incubation with AZD2858 for 6 and 12 hours significantly alleviated the SN-38-induced S-phase checkpoint, as evidenced by the partially increased BrdU incorporation. However, the population of co-treated cells is heterogeneous: some cells exhibit BrdU incorporation levels similar to those of untreated control cells, while others incorporate BrdU at levels comparable to cells treated with SN-38 alone. This indicates that AZD2858 does not fully restore DNA synthesis to control levels across the entire cell population."

      This revised phrasing aligns with the data presented and acknowledges the partial recovery of DNA synthesis observed. Thank you for bringing this to our attention and helping us improve the accuracy of our conclusions.

      (3) Fig. 3 E. The western blots of pDNA-PKcs (S2056) and total DNA-PKcs are really not interpretable. It is possible to sympathise that these reagents are probably extremely difficult to work with and obtain clear results, however uninterpretable results are not acceptable. 

      We agree that the data presented in Fig. 3E are difficult to interpret. As noted by Reviewer 1, we recognize the challenge of obtaining clear and reliable results with these specific reagents. Based on this feedback, and to ensure the robustness of our conclusions, we have decided to exclude these specific blots from the revised manuscript.

      We believe that this adjustment will enhance the clarity and reliability of the manuscript while focusing on the other, more interpretable data presented. Thank you for pointing this out, and we appreciate your understanding.

      (4) Fig. 3D. This is a puzzling image. Described as a PFGE assay, it presumably depicts an agarose gel, with intact genomic DNA at the top and a discrete band below representing fragmented genomic DNA. This is a little surprising, as fragmented genomic DNA does not usually appear as a specific band but as a heterogenous population or "smear". Nevertheless, even if one accepts this premise, it is unclear what is meant by "DSBs remained elevated after the combined treatment" when the intensity of this band is equivalent for both SN-38 and SN-38 + AZD2858 treatments. 

      We thank the reviewer for their insightful comments regarding the PFGE results in Figure 3D. We agree that the appearance of a discrete band, rather than a heterogeneous smear, is atypical for fragmented genomic DNA in this assay. However, by enhancing the signal intensity (as shown below), the expected smear becomes more appreciable.

      Author response image 4.

      Regarding the interpretation of the band intensities, we agree that the signals for SN-38 and SN-38 + AZD2858 appear similar under these specific conditions. At the relatively high concentration of SN-38 used in this experiment (300 nM), it is indeed challenging to observe a more pronounced effect on DNA breaks. This is why we wrote that "DSBs remained elevated after the combined treatment": the band intensity for cells treated with SN-38 alone is comparable to that for cells treated with SN-38 combined with AZD2858. However, we note a slightly more intense γH2AX signal over time when AZD2858 is combined with SN-38 compared to SN-38 alone (Figure 3E). Furthermore, under lower, sub-optimal doses of SN-38 and over an extended treatment (48 h), we observe a clearer increase in fragmented DNA bands, as demonstrated in Figure 4D.

      Minor comments 

      (1) Fig. 1. A surprisingly large number of compounds scored positive in the primary screen for inhibition of Topbp1 condensation (>300). Of the 131 of these selected for secondary screening using Chk1 activation (S345 phosphorylation) as a readout approximately 2/3 were negative, implying that a majority of the tested compounds inhibited Topbp1 condensation but not Chk1 activation. What could explain that?

      Thank you for this thoughtful comment. The discrepancy between the large number of compounds scoring positive for TopBP1 condensation inhibition and the smaller number inhibiting Chk1 activation (S345 phosphorylation) could be attributed to several factors:

      • Different cell lines and induction methods: The initial screen was conducted in HEK293 TrexFlpIn cells overexpressing opto-TopBP1, while the secondary screen used HCT116 cells. In addition, the methods used to induce the respective pathways were distinct: in the primary screen, we employed a blue light induction of opto-TopBP1 condensates, whereas in the secondary screen, we used an SN-38 treatment to induce DNA replication stress and activate the Chk1 pathway. These differences could account for the varying responses observed in the two screens.

      • The compounds that inhibited TopBP1 condensation might not fully block Chk1 activation. While they disrupt TopBP1 condensation, they may still allow for partial activation of Chk1 or Chk1 activation through alternative mechanisms. For instance, Chk1 activation could be mediated by other signaling pathways or molecules, such as ETAA1, a known Chk1 activator (1). Thus, TopBP1 condensation inhibition does not necessarily translate to complete inhibition of Chk1 activation, especially if ETAA1 is employed by cells as a rescue activator.

      • Some compounds may affect chromosome dynamics, potentially generating mechanical forces or torsional stress that could activate the ATR/Chk1 pathway independently of TopBP1 (2).

      These factors suggest that while the compounds effectively disrupt TopBP1 condensation, they may not always fully inhibit the downstream Chk1 activation, pointing to the complexity of the DNA damage response pathways. 

      (1) Bass, T. E. et al. ETAA1 acts at stalled replication forks to maintain genome integrity. Nat Cell Biol 18, 1185–1195 (2016).

      (2) Kumar, A. et al. ATR Mediates a Checkpoint at the Nuclear Envelope in Response to Mechanical Stress. Cell 158, 633–646 (2014).

      (2) Fig. 2D. The protein-protein interaction assay shown demonstrates that AZD2858 ablates the light-induced auto-interaction between exogenous opto-Topbp1 molecules and ATR plus or minus SN-38, but clearly endogenous Topbp1 molecules do not participate. Why is this? 

      The biotin proximity labeling assay was conducted without exposing cells to light, using a TurboID module fused to TopBP1-mCherry-CRY2. Stable cell lines were then generated in HEK293 TrexFlpIn cells, where endogenous TopBP1 is still expressed. Upon adding doxycycline, the recombinant TurboID-TopBP1-mCherry-Cry2 (opto-TopBP1) is induced at levels comparable to endogenous TopBP1 (Fig 2D).

      Since the opto-TopBP1 construct exhibits behavior similar to that of endogenous TopBP1 (1), we used it to investigate whether TopBP1 self-assembly and its interaction with ATR are influenced by AZD2858 alone or in combination with SN38. Our results show that treatment with SN38 increases the proximity between opto-TopBP1 and the endogenous TopBP1 (not fused to TurboID). However, AZD2858, either alone or in combination with SN38, disrupts the self-assembly of recombinant TopBP1 with itself as well as its interaction with endogenous TopBP1.

      (1) Frattini C, Promonet A, Alghoul E, Vidal-Eychenie S, Lamarque M, Blanchard MP, et al. TopBP1 assembles nuclear condensates to switch on ATR signaling. Molecular Cell. 18 March 2021;81(6):1231-1245.e8.

      Reviewer #3 (Significance (Required)): 

      Significance 

      Liquid phase separation of protein complexes is increasingly recognised as a fundamental mechanism in signal transduction and other cellular processes. One recent and important example was that of Topbp1, whose condensation in response to DNA damage is required for efficient activation of the ATR-Chk1 pathway. The current study asks a related but distinct question; can protein condensation be targeted by drugs to manipulate signalling pathways which in the main rely on protein kinase cascades? 

      Here, the authors identify an inhibitor of GSK3-beta as a novel inhibitor of DNA damage-induced Topbp1 condensation and thus of ATR-Chk1 signalling. 

      This work will be of interest to researchers in the fields of DNA damage signalling, biophysics of protein condensation, and cancer chemotherapy.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      In this paper by Brickwedde et al., the authors observe an increase in posterior alpha when anticipating auditory as opposed to visual targets. The authors also observe an enhancement in both visual and auditory steady-state sensory evoked potentials in anticipation of auditory targets, in correlation with enhanced occipital alpha. The authors conclude that alpha does not reflect inhibition of early sensory processing, but rather orchestrates signal transmission to later stages of the sensory processing stream. However, there are several major concerns that need to be addressed in order to draw this conclusion.

      First, I am not convinced that the frequency tagging method and the associated analyses are adequate for dissociating visual vs auditory steady-state sensory evoked potentials.

      Second, if the authors want to propose a general revision for the function of alpha, it would be important to show that alpha effects in the visual cortex for visual perception are analogous to alpha effects in the auditory cortex for auditory perception.

      Third, the authors propose an alternative function for alpha - that alpha orchestrates signal transmission to later stages of the sensory processing stream. However, the supporting evidence for this alternative function is lacking. I will elaborate on these major concerns below.

      (1) Potential bleed-over across frequencies in the spectral domain is a major concern for all of the results in this paper. The fact that alpha power, 36Hz and 40Hz frequency-tagged amplitude and 4Hz intermodulation frequency power is generally correlated with one another amplifies this concern. The authors are attaching specific meaning to each of these frequencies, but perhaps there is simply a broadband increase in neural activity when anticipating an auditory target compared to a visual target?

      We appreciate the reviewer’s insightful comment regarding the potential bleed-over across frequencies in the spectral domain. We fully acknowledge that the trade-off between temporal and frequency resolution is a challenge, particularly given the proximity of the frequencies we are examining.

      To address this concern, we performed additional analyses to investigate whether there is indeed a broadband increase in neural activity when anticipating an auditory target as compared to a visual target, as opposed to distinct frequency-specific effects. Our results show that the bleed-over between frequencies is minimal and does not significantly affect our findings. Specifically, we repeated the analyses using the same filter and processing steps for the 44 Hz frequency. At this frequency, we did not observe any significant differences between conditions.

      These findings suggest that the effects we report are indeed specific to the 40 Hz frequency band and not due to a general broadband increase in neural activity. We hope this addresses the reviewer’s concern and strengthens the validity of our frequency-specific results.

      Author response image 1.

      Illustration of bleed-over effects over a span of 4 Hz. A, 40 Hz frequency-tagging data over the significant cluster differing between expecting an auditory versus a visual target (identical to Fig. 9 in the manuscript). B, 44 Hz signal over the same cluster chosen for A. The analysis was identical to the analysis performed in A, apart from the frequency of the band-pass filter.

      We do not, however, specifically argue against the possibility of a broadband increase when anticipating an auditory compared to a visual target. But even a broadband increase would directly contradict the alpha inhibition hypothesis, which posits that an increase in alpha completely disengages the whole cortex. We will clarify this point in the revised manuscript.

      (2) Moreover, 36Hz visual and 40Hz auditory signals are expected to be filtered in the neocortex. Applying standard filters and Hilbert transform to estimate sensory evoked potentials appears to rely on huge assumptions that are not fully substantiated in this paper. In Figure 4, 36Hz "visual" and 40Hz "auditory" signals seem largely indistinguishable from one another, suggesting that the analysis failed to fully demix these signals.

      We appreciate the reviewer’s insightful concern regarding the filtering and demixing of the 36 Hz visual and 40 Hz auditory signals, and we share the same reservations about the reliance on standard filters and the Hilbert transform method.

      To address this, we would like to draw attention to Author response image 1, which demonstrates that a 4 Hz difference is sufficient to effectively demix the signals using our chosen filtering and Hilbert transform approach. We believe that the reason the 36 Hz visual and 40 Hz auditory signals show similar topographies lies not in incomplete demixing but rather in the possibility that this condition difference reflects sensory integration, rather than signal contamination.

      This interpretation is further supported by our findings with the intermodulation frequency at 4 Hz, which also suggests cross-modal integration. Furthermore, source localization analysis revealed that the strongest condition differences were observed in the precuneus, an area frequently associated with sensory integration processes. We will expand on this in the discussion section to better clarify this point.
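      The trade-off between temporal and frequency resolution behind this demixing argument can be sketched on synthetic data. The sketch below is not the analysis pipeline described above (which uses a band-pass filter and the Hilbert transform); it swaps in a plain windowed Fourier projection because that makes the resolution limit explicit: a 250 ms window gives 1 / 0.25 s = 4 Hz resolution, which is exactly the spacing between 36, 40 and 44 Hz. The sampling rate, the amplitudes and all function names are assumptions made for illustration.

```python
import cmath
import math

FS = 1000        # assumed sampling rate in Hz (not taken from the study)
WINDOW_S = 0.25  # 250 ms window, i.e. 4 Hz frequency resolution


def synth(duration_s, components):
    """Sum of sinusoids; components is a list of (frequency_hz, amplitude)."""
    n = int(round(duration_s * FS))
    return [
        sum(a * math.sin(2 * math.pi * f * k / FS) for f, a in components)
        for k in range(n)
    ]


def amplitude_at(segment, freq_hz):
    """Amplitude at freq_hz, obtained by projecting onto a complex exponential."""
    acc = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * k / FS)
        for k, x in enumerate(segment)
    )
    return 2 * abs(acc) / len(segment)


def windowed_amplitude(signal, freq_hz):
    """Amplitude in consecutive non-overlapping windows (a coarse time course)."""
    step = int(round(WINDOW_S * FS))
    return [
        amplitude_at(signal[start:start + step], freq_hz)
        for start in range(0, len(signal) - step + 1, step)
    ]


# Synthetic 36 Hz and 40 Hz tags with made-up amplitudes; nothing at 44 Hz.
signal = synth(1.0, [(36, 1.0), (40, 0.5)])
amp_36 = windowed_amplitude(signal, 36)
amp_40 = windowed_amplitude(signal, 40)
amp_44 = windowed_amplitude(signal, 44)  # control frequency without any signal

print(amp_36, amp_40, amp_44)
```

      With these settings each of the three frequencies completes a whole number of cycles per window, so the 36 Hz and 40 Hz amplitudes are recovered separately and the 44 Hz control stays at zero up to floating-point error. With a shorter window the frequencies no longer complete whole cycles and leak into one another, which is the bleed-over the control analysis is meant to rule out.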

      (3) The asymmetric results in the visual and auditory modalities preclude a modality-general conclusion about the function of alpha. However, much of the language seems to generalize across sensory modalities (e.g., use of the term 'sensory' rather than 'visual').

      We thank the reviewer for pointing this out and agree that in some cases we have not distinguished clearly enough between ‘visual’ and ‘sensory’. We will make sure that, when using ‘sensory’, we either describe overall theories that are not exclusive to vision or refer to the possibility of a broad sensory increase. However, when directly discussing our results and the interpretation thereof, we will now use ‘visual’ in the revised manuscript.

      (4) In this vein, some of the conclusions would be far more convincing if there was at least a trend towards symmetry in source-localized analyses of MEG signals. For example, how does alpha power in the primary auditory cortex (A1) compare when anticipating auditory vs visual target? What do the frequency-tagged visual and auditory responses look like when just looking at the primary visual cortex (V1) or A1?

      We thank the reviewer for this important suggestion and have added a virtual channel analysis. We were, however, not interested in alpha power in the primary auditory cortex; we were specifically interested in posterior alpha, which is usually increased when expecting an auditory compared to a visual target (and used to be interpreted as a blanket inhibition of the visual cortex). We will improve the clarity of this point in the manuscript.

      We have, however, followed the reviewer’s suggestion of a virtual channel analysis, which shows that the condition differences are not observable in the primary visual cortex for the 36 Hz visual signal or in the primary auditory cortex for the 40 Hz auditory signal. Our data clearly show that there is an alpha condition difference in V1, while there is no condition difference for 36 Hz in V1 or for 40 Hz in Heschl’s gyrus (see Author response image 2).

      Author response image 2.

      Virtual channels for V1 and Heschl’s gyrus. A, alpha power for the virtual channel created in V1 (Calcarine_L and Calcarine_R from AAL atlas; Tzourio-Mazoyer et al., 2002, NeuroImage). A cluster permutation analysis over time (between -2 and 0 s) revealed a significant condition difference between ~ -2 and -1.7 s (p = 0.0449). B, 36 Hz frequency-tagging signal for the virtual channel created in V1 (equivalent to the procedure in A). The same cluster permutation as performed in A revealed no significant condition differences. C, 40 Hz frequency-tagging signal for the virtual channel created in Heschl’s gyrus (Heschl_L and Heschl_R from AAL atlas; Tzourio-Mazoyer et al., 2002, NeuroImage). The same cluster permutation as performed in A revealed no significant condition differences.
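      For readers unfamiliar with the cluster permutation analysis over time referred to in this caption, the general idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the figure: it assumes paired per-subject condition differences, a fixed t-threshold of 2.0, cluster mass (summed absolute t-values) as the cluster statistic, and random sign flips as the permutation scheme. All function names and the synthetic data are hypothetical.

```python
import math
import random


def paired_t(diffs):
    """One-sample t-value of per-subject condition differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        return math.copysign(math.inf, mean) if mean != 0 else 0.0
    return mean / math.sqrt(var / n)


def max_cluster_mass(tvals, threshold):
    """Largest summed |t| over runs of adjacent same-sign supra-threshold points."""
    best, run, run_sign = 0.0, 0.0, 0
    for t in tvals:
        sign = (t > threshold) - (t < -threshold)
        if sign != 0 and sign == run_sign:
            run += abs(t)              # extend the current cluster
        elif sign != 0:
            run, run_sign = abs(t), sign  # start a new cluster
        else:
            run, run_sign = 0.0, 0     # below threshold: no cluster
        best = max(best, run)
    return best


def t_over_time(diff_matrix):
    """t-value at every time point; diff_matrix is subjects x time points."""
    n_times = len(diff_matrix[0])
    return [paired_t([subj[t] for subj in diff_matrix]) for t in range(n_times)]


def cluster_permutation_p(diff_matrix, threshold=2.0, n_perm=500, seed=0):
    """Observed maximum cluster mass and its sign-flip permutation p-value."""
    rng = random.Random(seed)
    observed = max_cluster_mass(t_over_time(diff_matrix), threshold)
    exceed = 0
    for _ in range(n_perm):
        # Flip the sign of each subject's whole time course at random.
        flipped = []
        for subj in diff_matrix:
            s = rng.choice((-1.0, 1.0))
            flipped.append([s * v for v in subj])
        if max_cluster_mass(t_over_time(flipped), threshold) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Synthetic differences: 20 subjects, 40 time points, effect at points 5 to 14.
data_rng = random.Random(1)
diffs = [
    [data_rng.gauss(1.0 if 5 <= t < 15 else 0.0, 1.0) for t in range(40)]
    for _ in range(20)
]
observed_mass, p_value = cluster_permutation_p(diffs)
print(observed_mass, p_value)
```

      Because only the largest cluster in each permutation enters the null distribution, the test controls for multiple comparisons across time points without testing each point separately.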

      (5) Blinking would have a huge impact on the subject's ability to ignore the visual distractor. The best thing to do would be to exclude from analysis all trials where the subjects blinked during the cue-to-target interval. The authors mention that in the MEG experiment, "To remove blinks, trials with very large eye-movements (> 10 degrees of visual angle) were removed from the data (See supplement Fig. 5)." This sentence needs to be clarified since eye-movements cannot be measured during blinking. In addition, it seems possible to remove putative blink trials from EEG experiments as well, since blinks can be detected in the EEG signals.

      We thank the reviewer for pointing out that our description of this point was confusing. From the MEG data, we removed eyeblinks using ICA. Only for the analysis in supplementary Fig. 5 did we use the eye-tracking data, to confirm that participants were in fact fixating the centre of the screen. For this analysis, we removed trials with blinks (which appear in the eye-tracker data as very high-amplitude movements, i.e. large eye movements in degrees of visual angle; Author response image 3 below shows a blink in the MEG data and the corresponding eye-tracker data in degrees of visual angle). We will clarify this in the methods section.

      As for the concern that participants might have closed their eyes to ignore the visual distractors, in both experiments we observe a highly significant distractor cost in accuracy for visual distractors, which we hope will convince the reviewer that our visual distractors were working as intended.

      Author response image 3.

      Illustration of eye-tracker data for a trial without and a trial with a blink. All data points recorded during each trial are plotted. A, ICA component 1, which reflects blinks, and its corresponding data trace in a trial. No blink is visible. B, eye-tracker data transformed into degrees of visual angle for the trial depicted in A. C, ICA component 1, which reflects blinks, and its corresponding data trace in a trial. A clear blink is visible. D, eye-tracker data transformed into degrees of visual angle for the trial depicted in C.

      (6) It would be interesting to examine the neutral cue trials in this task. For example, comparing auditory vs visual vs neutral cue conditions would be indicative of whether alpha was actively recruited or actively suppressed. In addition, comparing spectral activity during cue-to-target period on neutral-cue auditory correct vs incorrect trials should mimic the comparison of auditory-cue vs visual-cue trials. Likewise, neutral-cue visual correct vs incorrect trials should mimic the attention-related differences in visual-cue vs auditory-cue trials.

      We thank the reviewer for this suggestion. We have analysed the neutral cue trials in the EEG dataset (see suppl. Fig. 1) and will expand this figure to show all conditions. There were no significant differences from auditory or visual cues, but descriptively alpha power was higher for neutral cues than for visual cues and lower for neutral cues than for auditory cues. While this may suggest that alpha is actively suppressed for visual trials and actively recruited for auditory trials, we do not feel comfortable making this claim, as the neutral condition may not reflect a completely neutral state. The neutral task can still be difficult, especially because of the uncertainty about the target modality.

      As for the analysis of incorrect versus correct trials, we love the idea, but unfortunately the accuracy rate was quite high so that the number of incorrect trials would not be sufficient to perform a reliable analysis.

      (7) In the abstract, the authors state that "This implies that alpha modulation does not solely regulate 'gain control' in early sensory areas but rather orchestrates signal transmission to later stages of the processing stream." However, I don't see any supporting evidence for the latter claim, that alpha orchestrates signal transmission to later stages of the processing stream. If the authors are claiming an alternative function to alpha, this claim should be strongly substantiated.

      We thank the reviewer for pointing out that we have not sufficiently explained our case. The first point refers to gain control akin to the alpha inhibition hypothesis, which claims that increases in alpha disengage a whole cortical area. Since source analysis confirmed that the alpha increase in our data originates from primary visual cortex, this should lead to decreased visual processing. The increase in 36 Hz visual processing therefore directly contradicts the alpha inhibition hypothesis. We propose an alternative explanation for the functionality of alpha activity in this task. Through pulsed inhibition, information packages of relevant visual information could be transmitted down the processing stream, thereby enhancing relevant visual signal transmission. We believe that our claim is supported by the fact that the enhanced visual 36 Hz signal correlated with visual alpha power on a trial-by-trial basis and originated not from primary visual cortex but from areas known for sensory integration.

      We will make this point clearer in our revised manuscript.

      Reviewer #2 (Public review):

      Brickwedde et al. investigate the role of alpha oscillations in allocating intermodal attention. A first EEG study is followed up with a MEG study that largely replicates the pattern of results (with small to be expected differences). They conclude that a brief increase in the amplitude of auditory and visual stimulus-driven continuous (steady-state) brain responses prior to the presentation of an auditory - but not visual - target speaks to the modulating role of alpha that leads them to revise a prevalent model of gating-by-inhibition.

      Overall, this is an interesting study on a timely question, conducted with methods and analysis that are state-of-the-art. I am particularly impressed by the author's decision to replicate the earlier EEG experiment in MEG following the reviewer's comments on the original submission. Evidently, great care was taken to accommodate the reviewer's suggestions.

      We thank the reviewer for the positive feedback and expression of interest in the topic of our manuscript.

      Nevertheless, I am struggling with the report for two main reasons: It is difficult to follow the rationale of the study, due to structural issues with the narrative and missing information or justifications for design and analysis decisions, and I am not convinced that the evidence is strong, or even relevant enough for revising the mentioned alpha inhibition theory. Both points are detailed further below.

      We thank the reviewer for raising this important point. We will revise our introduction and results in line with the reviewer’s suggestions, hoping that our rationale will then be easier to follow and that our evidence will be more convincing.

      Strength/relevance of evidence for model revision: The main argument rests on 1) a rather sustained alpha effect following the modality cue, 2) a rather transient effect on steady-state responses just before the expected presentation of a stimulus, and 3) a correlation between those two. Wouldn't the authors expect a sustained effect on sensory processing, as measured by steady-state amplitude irrespective of which of the scenarios described in Figure 1A (original vs revised alpha inhibition theory) applies? Also, doesn't this speak to the role of expectation effects due to consistent stimulus timing? An alternative explanation for the results may look like this: Modality-general increased steady-state responses prior to the expected audio stimulus onset are due to increased attention/vigilance. This effect may be exclusive (or more pronounced) in the attend-audio condition due to higher precision in temporal processing in the auditory sense or, vice versa, too smeared in time due to the inferior temporal resolution of visual processing for the attend-vision condition to be picked up consistently. As expectation effects will build up over the course of the experiment, i.e., while the participant is learning about the consistent stimulus timing, the correlation with alpha power may then be explained by a similar but potentially unrelated increase in alpha power over time.

      We thank the reviewer for raising these insightful questions and suggestions.

      It is true that our argument rests on a rather sustained alpha effect and a rather transient effect on steady-state responses and a correlation between the two. However, this connection would not be expected under the alpha inhibition hypothesis, which states that alpha activity would inhibit a whole cortical area (when irrelevant to the task), exerting “gain control”. This notion directly contradicts our results of the “irrelevant” visual information a) being transmitted at all and b) increasing.

      However, it has been shown on many occasions that alpha activity exerts pulsed inhibition, so we proposed an alternative theory of an involvement in signal transmission. In this case, the cyclic inhibition would serve as an ordering system, which only allows high-priority information to pass, resulting in a higher signal-to-noise ratio. We do not make a claim about how fast or when these signals are transmitted in relation to alpha power. For instance, it could be that alpha power increases as a preparatory state even before the signal is actually transmitted. Zhigalov (2020, Hum. Brain Mapp.) has shown that in V1, frequency-tagging responses were up- and down-regulated with attention, independent of alpha activity.

      But we do believe that the fact that visual alpha power correlates on a trial-by-trial level with visual 36 Hz frequency-tagging increases (a relationship which has not been found in V1; see Zhigalov 2020, Hum. Brain Mapp.) suggests a strong connection. Furthermore, the fact that the alpha modulation originates from early visual areas and occurs prior to any frequency-tagging changes, while the increase in frequency-tagging can be observed in areas which are later in the processing stream (such as the precuneus), is strongly indicative of an involvement of alpha power in the transmission of this signal. We cannot fully exclude alternative accounts and mechanisms which affect both alpha power and frequency-tagging responses.

      We do believe that the alternative account described by the reviewer does not contradict our theory, as we do believe that the alpha power modulation may reflect an expectation effect (and the idea that it could be related to the resolution of auditory versus visual processing is very interesting!). It is also possible that this expectation is, as the reviewer suggests, related to attention/vigilance and might result in a modality-general signal increase. And indeed, we can observe an increase in the frequency-tagging response in sensory integration areas. Accordingly, we believe that the alternative explanation provided by the reviewer contradicts the alpha inhibition hypothesis, but not necessarily our alternative theory.

      We will revise the discussion, which we hope will make our case stronger and easier to follow. Additionally, we will mention the possibility of alternative explanations as well as the possibility that alpha networks fulfil different roles in different locations/task environments.

      Structural issues with the narrative and missing information: Here, I am mostly concerned with how this makes the research difficult to access for the reader. I list the major points below:

      In the introduction the authors pit the original idea about alpha's role in gating against some recent contradictory results. If it's the aim of the study to provide evidence for either/or, predictions for the results from each perspective are missing. Also, it remains unclear how this relates to the distinction between original vs revised alpha inhibition theory (Fig. 1A). Relatedly if this revision is an outcome rather than a postulation for this study, it shouldn't be featured in the first figure.

      We agree with the reviewer that we have not sufficiently clarified our goal as well as how different functionalities of alpha oscillations would lead to different outcomes. We will revise the introduction and restructure the results and hope that it will be easier to follow.

      The analysis of the intermodulation frequency makes a surprise entrance at the end of the Results section without an introduction as to its relevance for the study. This is provided only in the discussion, but with reference to multisensory integration, whereas the main focus of the study is focussed attention on one sense. (Relatedly, the reference to "theta oscillations" in this sections seems unclear without a reference to the overlapping frequency range, and potentially more explanation.) Overall, if there's no immediate relevance to this analysis, I would suggest removing it.

      We thank the reviewer for pointing this out and will add information about this frequency to the introduction part. We believe that the intermodulation frequency analysis is important, as it potentially supports the notion that condition differences in the visual-frequency tagging response are related to downstream processing rather than overall visual information processing in V1. We would therefore prefer to leave this analysis in the manuscript.
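For orientation, the arithmetic behind an intermodulation frequency can be sketched as follows. This is an illustration only: it uses the two tagging frequencies named in this response (36 Hz visual, 40 Hz auditory), the helper name `intermodulation_frequencies` is hypothetical, and this response does not state which intermodulation term was analysed in the manuscript. The second-order terms are the difference and the sum of the two tagging frequencies; the difference term falls at 4 Hz, the lower edge of the band conventionally labelled theta (roughly 4 to 8 Hz), which may be the overlap the reviewer alludes to.

```python
from typing import List


def intermodulation_frequencies(f1: int, f2: int, max_order: int = 2) -> List[int]:
    """Positive frequencies n*f1 + m*f2 with n and m both non-zero and |n| + |m| <= max_order.

    Terms with n == 0 or m == 0 are plain harmonics of a single input and are
    excluded, since intermodulation requires a contribution from both inputs.
    """
    found = set()
    for n in range(-max_order, max_order + 1):
        for m in range(-max_order, max_order + 1):
            if n == 0 or m == 0:
                continue
            if abs(n) + abs(m) > max_order:
                continue
            f = n * f1 + m * f2
            if f > 0:
                found.add(f)
    return sorted(found)


# Tagging frequencies named in this response: 36 Hz (visual), 40 Hz (auditory).
second_order = intermodulation_frequencies(36, 40)
```

With these inputs the second-order terms are 4 Hz (40 - 36) and 76 Hz (40 + 36).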

      Reviewer #3 (Public review):

      Brickwedde et al. attempt to clarify the role of alpha in sensory gain modulation by exploring the relationship between attention-related changes in alpha and attention-related changes in sensory-evoked responses, which surprisingly few studies have examined given the prevalence of the alpha inhibition hypothesis. The authors use robust methods and provide novel evidence that alpha likely exhibits inhibitory control over later processing, as opposed to early sensory processing, by providing source-localization data in a cross-modal attention task.

      This paper seems very strong, particularly given that the follow-up MEG study both (a) clarifies the task design and separates the effect of distractor stimuli into other experimental blocks, and (b) provides source-localization data to more concretely address whether alpha inhibition is occurring at or after the level of sensory processing, and (c) replicates most of the EEG study's key findings.

      We are very grateful to the reviewer for their positive feedback and evaluation of our work.

      There are some points that would be helpful to address to bolster the paper. First, the introduction would benefit from a somewhat deeper review of the literature, not just reviewing when the effects of alpha seem to occur, but also addressing how the effect can change depending on task and stimulus design (see review by Morrow, Elias & Samaha, 2023).

      We thank the reviewer for this suggestion and agree. We will add a paragraph to the introduction which refers to missing correlation studies and the impact of task design.

      Additionally, the discussion could benefit from more cautionary language around the revision of the alpha inhibition account. For example, it would be helpful to address some of the possible discrepancies between alpha and SSEP measures in terms of temporal specificity, SNR, etc. (see Peylo, Hilla, & Sauseng, 2021). The authors do a good job speculating as to why they found differing results from previous cross-modal attention studies, but I'm also curious whether the authors think that alpha inhibition/modulation of sensory signals would have been different had the distractors been within the same modality or whether the cues indicated target location, rather than just modality, as has been the case in so much prior work?

      We thank the reviewer for suggesting these interesting discussion points and will include a paragraph in our discussion which goes deeper into these topics.

      Overall, the analyses and discussion are quite comprehensive, and I believe this paper to be an excellent contribution to the alpha-inhibition literature.

    1. Author response:

      To Reviewer #1:

      Thank you for your thorough review and comments on our work, which you described as follows: “the role of neuritin in T cell biology studied here is new and interesting.” We have grouped your comments into two categories: (i) biology and investigation approach, and (ii) experimental rigor and data presentation.

      Biology and Investigation approach comments:

      (1) Questions regarding the T cell anergy model:

      Major point “(4) Figure 1E-H. The authors assume that this immunization protocol induces anergic cells, but they provide no experimental evidence for this. It would be useful to show that T cells are indeed anergic in this model, especially those that are OVA-specific. The lack of IL-2 production by Cltr cells could be explained by the presence of fewer OVA-specific cells, rather than by an anergic status.”

      T cell anergy is a well-established concept first described by Schwartz’s group. It refers to the hyporesponsive functional state of antigen-experienced CD4 T cells (Chappert and Schwartz, 2010; Fathman and Lineberry, 2007; Jenkins and Schwartz, 1987; Quill and Schwartz, 1987). Anergic T cells are characterized by their inability to expand and to produce IL2 upon subsequent antigen re-challenge. In this paper, we adopted the existing in vivo T cell anergy induction model used by Mueller’s group (Vanasek et al., 2006). Specifically, Thy1.1+ Ctrl or Nrn1-/- TCR transgenic OTII cells were co-transferred with congenically marked Thy1.2+ WT polyclonal Treg cells into TCR-/- mice. After anergy induction, the congenically marked TCR transgenic T cells were recovered by sorting on the Thy1.1+ congenic marker and subsequently restimulated ex vivo with OVA323-339 peptide. We evaluated the T cell anergic state based on OTII cell expansion in vivo and IL2 production upon OVA323-339 restimulation ex vivo.

      “The authors assume that this immunization protocol induces anergic cells, but they provide no experimental evidence for this.”

      Because the anergy model by Mueller's group is well established (Vanasek et al., 2006), we did not feel that additional effort was required to validate this model as the reviewer suggested. Moreover, the limited IL2 production among the control cells upon restimulation confirms the validity of this model.

      “The lack of IL-2 production by Cltr cells could be explained by the presence of fewer OVA-specific cells, rather than by an anergic status”.

      Cells from Ctrl and Nrn1-/- mice on a homogeneous TCR transgenic (OTII) background were used in these experiments. It is therefore unlikely that substantial variability in TCR expression, or different expression levels of the transgenic TCR, rather than anergy induction, accounts for the observed IL2 production.

      Overall, we used this in vivo anergy model to evaluate the Nrn1-/- T cell functional state in comparison to Ctrl cells under the anergy induction condition following the evaluation of Nrn1 expression, particularly in anergic T cells.  Through studies using this anergy model, we observed a significant change in Treg induction among OTII cells. We decided to pursue the role of Nrn1 in Treg cell development and function rather than the biology of T cell anergy as evidenced by subsequent experiments.

      Minor points “(6) On which markers are anergic cells sorted for RNAseq analysis?”

      Cells were sorted out based on their congenic marker marking Ctrl or Nrn1-/- OTII cells transferred into the host mice.  We did not specifically isolate anergic cells for sequencing.

      (2) Question regarding the validity of iTreg differentiation model.

      Major point: “(5) Figure 2A-C and Figure 3. The use of iTregs to try to understand what is happening in vivo is problematic. iTregs are cells that have probably no equivalent in vivo, and so may have no physiological relevance. In any case, they are different from pTreg cells generated in vivo. Working with pTreg may be challenging, that is why I would suggest generating data with purified nTreg. Moreover, it was shown in the article of Gonzalez-Figueroa 2021 that Nrn1-/- nTreg retained a normal suppressive function, which would not be what is concluded by the authors of this manuscript. Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”.

      We thank Reviewer #1 for their feedback. While it is true that iTregs made in vitro and pTregs generated in vivo display several distinctions (e.g., differences in Foxp3 expression stability), we strongly disagree with this statement by Reviewer #1: “The use of iTregs to try to understand what is happening in vivo is problematic. iTregs are cells that have probably no equivalent in vivo, and so may have no physiological relevance.” The induced Treg cell (iTreg) model was established over 20 years ago (Chen et al., 2003; Zheng et al., 2002), and the model is widely adopted with over 2000 citations. Further, it has been instrumental in understanding different aspects of regulatory T cell biology (Hurrell et al., 2022; John et al., 2022; Schmitt and Williams, 2013; Sugiura et al., 2022).

      Because we observed reduced pTreg generation in vivo, we chose to use the in vitro iTreg model system to understand the mechanistic changes involved in Treg cell differentiation and function, specifically neuritin’s role in this process. We have made no claim that iTreg cell biology is identical to that of pTreg cells generated in vivo or of nTreg cells. However, the iTreg culture system has proved to be a good in vitro system for deciphering molecular events involved in complex processes. As such, it remains a commonly used approach by many research groups in the Treg cell field (Hurrell et al., 2022; John et al., 2022; Sugiura et al., 2022). Moreover, the iTreg in vitro culture system has been instrumental in helping us identify the change in cell electrical state in Nrn1-/- CD4 cells, and it revealed the biological link between Nrn1 and the ionotropic AMPA receptor (AMPAR), which we discuss further below. It is technically challenging to use nTreg cells for T cell electrical state studies because of their heterogeneity, which arises from development in an in vivo environment, and because of the effects of manipulation during nTreg cell isolation, both of which can affect the T cell electrical state.

      “Moreover, it was shown in the article of Gonzalez-Figueroa 2021 that Nrn1-/- nTreg retained a normal suppressive function, which would not be what is concluded by the authors of this manuscript.” 

      We have also carried out nTreg studies in vitro in addition to iTreg cells. Similar to Gonzalez-Figueroa et al.'s findings, we did not observe differences in suppression function between Nrn1-/- and WT nTreg using the in vitro suppression assay. However, Nrn1-/- nTreg cells revealed reduced suppression function in vivo (Fig. 2D-L). In fact, Gonzalez-Figueroa et al. observed reduced plasma cell formation after OVA immunization in Treg-specific Nrn1-/- mice, implicating reduced suppression from Nrn1-/- follicular regulatory T (Tfr) cells. Thus, our observation of the reduced suppression function of Nrn1-/- nTreg toward effector T cell expansion, as presented in Fig. 2D-L, does not contradict the results from Gonzalez-Figueroa et al. Rather, the conclusions of these two studies agree that Nrn1 can play important roles in immune suppression observable in vivo that are not captured readily by the in vitro suppression assay.

      “Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”

      We have stated in the manuscript on page 7 line 208 that “Similar proportions of Foxp3+ cells were observed in Nrn1-/- and Ctrl cells under the iTreg culture condition, suggesting that Nrn1 deficiency does not significantly impact Foxp3+ cell differentiation”. In the revised manuscript, we will include the data on the proportion of Foxp3+ cells before iTreg restimulation.

      (3) Confirmation of transcriptomic data regarding amino acids or electrolytes transport change

      Minor point “(3) Would not it be possible to perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane? This would be a more interesting demonstration than transcriptomic data.”

      We appreciate Reviewer #1’s suggestion to “perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane”. We have indeed already performed such experiments, which corroborate the transcriptomics data on differential amino acid and nutrient transporter expression. Specifically, we loaded either iTreg or Th0 cells with membrane potential (MP) dye and measured the change in MP level after adding the complete set of amino acids (complete AA). Upon entry, the charge carried by AAs may transiently affect cell membrane potential. Different AA transporter expression patterns may produce different patterns of MP change upon AA entry, as shown in Author response image 1. We observed a reduced MP change in Nrn1-/- iTreg cells compared to the Ctrl, whereas in Th0 cells, Nrn1-/- cells showed an enhanced MP change compared to the Ctrl. We can certainly include these data in the revised manuscript.

      Author response image 1.

      Membrane potential change induced by amino acid entry. a. Nrn1-/- or WT iTreg cells were loaded with MP dye, and MP change was measured upon the addition of a complete set of AAs. b. Nrn1-/- or WT Th0 cells were loaded with MP dye, and MP change was measured upon the addition of a complete set of AAs.

      (4) EAE experiment data assessment

      Minor point “(5) Figure 5F. How are cells re-stimulated? If polyclonal stimulation is used, the experiment is not interesting because the analysis is done with lymph node cells. This analysis should either be performed with cells from the CNS or with MOG restimulation with lymph node cells.”

      In the EAE study, the Nrn1-/- mice exhibit similar disease onset but a protracted, non-resolving disease phenotype compared to the WT control mice. Several factors may contribute to this phenotype: (1) enhanced T effector cell infiltration/persistence in the central nervous system (CNS); (2) reduced Treg cell-mediated suppression of T effector cells in the CNS; (3) protracted, non-resolving inflammation at the immunization site, which has the potential to continue sending T effector cells into the CNS, contributing to persistent inflammation. Based on this reasoning, we examined the infiltrating T effector cell number and Treg cell proportion in the CNS. We also restimulated cells from draining lymph nodes close to the inflammation site, looking for evidence of persistent inflammation. When mice were harvested around day 16 after immunization, the inflammation at the local draining lymph node should be at the contraction stage. We stimulated cells with PMA and ionomycin in order to observe all potential T effector cells in the draining lymph node rather than only MOG antigen-specific cells. We disagree with Reviewer #1’s assertion that “This analysis should either be performed with cells from the CNS or with MOG restimulation with lymph node cells.” We think the experimental approach we have taken was appropriately tailored to the biological questions we intended to answer.

      Experimental rigor and data presentation.

      (1) Data labeling and additional supporting data

      Major points (2) The authors use Nrn1+/+ and Nrn1+/- cells indiscriminately as control cells on the basis of similar biology between Nrn1+/+ and Nrn1+/- cells at homeostasis. However, it is quite possible that the Nrn1+/- cells have a phenotype in situations of in vitro activation or in vivo inflammation (cancer, EAE). It would be important to discriminate Nrn1+/- and Nrn1+/+ cells in the data or to show that both cell types have the same phenotype in these conditions too.

      (3) Figure 1A-D. Since the authors are using the Nrp1 KO mice, it would be important to confirm the specificity of the anti-Nrn1 mAb by FACS. Once verified, it would be important to add FACS results with this mAb in Figures 1A-C to have single-cell and quantitative data as well.

      Minor points  

      (1) Line 119, 120 of the text. It is said that one of the most up-regulated genes in anergic cells is Nrn1 but the data is not shown.

      (2) For all figures showing %, the titles of the Y axes are written in an odd way. For example, it is written "Foxp3% CD4". It would be more conventional and clearer to write "% Foxp3+ / CD4+" or "% Foxp3+ among CD4+".

      (4) For certain staining (Figure 3E, H) it would be important to show the raw data, in addition to MFI or % values.

      We can adapt the labeling and provide additional data, including Nrn1 staining on Treg cells and flow graphs for pmTOR and pS6 staining (Fig. 3H), as requested by Reviewer #1.

      (2) Experimental rigor:

      General comments:

      “However, it is disappointing that reading this manuscript leaves an impression of incomplete work done too quickly.”

      We were discouraged to receive the comment, “this manuscript leaves an impression of incomplete work done too quickly.” Our study of this novel molecule began without any existing biological tools such as antibodies, knockout mice, etc. Over the past several years, we have established our own antibodies for Nrn1 detection, obtained and characterized Nrn1 knockout mice, and utilized multiple approaches to identify the molecular mechanism of Nrn1 function. Through the use of the in vitro iTreg system described in this manuscript, we identified the association of Nrn1 deficiency with a change in cell electrical state, potentially connected to AMPAR function. We have further corroborated our findings by generating Nrn1 and T cell-specific AMPAR double knockout mice, and confirmed that T cell-specific AMPAR deletion could abrogate the phenotype caused by the Nrn1 deficiency (see Author response image 2). We did not include the double knockout data in the current manuscript because AMPAR function has not yet been studied thoroughly in T cell biology, and we feel this topic warrants examination in its own right. However, the unpublished data support the finding that Nrn1 modulates the T cell electrical state and, consequently, metabolism, ultimately influencing tolerance and immunity. In its current form, the manuscript represents the first characterization of the novel molecule Nrn1 in anergic cells, Tregs, and effector T cells. While this work has led to several exciting additional questions, we disagree that the novel characterization we have presented is incomplete. We feel that our present data set, which squarely highlights Nrn1’s role as an important immune regulator while shedding unprecedented light on the molecular events involved, will be of considerable interest to a broad field of researchers.

      “Multiple models have been used, but none has been studied thoroughly enough to provide really conclusive and unambiguous data. For example, 5 different models were used to study T cells in vivo. It would have been preferable to use fewer, but to go further in the study of mechanisms.”

      We have indeed used multiple in vivo models to reveal Nrn1's function in Treg differentiation, Treg suppression function, T effector cell differentiation and function, and the overall impact on autoimmune disease. Because the impact of ion channel function is often context-dependent, we examined the biological outcome of Nrn1 deficiency in several in vivo contexts. We would appreciate it if Reviewer #1 could provide a specific example, given the Nrn1 phenotype, of how to investigate the electrical change more deeply in the in vivo models.

      “Major points (1) A real weakness of this work is the fact that in most of the results shown, there are few biological replicates with differences that are often small between Ctrl and Nrn1 -/-. The systematic use of student's t-test may lead to thinking that the differences are significant, which is often misleading given the small number of samples, which makes it impossible to know whether the distributions are Gaussian and whether a parametric test can be used. RNAseq bulk data are based on biological duplicates, which is open to criticism.”

      We respectfully disagree with Reviewer #1 on the question of statistical power and significance in our work. We have used 5-8 mice/group for each in vivo model and 3-4 technical replicates for the in vitro studies, with a minimum of 2-3 replicate experiments. These group sizes and replication numbers are in line with those seen in high-impact publications. While some differences between Ctrl and Nrn1-/- appear small, they have significant biological consequences, as evidenced by the various Nrn1-/- in vivo phenotypes. Furthermore, we believe we have subjected our data to the appropriate statistical tests to ensure rigorous analysis and representation of our findings.

      To Reviewer #2.

      We thank Reviewer #2 for the careful review of the manuscript. We especially appreciate the comments that “The characterizations of T cell Nrn1 expression both in vitro and in vivo are comprehensive and convincing. The in vivo functional studies of anergy development, Treg suppression, and EAE development are also well done to strengthen the notion that Nrn1 is an important regulator of CD4 responsiveness.”

      “The major weakness of this study stems from a lack of a clear molecular mechanism involving Nrn1.”

      We fully understand this comment from Reviewer #2. The main mechanism we identified contributing to the functional defect of Nrn1-/- T cells involves novel effects on the electric and metabolic state of the cells. Although we referenced neuronal studies that indicate Nrn1 is the auxiliary protein for the ionotropic AMPA-type glutamate receptor (AMPAR) and may affect AMPAR function, we did not provide any evidence in this manuscript as the topic requires further in-depth study.   

      For the benefit of this discussion, we include our preliminary Nrn1 and AMPAR double knockout data (Author response image 2), which indicates that abrogating AMPAR expression can compensate for the defect caused by Nrn1 deficiency in vitro and in vivo. This preliminary data supports the notion that Nrn1 modulates AMPAR function, which causes changes in T cell electric and metabolic state, influencing T cell differentiation and function.  

      Author response image 2.

      Deletion of AMPAR expression in T cells compensates for the defect caused by Nrn1 deficiency. Nrn1-/- mice were crossed with T cell-specific AMPAR knockout (AMPARfl/flCD4Cre+) mice. The following mice were generated and used in the experiment: T cell-specific AMPAR knockout and Nrn1 knockout mice (AKONKO), Nrn1 knockout mice (AWTNKO), and Ctrl mice (AWTNWT). a. Deletion of AMPAR compensates for the iTreg cell defect observed in Nrn1-/- CD4 cells. iTreg live cell proportion, cell number, and Ki67 expression among Foxp3+ cells 3 days after aCD3 restimulation. b. Deletion of AMPAR in T cells abrogates the enhanced autoimmune response in Nrn1-/- mice in the EAE disease model. Mouse relative weight change and disease score progression after EAE disease induction.

      Ion channels can influence cell metabolism through multiple means (Vaeth and Feske, 2018; Wang et al., 2020). First, ion channels are involved in maintaining cell resting membrane potential. This electrical potential difference across the cell membrane is essential for various cellular processes, including metabolism (Abdul Kadir et al., 2018; Blackiston et al., 2009; Nagy et al., 2018; Yu et al., 2022). Second, ion channels facilitate the movement of ions across cell membranes. These ions are essential for various metabolic processes. For example, ions like calcium (Ca2+), potassium (K+), and sodium (Na+) play crucial roles in signaling pathways that regulate metabolism (Kahlfuss et al., 2020). Third, ion channel activity can influence cellular energy balance due to ATP consumption associated with ion transport to maintain ion balances (Erecińska and Dagani, 1990; Gerkau et al., 2019). This, in turn, can impact processes like ATP production, which is central to cellular metabolism. Thus, ion channel expression and function determine the cell’s bioelectric state and contribute to cell metabolism (Levin, 2021).

      Because the AMPAR function has not been thoroughly studied using a genetic approach in T cells, we do not intend to include the double knockout data in this manuscript before fully characterizing the T cell-specific AMPAR knockout mice.  

      “Although the biochemical and informatics studies are well-performed, it is my opinion that these results are inconclusive in part due to the absence of key "naive" control groups. This limits my ability to understand the significance of these data.

      Specifically, studies of the electrical and metabolic state of Nrn1-/- inducible Treg cells (iTregs) would benefit from similar data collected from wild-type and Nrn1-/- naive CD4 T cells.”

      We appreciate the reviewer’s comments. This comment reflects two concerns in data interpretation:

      (1) Are Nrn1-/- naïve T cells fundamentally different from WT cells? Does this fundamental difference contribute to the observed electrical and metabolic phenotype in iTreg or Th0 cells? This is a very good question, and we will perform the experiments as the reviewer suggested. Although Nrn1 is expressed at a basal (low) level in naïve T cells, deletion of Nrn1 may cause changes in naïve T cell phenotype.

      (2) Is the Nrn1-/- phenotype caused by Nrn1 functional deficiency or due to the secondary effect of Nrn1 deletion, such as non-physiological cell membrane structure changes?

      We have done the following experiment to address this concern.  We have cultured WT T cells in the presence of Nrn1 antibody and compared the outcome with Nrn1-/- iTreg cells (Author response image 3). WT iTreg cells under antibody blockade exhibited similar changes as Nrn1-/- iTreg cells, confirming the physiological relevance of the Nrn1-/- phenotype.

      Author response image 3.

      Nrn1 antibody blockade in WT iTreg cell culture caused similar phenotypic change as in Nrn1-/- iTreg cells. Nrn1-/- and WT CD4 cells were differentiated under iTreg condition in the presence of anti-Nrn1 (aNrn1) antibody or isotype control for 3 days. Cells were restimulated with anti-CD3 and in the presence of aNrn1 or isotype. a. MP measured 18hr after anti-CD3 restimulation. b. live CD4 cell number and proportion of Ki67 expression among live cells three days after restimulation. c. The proportion of Foxp3+ cells among live cells three days after restimulation.  

      Reference:

      Abdul Kadir, L., M. Stacey, and R. Barrett-Jolley. 2018. Emerging Roles of the Membrane Potential: Action Beyond the Action Potential. Front Physiol 9:1661.

      Blackiston, D.J., K.A. McLaughlin, and M. Levin. 2009. Bioelectric controls of cell proliferation: ion channels, membrane voltage and the cell cycle. Cell Cycle 8:3527-3536.

      Chappert, P., and R.H. Schwartz. 2010. Induction of T cell anergy: integration of environmental cues and infectious tolerance. Current opinion in immunology 22:552-559.

      Chen, W., W. Jin, N. Hardegen, K.J. Lei, L. Li, N. Marinos, G. McGrady, and S.M. Wahl. 2003. Conversion of peripheral CD4+CD25- naive T cells to CD4+CD25+ regulatory T cells by TGF-beta induction of transcription factor Foxp3. The Journal of experimental medicine 198:1875-1886.

      Erecińska, M., and F. Dagani. 1990. Relationships between the neuronal sodium/potassium pump and energy metabolism. Effects of K+, Na+, and adenosine triphosphate in isolated brain synaptosomes. J Gen Physiol 95:591-616.

      Fathman, C.G., and N.B. Lineberry. 2007. Molecular mechanisms of CD4+ T-cell anergy. Nat Rev Immunol 7:599-609.

      Gerkau, N.J., R. Lerchundi, J.S.E. Nelson, M. Lantermann, J. Meyer, J. Hirrlinger, and C.R. Rose. 2019. Relation between activity-induced intracellular sodium transients and ATP dynamics in mouse hippocampal neurons. The Journal of physiology 597:5687-5705.

      Hurrell, B.P., D.G. Helou, E. Howard, J.D. Painter, P. Shafiei-Jahani, A.H. Sharpe, and O. Akbari. 2022. PD-L2 controls peripherally induced regulatory T cells by maintaining metabolic activity and Foxp3 stability. Nature communications 13:5118.

      Jenkins, M.K., and R.H. Schwartz. 1987. Antigen presentation by chemically modified splenocytes induces antigen-specific T cell unresponsiveness in vitro and in vivo. The Journal of experimental medicine 165:302-319.

      John, P., M.C. Pulanco, P.M. Galbo, Jr., Y. Wei, K.C. Ohaegbulam, D. Zheng, and X. Zang. 2022. The immune checkpoint B7x expands tumor-infiltrating Tregs and promotes resistance to anti-CTLA-4 therapy. Nature communications 13:2506.

      Kahlfuss, S., U. Kaufmann, A.R. Concepcion, L. Noyer, D. Raphael, M. Vaeth, J. Yang, P. Pancholi, M. Maus, J. Muller, L. Kozhaya, A. Khodadadi-Jamayran, Z. Sun, P. Shaw, D. Unutmaz, P.B. Stathopulos, C. Feist, S.B. Cameron, S.E. Turvey, and S. Feske. 2020. STIM1-mediated calcium influx controls antifungal immunity and the metabolic function of nonpathogenic Th17 cells. EMBO molecular medicine 12:e11592.

      Levin, M. 2021. Bioelectric signaling: Reprogrammable circuits underlying embryogenesis, regeneration, and cancer. Cell 184:1971-1989.

      Nagy, E., G. Mocsar, V. Sebestyen, J. Volko, F. Papp, K. Toth, S. Damjanovich, G. Panyi, T.A. Waldmann, A. Bodnar, and G. Vamosi. 2018. Membrane Potential Distinctly Modulates Mobility and Signaling of IL-2 and IL-15 Receptors in T Cells. Biophys J 114:2473-2482.

      Quill, H., and R.H. Schwartz. 1987. Stimulation of normal inducer T cell clones with antigen presented by purified Ia molecules in planar lipid membranes: specific induction of a long-lived state of proliferative nonresponsiveness. Journal of immunology (Baltimore, Md. : 1950) 138:3704-3712.

      Schmitt, E.G., and C.B. Williams. 2013. Generation and function of induced regulatory T cells. Frontiers in immunology 4:152.

      Sugiura, A., G. Andrejeva, K. Voss, D.R. Heintzman, X. Xu, M.Z. Madden, X. Ye, K.L. Beier, N.U. Chowdhury, M.M. Wolf, A.C. Young, D.L. Greenwood, A.E. Sewell, S.K. Shahi, S.N. Freedman, A.M. Cameron, P. Foerch, T. Bourne, J.C. Garcia-Canaveras, J. Karijolich, D.C. Newcomb, A.K. Mangalam, J.D. Rabinowitz, and J.C. Rathmell. 2022. MTHFD2 is a metabolic checkpoint controlling effector and regulatory T cell fate and function. Immunity 55:65-81.e69.

      Vaeth, M., and S. Feske. 2018. Ion channelopathies of the immune system. Current opinion in immunology 52:39-50.

      Vanasek, T.L., S.L. Nandiwada, M.K. Jenkins, and D.L. Mueller. 2006. CD25+Foxp3+ regulatory T cells facilitate CD4+ T cell clonal anergy induction during the recovery from lymphopenia. Journal of immunology (Baltimore, Md. : 1950) 176:5880-5889.

      Wang, Y., A. Tao, M. Vaeth, and S. Feske. 2020. Calcium regulation of T cell metabolism. Current opinion in physiology 17:207-223.

      Yu, W., Z. Wang, X. Yu, Y. Zhao, Z. Xie, K. Zhang, Z. Chi, S. Chen, T. Xu, D. Jiang, X. Guo, M. Li, J. Zhang, H. Fang, D. Yang, Y. Guo, X. Yang, X. Zhang, Y. Wu, W. Yang, and D. Wang. 2022. Kir2.1-mediated membrane potential promotes nutrient acquisition and inflammation through regulation of nutrient transporters. Nature communications 13:3544.

      Zheng, S.G., J.D. Gray, K. Ohtsuka, S. Yamagiwa, and D.A. Horwitz. 2002. Generation ex vivo of TGF-beta-producing regulatory T cells from CD4+CD25- precursors. Journal of immunology (Baltimore, Md. : 1950) 169:4183-4189.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      Loh and colleagues investigate valence encoding in the mesolimbic dopamine system. Using an elegant approach, they show that when sucrose, which normally evokes strong dopamine neuron activity and release in the nucleus accumbens, is made aversive via conditioned taste aversion, the same sucrose stimulus later evokes much less dopamine neuron activity and release. Thus, dopamine activity can dynamically track the changing valence of an unconditioned stimulus. These results are important for helping clarify valence and value related questions that are the matter of ongoing debate regarding dopamine functions in the field.

      Strengths:

      This is an elegant way to ask this question; the within-subjects design and the continuity of the stimulus are a strong way to remove a lot of the common confounds that make it difficult to interpret valence-related questions. I think these are valuable studies that help tie up questions in the field while also setting up a number of interesting future directions. There are a number of control experiments and tweaks to the design that help eliminate a number of competing hypotheses regarding the results. The data are clearly presented and contextualized.

      Weaknesses for consideration:

      The focus on one relatively understudied region of the rat striatum for dopamine recordings could potentially limit generalization of the findings. While this can be determined in future studies, the implications should be further discussed in the current manuscript.

      We agree that the manuscript would benefit from providing a stronger rationale for our recording sites and acknowledging the potential for regional differences in dopamine signaling. We have made the following additions to the manuscript:

      Added to the Discussion: “Recordings were targeted to the lateral VTA and the corresponding approximate terminal site in the NAc lateral shell (Lammel et al., 2008). Subregional differences in dopamine activity likely contribute to mixed findings on dopamine and affect. For example, dopamine in the NAc lateral shell differentially encodes cues predictive of rewarding sucrose and aversive footshock, which is distinct from NAc medial shell dopamine responses (de Jong et al., 2019). Our findings are similar to prior work from our group targeting recordings to the NAc dorsomedial shell (Hsu et al., 2020; McCutcheon et al., 2012; Roitman et al., 2008): there, intraoral sucrose increased NAc dopamine release while the response in the same rats to quinine was significantly lower.”

      Reviewer #2 (Public review):

      Summary:

      Koh et al. report an interesting manuscript studying dopamine binding in the lateral accumbens shell of rats across the course of conditioned taste aversion. The question being asked here is how does the dopamine system respond to aversion? The authors take advantage of unique properties of taste aversion learning (notably, within-subjects remapping of valence to the same physical stimulus) to address this.

      Combining a well-controlled behavioural design (including key, unpaired controls) with fibre photometry of dopamine binding via GrabDA and of dopamine neuron activity by gCaMP, and careful analyses of behaviour (e.g., head movements; home cage ingestion), the authors show that 1) conditioned taste aversion of sucrose suppresses the activity of VTA dopamine neurons and lateral shell dopamine binding to subsequent presentations of the sucrose tastant; 2) this pattern of activity was similar to the innately aversive tastant quinine; 3) dopamine responses were negatively correlated with behavioural (inferred taste reactivity) reactivity; and 4) dopamine responses tracked the contingency between sucrose and illness because these responses recovered across extinction of the conditioned taste aversion.

      Strengths:

      There are important strengths here. The use of a well-controlled design, the measurement of both dopamine binding and VTA dopamine neuron activity, the inclusion of an extinction manipulation; and the thorough reporting of the data. I was not especially surprised by these results, but these data are a potentially important piece of the dopamine puzzle (e.g., as the authors note, salience-based argument struggles to explain these data).

      Weaknesses for consideration:

      (1) The focus here is on the lateral shell. This is a poorly investigated region in the context of the questions being asked here. Indeed, I suspect many readers might expect a focus on the medial shell. So, I think this focus is important. But, I think it does warrant greater attention in both the introduction and discussion. We do know from past work that there can be extensive compartmentalisation of dopamine responses to appetitive and aversive events and many of the inconsistent findings in the literature can be reconciled by careful examination of where dopamine is assessed. I do think readers would benefit from acknowledgement of this - for example it is entirely reasonable to suppose that the findings here may be specific to the lateral shell.

      As with our response to Reviewer 1, we agree that we should provide further rationale for focusing our recordings on the lateral shell and acknowledge potential differences in dopamine dynamics across NAc subregions. In addition to the changes in the Discussion detailed in our response to Reviewer 1, we have made the following additions to the Introduction:

      Added to the Introduction: “NAc lateral shell dopamine differentially encodes cues predictive of rewarding (i.e., sipper spout with sucrose) and aversive stimuli (i.e., footshock), which is distinct from other subregions (de Jong et al., 2019). It is important to note that other regions of the NAc may serve as hedonic hotspots (e.g., dorsomedial shell) or may more closely align with the signaling of salience (e.g., ventromedial shell; Yuan et al., 2021).”

      (2) Relatedly, I think readers would benefit from an explicit rationale for studying the lateral shell as well as consideration of this in the discussion. We know that there are anatomical (PMID: 17574681), functional (PMID: 10357457), and cellular (PMID: 7906426) differences between the lateral shell and the rest of the ventral striatum. Critically, we know that profiles of dopamine binding during ingestive behaviours there can be highly dissimilar to the rest of ventral striatum (PMID: 32669355). I do think these points are worth considering.

      There are several reasons why dopamine dynamics were recorded in the NAc lateral shell:

      (1) Dopamine neurons in more medial aspects of the VTA preferentially target the NAc medial shell and core whereas dopamine neurons in the lateral VTA – our target for VTA DA recordings – project to the lateral shell of the NAc (Lammel et al., 2008). Thus, our goal was to sample NAc release dynamics in areas that receive projections from our cell body recording sites.

      (2) Cues predictive of reward availability (i.e., sipper spout with sucrose) and aversive stimuli (i.e., footshock) are differentially encoded by NAc lateral shell dopamine, which is distinct from NAc ventromedial shell dopamine responses (de Jong et al., 2019). These findings suggest a role for NAc lateral shell dopamine in the encoding of a stimulus’s valence, which made the subregion an area of interest for further examination.

      (3) With respect to the medial NAc shell specifically, extensive literature had already shown it to be a ‘hedonic hotspot’ (Morales and Berridge, 2020; Yuan et al., 2021) whereas the ventral portion is more mixed with respect to valence (Yuan et al., 2021). We had previously shown that intraoral infusions of primary taste stimuli of opposing valence (i.e., sucrose and quinine) evoke differential responses in dopamine release within the NAc dorsomedial shell (Roitman et al., 2008). We more recently replicated these differential responses in dopamine cell bodies in the lateral VTA (Hsu et al., 2020) and thus endeavored to test whether dopamine responses in the lateral VTA to the same stimulus change as its valence changes. Given these choices, measuring dopamine release in the lateral shell was the logical next step. The field would greatly benefit from continued future work surveying the entirety of the VTA DA projection terminus.

      We have included these points of justification in the Introduction and Discussion sections.

      (3) I found the data to be very thoughtfully analysed. But in places I was somewhat unsure:

      (a) Please indicate clearly in the text when photometry data show averages across trials versus when they show averages across animals.

      We have now explicitly indicated in the figure legends of Figures 1, 3, 5, 7, and 8:

      (1) In heat maps, each row represents the averaged (across rats) response on that trial.

      (2) Traces below heat maps represent the response to infusion averaged first across trials for each rat and then across all rats.

      (3) Insets represent the average z-score across the infusion period averaged first across all trials for each rat and then across all rats.
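      The three averaging schemes listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the analysis code used for the figures. The nested-list layout `data[rat][trial]`, the function names, and the window indices are assumptions made for the example.

```python
from statistics import fmean

# Assumed layout (hypothetical): data[rat][trial] is a list of z-scored
# samples for one infusion; every trace has the same number of samples.

def heatmap_rows(data):
    """One row per trial: the response on that trial averaged across rats."""
    n_trials = len(data[0])
    rows = []
    for t in range(n_trials):
        traces_on_trial = [rat[t] for rat in data]
        rows.append([fmean(samples) for samples in zip(*traces_on_trial)])
    return rows

def group_trace(data):
    """Average across trials within each rat first, then across rats."""
    rat_means = [[fmean(samples) for samples in zip(*rat)] for rat in data]
    return [fmean(samples) for samples in zip(*rat_means)]

def inset_value(data, start, stop):
    """Mean z-score over samples [start, stop), per rat and then across rats."""
    per_rat = []
    for rat in data:
        trial_means = [fmean(trial[start:stop]) for trial in rat]
        per_rat.append(fmean(trial_means))
    return fmean(per_rat)

# Toy values for illustration only: 2 rats x 2 trials x 3 samples.
toy = [
    [[0, 1, 2], [2, 3, 4]],
    [[2, 3, 4], [4, 5, 6]],
]
rows = heatmap_rows(toy)
trace = group_trace(toy)
inset = inset_value(toy, 1, 3)
```

      Averaging within each rat before averaging across rats gives every animal equal weight in the group trace and inset, regardless of how many trials it contributed.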

      (b) I did struggle with the correlation analyses, for two reasons.

      (i) First, the key finding here is that the dopamine response to intraoral sucrose is suppressed by taste aversion. So, this will significantly restrict the range of dopamine transients, making interpretation of the correlations difficult.

      The overall hypothesis is that the dopamine response would correlate with the valence of a taste stimulus – even and especially when the stimulus remained constant but its valence changed. We inferred valence from the behavioral reactivity to the stimulus – reasoning that an appetitive taste will evoke minimal movement of the nose and paws (presumably because the animals are primarily engaging in small mouth movements associated with ingestion as shown by the seminal work of Grill and Norgren (1978) and the many studies published by the K.C. Berridge group) whereas an aversive taste will evoke significantly more movement as the rats engage in rejection responses (e.g. forelimb flails, chin rubs, etc.). When we conducted our regression analyses we endeavored to be as transparent as possible and labeled each symbol based on group (Unpaired vs Paired) and day (Conditioning vs Test). Both behavioral reactivity and dopamine responses change – but only for the Paired rats across days. In this sense, we believe the interpretation is clear. However, the Reviewer raises an important criticism that there would essentially be a floor effect with dopamine responses. We believe this is mitigated by data acquired across extinction and especially in Figure 9B. Here, the observations that dopamine responses fall to near zero but return to pre-conditioning levels in the Paired group with strong correlation between dopamine and behavioral reactivity throughout would hopefully partially allay the Reviewer’s concerns. See Part ii below for further support.

      (ii) Second, the authors report correlations by combining data across groups/conditions. I understand why the authors have done this, but it does risk obscuring differences between the groups. So, my question is: what happens to this trend when the correlations are computed separately for each group? I suspect other readers will share the same question. I think reporting these separate correlations would be very helpful for the field - regardless of the outcome.

      To address this concern, we performed separate regression analyses for Paired and Unpaired rats and provide the table below to detail results where data were combined across groups or separated. As expected, all analyses in Paired rats indicated a significant inverse relationship between dopamine and behavioral reactivity. After all, it is only in this group that behavioral reactivity to the taste stimulus changes as a function of conditioning. Perhaps even more striking, in most experiments we still observed a significant inverse relationship between dopamine and behavioral reactivity even when the regression analysis was restricted to Unpaired rats. We have outlined the separated correlations below (asterisks denote slopes significantly different from 0; * p<0.05; ** p<0.01; *** p<0.005; **** p<0.001):

      Author response table 1.
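      The combined versus group-wise regressions described above can be sketched as follows. This is a hedged illustration, not the analysis behind Author response table 1. The function names, the `(group, reactivity, dopamine)` tuple layout, and the toy numbers are assumptions. The sketch computes only the least-squares slope, intercept, and Pearson r; testing whether a slope differs from 0 would need a statistics package and is not shown.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

def fit_by_group(points):
    """points: list of (group, reactivity, dopamine) tuples (hypothetical layout).

    Returns one fit per group plus a 'Combined' fit over all points.
    """
    grouped = {}
    for group, reactivity, dopamine in points:
        xs, ys = grouped.setdefault(group, ([], []))
        xs.append(reactivity)
        ys.append(dopamine)
    fits = {group: fit_line(xs, ys) for group, (xs, ys) in grouped.items()}
    fits["Combined"] = fit_line([p[1] for p in points], [p[2] for p in points])
    return fits

# Made-up toy values, for illustration only (not data from the study).
toy_points = [
    ("Paired", 1, 3.0), ("Paired", 2, 2.0), ("Paired", 3, 1.0),
    ("Unpaired", 1, 2.5), ("Unpaired", 2, 2.0), ("Unpaired", 3, 1.5),
]
fits = fit_by_group(toy_points)
```

      Reporting the per-group fits next to the combined fit shows whether an overall trend is present within each group or arises mainly from differences between groups.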

      (4) Figure 1A is not as helpful as it might be. I do think readers would expect a more precise reporting of GCaMP expression in TH+ and TH- neurons. I also note that many of the nuances in terms of compartmentalisation of dopamine signalling discussed above apply to ventral tegmental area dopamine neurons (e.g. medial v lateral) and this is worth acknowledging when interpreting t

      Others have reported (Choi et al., 2020) and quantified (Hsu et al., 2020) GCaMP6f expression in TH+ neurons. While we didn’t report these quantifications, our observations were very much in line with previous quantifications from our laboratory (Hsu et al. 2020).

      We agree that we should elaborate on VTA subregional differences and have addressed this point above (see responses to Reviewer 1 Weakness #1 and Reviewer 2 Weakness #2).

      Reviewer #3 (Public review):

      Summary:

      This study helps to clarify the mixed literature on dopamine responses to aversive stimuli. While it is well accepted that dopamine in the ventral striatum increases in response to various rewarding and appetitive stimuli, aversive stimuli have been shown to evoke phasic increases or decreases depending on the exact aversive stimuli, behavioral paradigm, and/or dopamine recording method and location examined. Here the authors use a well-designed set of experiments to show differential responses to an appetitive primary reward (sucrose) that later becomes a conditioned aversive stimulus (sucrose previously paired with lithium chloride in a conditioned taste aversion paradigm). The results are interesting and add valuable data to the question of how the mesolimbic dopamine system encodes aversive stimuli; however, the conclusions are strongly stated given that the current data do not necessarily align with prior conflicting data in terms of recording location, and it is not clear exactly how to interpret the generally biphasic dopamine response to the CTA-sucrose, which also evolves over exposures within a single session.

      Strengths:

      • The authors nicely demonstrate that their two aversive stimuli examined, quinine and sucrose following CTA, evoked aversive facial expressions and paw movements that differed from those following rewarding sucrose to support that the stimuli experienced by the rats differ in valence.

      • Examined dopamine responses to the exact same sensory stimuli conditioned to have opposing valences, avoiding standard confounds of appetitive and aversive stimuli being sensed by different sensory modalities (i.e., sweet taste vs. electric shock)

      • The authors examined multiple measurements of dopamine activity - cell body calcium (GCaMP6f) in midbrain and release in NAc (Grab-DA2h), which is useful as the prior mixed literature on aversive dopamine responses comes from a variety of recording methods.

      • Correlations between sucrose preference and dopamine signals demonstrate behavioral relevance of the differential dopamine signals.

      • The delayed testing experiment in Figure 7 nicely controls for the effect of time to demonstrate that the "rewarding" dopamine response to sucrose only recovers after multiple extinction sucrose exposures to extinguish the CTA.

      Weaknesses for consideration:

      (1) Regional differences in dopamine signaling to aversive stimuli are mentioned in the introduction and discussion. For instance, the idea that dopamine encodes salience is strongly argued against in the discussion, but the paper cited as arguing for that (Kutlu et al. 2021) is recording from the medial core in mice. Given other papers cited in the text about the regional differences in dopamine signaling in the NAc and from different populations of dopamine neurons in midbrain, it's important to mention this distinction with respect to salience signaling. Relatedly, the text says that the lateral NAc shell was targeted for accumbens recordings, but the histology figure looks like the majority of fibers were in the anterior lateral core of NAc. For the current paper to be a convincing last word on the issue, it would be extremely helpful to have similar recordings done in other parts of the NAc to do a more thorough comparison against other studies.

      As the Reviewer notes, NAc dopamine recordings were aimed at the lateral NAc shell. It is possible that some dopamine neurons lying within the anterior lateral core were recorded. Fiber photometry and the size of the fiber optics cannot definitively identify the precise location and number of dopamine neurons from which we recorded. Still, recording sites did not systematically differ between groups. Further, the within-subjects design helps to mitigate any potential biases for one subregion over another. The results presented in the manuscript strongly support a valence code. It is difficult to be the ‘last word’ on this topic and we suspect debate will continue. We used taste stimuli for appetitive and aversive stimuli – whereas many in the field will continue to use other noxious stimuli (e.g. foot shock) that likely recruit different circuits en route to the VTA. And there may very well be a different regional profile for dopamine signaling with different noxious stimuli. Moreover, we used intraoral infusion to avoid confounds of stimulus avoidance and competing motivations (e.g. food or fluid deprivation). We believe that this is one of the most important and unique features of our report. Recent work supports a role for phasic increases in dopamine in avoidance of noxious stimuli (Jung et al., 2024) and it will be critical for the field to reflect on the differences between avoidance and aversion. Moreover, in ongoing studies we aspire to fully survey dopamine signaling in conditioned taste aversion across the medial-lateral and dorsal-ventral axes of the VTA and NAc.

      (2) Dopamine release in the NAc never dips below baseline for the conditioned sucrose. Is it possible to really consider this as a signal for valence per se, as opposed to it being a weaker response relative to the original sucrose response?

      Indeed, NAc dopamine release in response to neither intraoral quinine nor aversive sucrose dips below baseline; rather, dopamine binding does not change from pre-infusion baseline levels. It should be noted that VTA dopamine cell body activity does indeed dip below baseline in response to aversive sucrose. Moreover, using fast-scan cyclic voltammetry, we showed that dopamine release dips below baseline in the NAc dorsomedial shell in response to intraoral quinine (Roitman et al., 2008). The differences across recording sites may reflect regional differences but they may also reflect differences in recording approaches. GrabDA2h, used here, has relatively slow kinetics that may obscure dips below baseline (see response to Weakness #8 below).

      (3) Related to this, the main measure of the dopamine signal here, "mean z-score," obscures the temporal dynamics of the aversive dopamine response across a trial. This measure is used to claim that sucrose after CTA is "suppressing" dopamine neuron activity and release, which is true relative to the positive valence sucrose response. However, both GRAB-DA and cell-body GCaMP measurements show clear increases after onset of sucrose infusion before dipping back to baseline or slightly below in the average of all example experiments displayed. One could point to these data to argue either that aversive stimuli cause phasic increases in dopamine (due to the initial increase) or decreases (due to the delayed dip below baseline) depending on the measurement window. Some discussion of the dynamics of the response and how it relates to the prior literature would be useful.

      We have used mean z-score for much of our quantitative analysis, but the Reviewer raises the intriguing possibility that doing so masks an initial increase in dopamine release and VTA DA activity evoked by aversive taste. We included the heat maps in the manuscript to be as transparent as possible about the time course of dopamine responses, both within a trial and across trials. The Reviewer’s point prompted us to reflect further on the heat maps and recognize that trials early in the session often showed a brief increase in dopamine for aversive sucrose, but this response dissipated (NAc dopamine release) or flipped (VTA DA cell body activity) over trials. We now quantitatively characterize this feature by looking at the time course of dopamine responses in each third of the trials (1-10, 11-20, 21-30; see Author response images 1, 2 and 3). As we infer the valence of the stimulus from nose and paw movements (behavioral reactivity), it is especially striking that we observe a similar time course for changes in behavior. Collectively, the data may reflect an updating process that is relatively slow and requires experience of the stimulus in a new (aversive) state, that is, a model-free process. While our experiments were not designed to test the updating of dopamine responses and discern their participation in model-based versus model-free learning processes, another debate in the dopamine field (Cone et al., 2016; Deserno et al., 2021), the data reflect a model-free process. This is further supported in the experiment involving multiple conditioning sessions, where dopamine ‘dips’ are observed in trials 1-10 on Conditioning Day 3 and Extinction Day 1, when the new value of sucrose has been established. Finally, the relatively slow updating of the value of sucrose is reflected in older literature using a continuous intraoral infusion. Using this approach, rats began rejecting the saccharin infusion only after ~2 min rather than immediately (Schafe et al., 1998; Schafe and Bernstein, 1996; Wilkins and Bernstein, 2006).
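
      The block-wise analysis described above (trials 1-10, 11-20, 21-30) can be sketched as follows. This is a minimal illustration only, not the analysis code used for the response images: the function name `block_mean_traces`, the array layout (trials by time points), and the synthetic input are all assumptions.

```python
import numpy as np


def block_mean_traces(z_trials, block_size=10):
    """Average z-scored traces within consecutive blocks of trials.

    z_trials: array of shape (n_trials, n_timepoints), one z-scored trace
    per trial, in the order the trials were delivered (assumed layout).
    Returns an array of shape (n_blocks, n_timepoints).
    """
    z_trials = np.asarray(z_trials, dtype=float)
    n_trials = z_trials.shape[0]
    if n_trials % block_size != 0:
        raise ValueError("n_trials must be a multiple of block_size")
    n_blocks = n_trials // block_size
    # Row-major reshape keeps consecutive trials together in each block.
    return z_trials.reshape(n_blocks, block_size, -1).mean(axis=1)


# Synthetic input (not real data): 30 trials x 4 time points,
# where every sample in trial i simply equals i.
demo = np.repeat(np.arange(30.0)[:, None], 4, axis=1)
blocks = block_mean_traces(demo)
print(blocks.shape)   # one averaged trace per third of the session
print(blocks[:, 0])   # block means at the first time point
```

      Averaging within thirds keeps the within-trial time course visible while still showing how the response changes across the session, which a single session-wide mean z-score cannot do.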

      Author response image 1.

      Author response image 2.

      Author response image 3.

      (4) Would this delayed below-baseline dip be visible with a shorter infusion time?

      While our experiments did not explore this parameter, it would be interesting to parametrically vary infusion duration times and examine differences in dopamine responses. However, we believe the most parsimonious explanation is that the ‘dip’ in VTA cell body activity develops as a function of the slow updating of the value of sucrose reflective of a model-free process. We recognize that this is mere speculation.

      (5) Does the max of the increase or the dip of the decrease better correlate with the behavioral measures of aversion (orofacial, paw movements) or sucrose preference than "mean z-score" measure used here?

      It seems plausible that the most extreme deviation from baseline could correlate better with behavioral measures. However, the time courses to maximum increase and maximum decrease differ. Moreover, with appetitive sucrose, there are often multiple transients that occur throughout a single intraoral infusion. Coupled with a noisy time course for individual components of behavioral reactivity, we determined that averaging data across the whole infusion period (i.e. mean z-score) was the most objective way we could analyze the dopamine and behavioral responses to taste stimuli.
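
      To make the trade-off between these summary measures concrete, the sketch below computes the mean, the peak, and the trough (with their latencies) for a single trace. It is a hypothetical illustration: `summarize_response` and the synthetic trace are assumptions, not part of the manuscript's analysis.

```python
import numpy as np


def summarize_response(z_trace, times):
    """Return mean, peak and trough of one z-scored trace over an infusion window.

    z_trace and times are equal-length 1D sequences; times are in seconds
    relative to infusion onset (assumed convention).
    """
    z = np.asarray(z_trace, dtype=float)
    t = np.asarray(times, dtype=float)
    i_max = int(np.argmax(z))
    i_min = int(np.argmin(z))
    return {
        "mean_z": float(z.mean()),
        "peak_z": float(z[i_max]),
        "peak_time": float(t[i_max]),
        "trough_z": float(z[i_min]),
        "trough_time": float(t[i_min]),
    }


# Synthetic trace (not real data): early increase followed by a late dip.
times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
trace = np.array([0.0, 2.0, 0.5, -1.5, -1.0])
summary = summarize_response(trace, times)
print(summary)
```

      In this toy trace the early increase and the late dip cancel in the mean, while the peak and trough occur at different latencies. That is the reason a window-wide average and an extreme-value measure can lead to different conclusions.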

      (6) The authors argue strongly in the discussion against the idea that dopamine is encoding "salience." Could this initial peak (also seen in the first few trials of quinine delivery, fig 1c color plot) be a "salience" response?

      Our response above to the potential for ‘mixed’ dopamine responses to aversive sucrose led to additional analyses that support a slow updating of both behavior and dopamine to the new, aversive value of sucrose. Quinine is innately aversive and thus the Reviewer rightly points out that even here we observe an increase in dopamine release evoked by quinine on the first few trials (as observed in the heat map). We’d like to note, though, that the order of stimulus exposure was counterbalanced across rats. In those rats first receiving a sucrose session, quinine initially caused a modest increase in dopamine release during the first 10 trials (which is more pronounced in the first 2 trials). In the subsequent 2 blocks of 10 trials, no such increase was observed. Interestingly, in rats for which quinine was their first stimulus, we did not see an increase in dopamine release on the first few trials (see Author response image 4). We speculate that the initial sucrose session required the value of intraoral infusions to be updated when quinine was delivered to these rats and that, once more, the updating process may be slow and akin to a model-free process. This analysis, at present, is underpowered but will direct future attention in follow-up work.

      Author response image 4.

      (7) Related to this, the color plots showing individual trials show a reduction in the increases to positive valence sucrose across conditioning day trials and a flip from infusion-onset increase to delayed increases across test day trials. This evolution across days makes it appear that the last few conditioning day trials would be impossible to discriminate from the first few test day trials in the CTA-paired. Presumably, from strength of CTA as a paradigm, the sucrose is already aversive to the animals at the first trial of test day. Why do the authors think the response evolves across this session?

      As the Reviewer noted, Points 3-7 are related. We have speculated that the evolving dopamine response in Paired rats across test day trials reflects a model-free process. Importantly, as in the manuscript, our additional analyses once again show a tight relationship between behavioral reactivity and the dopamine response across the test session trials. It is important to note, though, that these experiments were not designed to test if responses reflect model-free or model-based processes.

      (8) Given that most of the work is using a conditioned aversive stimulus, the comparison to a primary aversive tastant quinine is useful. However, the authors saw basically no dopamine response to a primary aversive tastant quinine (measured only with GRAB-DA) and saw less noticeable decreases following CTA for NAc recordings with GRAB-DA2h than with cell body GCaMP. Given that they are using the high-affinity version of the GRAB sensor, this calls into question whether this is a true difference in release vs. soma activity or issue of high affinity release sensor making decreases in dopamine levels more difficult to observe.

      We share the same speculation as the Reviewer. Using fast-scan cyclic voltammetry, albeit measuring dopamine concentration in the dorsomedial shell, we observed a clear decrease from baseline with intraoral infusions of quinine (Roitman et al., 2008). Using fiber photometry here, the Reviewer and we note that GRAB_DA2h is a high-affinity (i.e., EC50: 7nM) dopamine sensor with relatively long off-kinetics (i.e., t1/2 decay time: 7300ms) (Labouesse et al., 2020). It may therefore be much more difficult to observe decreases (below baseline) using this sensor. The publication of new dopamine sensors - with lower affinity, faster kinetics, and greater dynamic range (Zhuo et al., 2024) – introduces opportunities for comparison and the greater potential for capturing decreases below baseline. Due to the poorer kinetics associated with GRAB_DA2h, we would not assert that direct comparisons between the GCaMP- and GRAB-based signals observed here represent true differences between somatic and terminal activity.

      References

      Choi JY, Jang HJ, Ornelas S, Fleming WT, Fürth D, Au J, Bandi A, Engel EA, Witten IB. 2020. A Comparison of Dopaminergic and Cholinergic Populations Reveals Unique Contributions of VTA Dopamine Neurons to Short-Term Memory. Cell Rep 33. doi:10.1016/j.celrep.2020.108492

      Cone JJ, Fortin SM, McHenry JA, Stuber GD, McCutcheon JE, Roitman MF. 2016. Physiological state gates acquisition and expression of mesolimbic reward prediction signals. Proc Natl Acad Sci U S A 113. doi:10.1073/pnas.1519643113

      de Jong JW, Afjei SA, Pollak Dorocic I, Peck JR, Liu C, Kim CK, Tian L, Deisseroth K, Lammel S. 2019. A Neural Circuit Mechanism for Encoding Aversive Stimuli in the Mesolimbic Dopamine System. Neuron 101. doi:10.1016/j.neuron.2018.11.005

      Deserno L, Moran R, Michely J, Lee Y, Dayan P, Dolan RJ. 2021. Dopamine enhances model-free credit assignment through boosting of retrospective model-based inference. Elife 10. doi:10.7554/eLife.67778

      Hsu TM, Bazzino P, Hurh SJ, Konanur VR, Roitman JD, Roitman MF. 2020. Thirst recruits phasic dopamine signaling through subfornical organ neurons. Proc Natl Acad Sci U S A 117:30744–30754. doi:10.1073/pnas.2009233117

      Jung K, Krüssel S, Yoo S, An M, Burke B, Schappaugh N, Choi Y, Gu Z, Blackshaw S, Costa RM, Kwon HB. 2024. Dopamine-mediated formation of a memory module in the nucleus accumbens for goal-directed navigation. Nat Neurosci. doi:10.1038/s41593-024-01770-9

      Labouesse MA, Cola RB, Patriarchi T. 2020. GPCR-based dopamine sensors—A detailed guide to inform sensor choice for in vivo imaging. Int J Mol Sci. doi:10.3390/ijms21218048

      Lammel S, Hetzel A, Häckel O, Jones I, Liss B, Roeper J. 2008. Unique Properties of Mesoprefrontal Neurons within a Dual Mesocorticolimbic Dopamine System. Neuron 57. doi:10.1016/j.neuron.2008.01.022

      McCutcheon JE, Ebner SR, Loriaux AL, Roitman MF, Tobler PN. 2012. Encoding of aversion by dopamine and the nucleus accumbens. Front Neurosci 6. doi:10.3389/fnins.2012.00137

      Morales I, Berridge KC. 2020. ‘Liking’ and ‘wanting’ in eating and food reward: Brain mechanisms and clinical implications. Physiol Behav. doi:10.1016/j.physbeh.2020.113152

      Roitman MF, Wheeler RA, Wightman RM, Carelli RM. 2008. Real-time chemical responses in the nucleus accumbens differentiate rewarding and aversive stimuli. Nat Neurosci 11:1376–1377. doi:10.1038/nn.2219

      Schafe GE, Bernstein IL. 1996. Forebrain contribution to the induction of a brainstem correlate of conditioned taste aversion: I. The amygdala. Brain Res 741. doi:10.1016/S0006-8993(96)00906-7

      Schafe GE, Thiele TE, Bernstein IL. 1998. Conditioning method dramatically alters the role of amygdala in taste aversion learning. Learning and Memory 5. doi:10.1101/lm.5.6.481

      Wilkins EE, Bernstein IL. 2006. Conditioning method determines patterns of c-fos expression following novel taste-illness pairing. Behavioural Brain Research 169. doi:10.1016/j.bbr.2005.12.006

      Yuan L, Dou YN, Sun YG. 2021. Topography of reward and aversion encoding in the mesolimbic dopaminergic system. Journal of Neuroscience 39. doi:10.1523/JNEUROSCI.0271-19.2019

      Zhuo Y, Luo B, Yi X, Dong H, Miao X, Wan J, Williams JT, Campbell MG, Cai R, Qian T, Li F, Weber SJ, Wang L, Li B, Wei Y, Li G, Wang H, Zheng Y, Zhao Y, Wolf ME, Zhu Y, Watabe-Uchida M, Li Y. 2024. Improved green and red GRAB sensors for monitoring dopaminergic activity in vivo. Nat Methods 21. doi:10.1038/s41592-023-02100-w

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This is an interesting study on the role of FGF signaling in the induction of primitive streak-like cells (PS-LC) in human 2D-gastruloids. The authors use a previously characterized standard culture that generates a ring of PS-LCs (TBXT+) and correlate this with pERK staining. A requirement for FGF signaling in TBXT induction is demonstrated via pharmacological inhibition of MEK and FGFR activity. A second set of culture conditions (with no exogenous FGFs) suggests that endogenous FGFs are required for pERK and TBXT induction. The authors then characterize, via scRNA-seq, various components of the FGF pathway (genes for ligands, receptors, ERK regulators, and HSPG regulation). They go on to characterize the pFGFR1, receptor isoforms, and polarized localization of this receptor. Finally, they perform FGF4 inhibition and use a cell line with a limited FGF17 inactivation (heterozygous null) and show that loss of these FGFs reduces PS-LC and derivative cell types.

      Strengths:

      (1) As the authors point out, the role of FGF signaling in gastrulation is less well understood than other signaling pathways. Hence this is a valuable contribution to that field.

      (2) The FGF4 and FGF17 loss-of-function experiments in Figure 5 are very intriguing. This is especially so given the intriguing observation that these FGFs appear to be dominating in this model of human gastrulation, in contrast to what FGFs dominate in mice, chicks, and frogs.

      (3) In general this paper is valuable as a further development of the human gastruloid system and the role of FGF signaling in the induction of PS-LCs. The wide net that the authors cast in characterizing the FGF ligand gene, receptor isoforms, and downstream components provides a foundation for future work. As the authors write near the beginning of the Discussion "Many questions remain."

      We thank the reviewer for these positive comments.

      Weaknesses:

      (1) FGFs are cell survival factors in various aspects of development. The authors fail to address cell death due to loss of FGF signaling in their experiments. For example, in Figure 1E (which requires statistical analysis) and 1G (the bottom FGFRi row), there appears to be a significant amount of cell loss. Is this due to cell death? The authors should address the question of whether the role of FGF/ERK signaling is to keep the cells alive.

      Indeed, FGF also strongly affects cell number, and it is an interesting question to what extent this depends on ERK. Our manuscript focuses instead on the role of FGF/ERK signaling in cell fate patterning. However, as mentioned in our discussion, Figure 1d,e show that doxycycline-induced pERK leads to more TBXT+ cells than the control without restoring cell number, suggesting that the role of FGF in controlling cell number is independent of the requirement for FGF/ERK in PS-LC differentiation. Unpublished data below showing a MEK inhibitor dose response further support this: low doses of MEKi are sufficient to inhibit differentiation without affecting cell number. To address the reviewer’s question, we will include these data in the revised manuscript and perform several additional experiments to determine in more detail how cell death and proliferation depend on FGF.

      Author response image 1.

      MEK inhibition affects differentiation and cell number at different doses. a-c) Control and MEKi (0.3 µM) treated colonies with similar cell number but different TBXT expression. d-f) Quantification of cell number per colony (d), percentage of TBXT-positive cells per colony (e), and the distribution of pERK intensities for different doses of MEK inhibitor (f). N>6 colonies per condition. MEKi = PD0325901. Scale bar = 50 µm.

      (2) Regarding the sparse cells in 1G, is there a reduction in cell number only with FGFRi and not MEKi? Is this reproducible? Gattiglio et al (Development, 2023, PMID: 37530863) present data supporting a "community effect" in the FGF-induced mesoderm differentiation of mouse embryonic stem cells. Could a community effect be at play in this human system (especially given the images in the bottom row of 1G)? If the authors don't address this experimentally they should at least address the ideas in Gattiglio et al.

      Indeed, FGFRi reproducibly affects cell number more than MEKi, in line with the fact that pathways downstream of FGF other than MAPK/ERK (e.g. PI3K) play important roles in cell survival and growth. We think the lack of differentiation in MEKi and FGFRi in Fig.1g cannot be attributed to a loss of cells combined with a community effect. This is because, without FGFRi or MEKi, cells also differentiate to primitive streak at much lower densities than those shown, consistent with the data we show above in response to (1), which argue against a primarily indirect effect of FGF on PS-LC differentiation through cell density. In the context of directed differentiation (rather than 2D gastruloids), we will show this in a controlled manner by repeating the experiment in Fig.1g while adjusting cell seeding densities to obtain similar final cell densities in all three conditions. We will also include Gattiglio et al. in our revised discussion.

      (3) Do the FGF4 and FGF17 LOF experiments in Figure 5 affect cell numbers like FGFRi in Figure 1?

      It seems the effect on cell number is small, but we will analyze this carefully and include it in the revised manuscript. A small effect would be consistent with our unpublished data below showing a near-uniform proliferation rate. This in turn suggests that low levels of pERK in the center are sufficient to maintain proliferation there, while the much higher pERK levels in the PS-LC ring (that we think depend on FGF4 and FGF17) do not significantly increase the proliferation rate (see Fig.1 in the manuscript for the pERK pattern). Thus, loss of high pERK in the PS-LC ring while maintaining low pERK throughout would not be expected to have a major impact on cell number but would impact differentiation. In contrast, loss of all FGF signaling through FGFRi does dramatically affect cell number. This is again consistent with the data provided in response to (1) showing that ERK levels can be reduced to a point where PS-LC differentiation is lost without significantly affecting cell number. We will include the data below in the revised manuscript.

      Author response image 2.

      Why examine PS-LC induction only in FGF17 heterozygous cells and not homozygous FGF17 nulls?

      We were unable to obtain homozygous FGF17 nulls, and it is not clear whether there is a biological reason for this. We will try again and otherwise attempt to corroborate our findings with further knockdown data.

      (4) The idea that FGF8 plays a dominant role during gastrulation of other species but not humans is so intriguing it warrants deeper testing. The authors dismiss FGF8 because its mRNA "...levels always remained low." (line 363) as well as the data published in Zhai et al (PMID: 36517595) and Tyser et al (PMID: 34789876). But there are cases in mouse development where a gene was expressed at levels so low, that it might be dismissed, and yet LOF experiments revealed it played a role or even was required in a developmental process. The authors should consider FGF8 inhibition or inactivation to explore its potential role, despite its low levels of expression.

      We agree with the reviewer that FGF8 is worth investigating further and we will now pursue this.

      (5) Redundancy is a common feature in FGF genetics. What is the effect of inhibiting FGF4 in FGF17 LOF cells?

      We will attempt to do the experiment the reviewer suggests.

      (6) I suggest that the authors take more caution in describing FGF gradients. For example, in one Results heading they write "Endogenous FGF4 and FGF17 gradients underly the ERK activity pattern.", implying an FGF protein gradient. However, they only present data for FGF mRNA, not protein. This issue would be clarified if they used proper nomenclature for gene, mRNA (italics), and protein (no italics) throughout the paper.

      We will edit the paper to more clearly distinguish protein and mRNA.

      Reviewer #2 (Public review):

      Summary:

      The role of FGFs in embryonic development and stem cell differentiation has remained unclear due to its complexity. In this study, the authors utilized a 2D human stem cell-based gastrulation model to investigate the functions of FGFs. They discovered that FGF-dependent ERK activity is closely linked to the emergence of primitive streak cells. Importantly, this 2D model effectively illustrates the spatial distribution of key signaling effectors and receptors by correlating these markers with cell fate markers, such as T and ISL1. Through inhibition and loss-of-function studies, they further corroborated the needs of FGF ligands. Their data shows that FGFR1 is the primary receptor, and FGF2/4/17 are the key ligands for primitive streak development, which aligns with observations in primate embryos. Additional experiments revealed that the reduction of FGF4 and FGF17 decreases ERK activity.

      Strengths:

      This study provides comprehensive data and improves our understanding of the role of FGF signaling in primate primitive streak formation. The authors provide new insights related to the spatial localization of the key components of FGF signaling and attempt to reveal the temporal dynamics of the signal propagation and cell fate decision, which has been challenging.

      Weaknesses:

      Given the solid data, the work only partially clarifies the complex picture of FGF signaling, so details remain somewhat elusive. The findings lack a strong punchline, which may limit their broader impact.

      We thank this reviewer for their valuable feedback and the compliment on the solidity of our data. The punchline of our work is that FGF4- and FGF17-dependent ERK signaling plays a key role in human PS-LC differentiation, and that these are different FGFs than those thought to drive mouse gastrulation. A second key point is that, like BMP and TGFβ signaling, FGF signaling is restricted to the basolateral sides of pluripotent stem cell colonies due to polarized receptor expression, which is crucial for understanding the response to exogenous ligands added to the cell medium. Indeed, many facets of FGF signaling remain to be investigated in the future, such as how FGF regulates and is regulated by other signals, to which we will dedicate a separate manuscript.

      Reviewer #3 (Public review):

      Jo and colleagues set out to investigate the origins and functions of localized FGF/ERK signaling for the differentiation and spatial patterning of primitive streak fates of human embryonic stem cells in a well-established micropattern system. They demonstrate that endogenous FGF signaling is required for ERK activation in a ring-domain in the micropatterns, and that this localized signaling is directly required for differentiation and spatial patterning of specific cell types. Through high-resolution microscopy and transwell assays, they show that cells receive FGF signals through basally localized receptors. Finally, the authors find that there is a requirement for exogenous FGF2 to initiate primitive streak-like differentiation, but endogenous FGFs, especially FGF4 and FGF17, fully take over at later stages.

      Even though some of the authors' findings - such as the localized expression of FGF ligands during gastrulation and the importance of FGF/ERK signaling for cell differentiation in the primitive streak - have been reported in model organisms before, this is one of the first studies to investigate the role of FGF signaling during primitive streak-like differentiation of human cells. In doing so, the paper reports a number of interesting and valuable observations, namely the basal localization of FGF receptors which mirrors that of BMP and Nodal receptors, as well as the existence of a positive feedback loop centered on FGF signaling that drives primitive-streak differentiation. The authors also perform a comparison of the role of different FGFs across species and try to assign specific functions to individual FGFs. In the absence of clean genetic loss-of-function cell lines, this part of the work remains less strong.

      We thank the reviewer for emphasizing the value of our findings in a human model for gastrulation. We agree more loss-of-function experiments would provide further insight into the role of different FGFs, and we plan to provide additional data along these lines in the revised manuscript.

    1. Author response:

      We thank the reviewers for their thoughtful comments and constructive suggestions. We describe how we will address each point below and are grateful for the guidance on areas where our work could be clarified or expanded. In particular, we note the following:

      Selection scan summary statistics: In our revised manuscript, we will include summary statistics from the selection scans. We believe this addition will enhance transparency and provide additional context for readers.

      Reporting of outliers: As highlighted by the editor, the reviewers expressed differing views on the most appropriate way to report outliers. To provide a comprehensive and balanced presentation, we will report both the empirical selection statistics and the corresponding converted p-values. This dual approach will allow readers to fully interpret the results under both perspectives.

      Methodological considerations: We have carefully considered the reviewers' methodological suggestions and will incorporate them into our revisions where possible. These changes strengthen the rigor and clarity of the analyses.
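
      As an illustration of what reporting both an empirical statistic and a converted p-value can look like, the sketch below uses a rank-based conversion: the p-value of a window is the fraction of genome-wide windows with a statistic at least as extreme. This is only one common choice and is an assumption here. The conversion we actually apply is not specified in this response, and the function name `empirical_p_values` and the example scores are hypothetical.

```python
import numpy as np


def empirical_p_values(stats):
    """Rank-based empirical p-values for a genome-wide set of statistics.

    For each value x_i, returns the fraction of all values that are >= x_i,
    so the largest statistic receives the smallest p-value (1 / n).
    """
    x = np.asarray(stats, dtype=float)
    sorted_x = np.sort(x)
    # Index of the first element >= x_i in the sorted array.
    first_ge = np.searchsorted(sorted_x, x, side="left")
    n_ge = x.size - first_ge
    return n_ge / x.size


# Hypothetical per-window scores (not real data).
scores = np.array([0.1, 2.5, 0.7, 1.3])
p_values = empirical_p_values(scores)
print(p_values)
```

      A rank-based p-value of this kind describes how unusual a window is relative to the rest of the genome. It is not a calibrated test against a neutral model, which is one reason to report the raw statistic alongside it.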

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The paper reports an analysis of whole-genome sequence data from 40 Faroese. The authors investigate aspects of demographic history and natural selection in this population. The key findings are that the Faroese (as expected) have a small population size and are broadly of Northwest European ancestry. Accordingly, selection signatures are largely shared with other Northwest European populations, although the authors identify signals that may be specific to the Faroes. Finally, they identify a few predicted deleterious coding variants that may be enriched in the Faroes.

      Strengths:

      The data are appropriately quality-controlled and appear to be of high quality. Some aspects of the Faroese population history are characterized, in particular, by the relatively (compared to other European populations) high proportion of long runs of homozygosity, which may be relevant for disease mapping of recessive variants. The selection analysis is presented reasonably, although as the authors point out, many aspects, for example differences in iHS, can reflect differences in demographic history or population-specific drift and thus can't reliably be interpreted in terms of differences in the strength of selection.

      Weaknesses:

      The main limitations of the paper are as follows:

      (1) The data are not available. I appreciate that (even de-identified) genotype data cannot be shared; however, that does substantially reduce the value of the paper. Minimally, I think the authors should share summary statistics for the selection scans, in line with the standard of the field.

      We agree with the reviewer that sharing the selection scan results is important, so in the next revision of this manuscript we will make the selection scan summary statistics publicly available, and clearly lay out the guidelines and research questions for which the data can be accessed.

      (2) The insight into the population history of the Faroes is limited, relative to what is already known (i.e., they were settled around 1200 years ago, by people with a mixture of Scandinavian and British ancestry, have a small effective population size, and any admixture since then comes from substantially similar populations). It's obvious, for example, that the Faroese population has a smaller bottleneck than, say, GBR.

      More sophisticated analyses (for example, ARG-based methods, or IBD or rare variant sharing) would be able to reveal more detailed and fine-scale information about the history of the populations that is not already known. PCA, ADMIXTURE, and HaploNet analysis are broad summaries, but the interesting questions here would be more specific to the Faroes, for example, what are the proportions of Scandinavian vs Celtic ancestry? What is the date and extent of sex bias (as suggested by the uniparental data) in this admixture? I think that it is a bit of a missed opportunity not to address these questions.

      We clarify that we did quantify the proportions of various ancestry components as estimated by HaploNet in main text Figure 5 and supplemental figures S5 and S6. In our revisions, we will include the average global ancestry of the various components in the Main Text so that this result is more clear.

      We agree that more fine-scale demographic analyses would be informative. We have begun working on an estimation of the admixture date, for example, but have encountered problems with using different standard date estimation software, which give very inconsistent and unstable results. We suspect this might be due to the strong bottleneck experienced in the history of the Faroe Islands breaking one or more of the assumptions of these methods. We will continue working on this problem in coming months, possibly using simulations to assess where the problem might be. We recognize that our relatively small sample size places limits on the fine-scale demographic analyses that can be performed. We are addressing this in ongoing work by generating a larger cohort, which we hope will enable more detailed inference in the future.

      (3) I don't really understand the rationale for looking at HLA-B allele frequencies. The authors write that "ankylosing spondylitis (AS) may be at a higher prevalence in the Faroe Islands (unpublished data), however, this has not been confirmed by follow-up epidemiological studies". So there's no evidence (certainly no published evidence) that AS is more prevalent, and hence nothing to explain with the HLA allele frequencies?

      We agree that no published studies have confirmed a higher prevalence of ankylosing spondylitis (AS) in the Faroe Islands. Our recruitment data suggest that AS might be more common than in other European populations, but we understand that this is only based on limited, unpublished observations and what we are hearing from the community. We emphasized in our original manuscript that this is based on observational evidence from the FarGen project. However, as this reviewer pointed out, we can be more clear that this prevalence has not been formally studied.

      In our next revision we will clarify in the text that our recruitment data suggest a higher prevalence of AS may be possible, but more formal epidemiological studies are needed to confirm this observation. The reason we study HLA-B allele frequencies is to see if the genetic background of the Faroese population could help explain this possible difference, since HLA-B27 is already known to play a strong role in AS.

      Reviewer #2 (Public review):

      In this paper, Hamid et al present 40 genomes from the Faroe Islands. They use these data (a pilot study for an anticipated larger-scale sequencing effort) to discuss the population genetic diversity and history of the sample, and the Faroes population. I think this is an overall solid paper; it is overall well-polished and well-written. It is somewhat descriptive (as might be expected for an explorative pilot study), but does make good use of the data.

      The data processing and annotation follows a state-of-the-art protocol, and at least I could not find any evidence in the results that would pinpoint towards bioinformatic issues having substantially biased some of the results, and at least preliminary results lead to the identification of some candidate disease alleles, showing that small, isolated cohorts can be an efficient way to find populations with locally common, but globally rare disease alleles.

      I also enjoyed the population structure analysis in the context of ancient samples, which gives some context to the genetic ancestry of Faroese, although it would have been nice if that could have been quantified, and it is unfortunate that the sampling scheme effectively precludes within-Faroes analyses.

      We note that although the ancestry proportions are not specified in the main text, we did quantify ancestry proportions in the modern Faroese individuals and other ancient samples, and we visualized these proportions in Figure 5 and Supplementary Figures S5 and S6. As stated in our response to Reviewer #1, in our revisions, we will more clearly state the average global ancestry of the various components in the Main Text.

      I am unfortunately quite critical of the selection analysis, both on a statistical level and, more importantly, I do not believe it measures what the authors think it does.

      Major comments:

      (1) Admixture timing/genomic scaling/localization:

      As the authors lay out, the Faroes were likely colonized in the last 1,000-1,500 years, i.e., 40-60 generations ago. That means most genomic processes that have happened on the Faroese should have signatures that are on the order of ~1-2cM, whereas more local patterns likely indicate genetic history predating the colonization of the islands. Yet, the paper seems to be oblivious to this (to me) fascinating and somewhat unique premise. Maybe this thought is wrong, but I think the authors miss a chance here to explain why the reader should care beyond the fact that the small populations might have high-frequency risk alleles and the Faroes are intrinsically interesting, but more importantly, it also makes me think it leads to some misinterpretations in the selection analysis.

      See response to point #3

      (2) ROH:

      Would the sampling scheme impact ROH? How would it deal with individuals with known parental coancestry? As an example of what I mean by my previous comment, 1MB is short enough in that I would expect most/many 1MB ROH-tracts to come from pedigree loops predating the colonization of the Faroes. (i.e, I am actually quite surprised that there isn't much more long ROH, which makes me wonder if that would be impacted by the sampling scheme).

      The sampling scheme was designed to choose 40 Faroese individuals that were representative of the different regions and were minimally related. There were no pairs of third-degree relatives or closer (pi-hat > 0.125) in either the Faroese cohort or the reference populations. It is possible that this sampling scheme would reduce the amount of longer ROHs in the population, but we should still be able to see overall patterns of ROH reflective of bottlenecks in the past tens of generations. Additionally, based on this reviewer's earlier comment, 1 Mb ROHs would still be relevant to demographic events in the last 40-60 generations given that on average 1 cM corresponds to 1 Mb in humans, though we recognize that is not an exact conversion.
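
      The scaling argument above can be sketched numerically. This is only a back-of-the-envelope illustration, assuming the textbook approximation that a segment passed through m meioses has a mean length of 100/m cM, and the genome-wide average of about 1 cM per Mb mentioned in the response. The function names and the `CM_PER_MB` constant are illustrative and are not taken from the manuscript.

```python
def expected_segment_length_cm(meioses: int) -> float:
    """Mean length in cM of a segment that has passed through `meioses` meioses.

    Uses the textbook exponential approximation: mean = 100 / meioses.
    """
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    return 100.0 / meioses


def expected_roh_length_cm(generations: int) -> float:
    """Mean ROH length for a common ancestor `generations` back.

    Both haplotypes trace to that ancestor, so the segment spans 2 * g meioses.
    """
    return expected_segment_length_cm(2 * generations)


CM_PER_MB = 1.0  # assumed genome-wide average; local recombination rates vary widely

for g in (40, 60):
    roh_cm = expected_roh_length_cm(g)
    print(f"g={g}: mean ROH ~ {roh_cm:.2f} cM (~ {roh_cm / CM_PER_MB:.2f} Mb)")
```

      Under these assumptions, pedigree loops dating to the settlement period give mean ROH lengths near 1 cM, which is why a 1 Mb threshold sits close to the boundary between pre- and post-settlement events.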

      That said, the “sum total amount of the genome contained in long ROH” as we described in the manuscript includes all ROHs greater than 1Mb. Although we group all ROHs longer than 1Mb into one category in the current manuscript, we can look more specifically at the distribution of the longer ROH in future revisions and add discussion into what this might tell us about the timing of bottlenecks. 

      For now, we share a plot of the distribution in ROH lengths across all individuals for each cohort. As this plot shows, there certainly are ROHs longer than 1Mb in the Faroese cohort, and on average there is a higher proportion of long ROH particularly in the 5-15 Mb range in the Faroese cohort relative to the other cohorts.

      Author response image 1.

      (3) Selection scan:

      We are talking about a bottlenecked population that is recently admixed (Faroese), compared to a population (GBR) putatively more closely related to one of its sources. My guess would be that selection in such a scenario would be possibly very hard to detect, and even then, selection signals might not differentiate selection in Faroese vs. GBR, but rather selection/allele frequency differences between different source populations. I think it would be good to spell out why XP-EHH/iHS measures selection at the correct time scale, and how/if these statistics are expected to behave differently in an admixed population.

      The reviewer brings up good points about the utility of classical selection statistics in populations that are admixed or bottlenecked, and whether the timescale at which these statistics detect selection is relevant for understanding the selective history of the Faroese population. We break down these concerns separately.

      (1) Bottlenecks: Recent bottlenecks result in higher LD within a population. However, demographic events such as bottlenecks affect global genomic patterns while positive selection is expected to affect local genomic patterns. For this reason, iHS and XP-EHH statistics are standardized against the genome-wide background, to account for population-specific demographic history.
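
      As a minimal sketch of what standardizing against the genome-wide background can look like, the function below z-scores each raw score against other SNPs in the same derived-allele-frequency bin. This is the commonly described approach for iHS. The bin count, the function name, and the toy inputs are assumptions for illustration and do not reproduce the authors' pipeline.

```python
from statistics import mean, stdev


def standardize_in_bins(raw_scores, derived_freqs, n_bins=20):
    """Z-score each raw score against SNPs in the same frequency bin.

    A genome-wide shift or rescaling of the raw scores (for example from a
    bottleneck that inflates LD everywhere) cancels out within each bin.
    Entries are None when a bin is too small or has zero spread.
    """
    bins = {}
    for idx, freq in enumerate(derived_freqs):
        b = min(int(freq * n_bins), n_bins - 1)  # freq == 1.0 goes in the last bin
        bins.setdefault(b, []).append(idx)

    standardized = [None] * len(raw_scores)
    for members in bins.values():
        if len(members) < 2:
            continue  # cannot estimate a spread from a single SNP
        values = [raw_scores[i] for i in members]
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue
        for i in members:
            standardized[i] = (raw_scores[i] - mu) / sd
    return standardized


# Toy data (hypothetical): two frequency bins with very different raw scales.
freqs = [0.10, 0.12, 0.13, 0.50, 0.52, 0.55]
raw = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
print(standardize_in_bins(raw, freqs, n_bins=10))
```

      Standardization of this kind removes genome-wide shifts in location and scale. It does not by itself change the shape of the tails.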

      (2) Admixture: The term “admixture” has different interpretations depending on the line of inquiry and the populations being studied. Across various time and geographic scales, all human populations are admixed to some degree, as gene flow between groups is a common fixture throughout our history. For example, even the modern British population has “admixed” ancestry from North / West European sources as well, dating to at least as recently as the Medieval & Viking periods (Gretzinger et al. 2022, Leslie et al. 2015), yet we do not commonly consider it an “admixed” population, and we are not typically concerned about applying haplotype-based statistics in this population. This is due to the low divergence between the source populations. In the case of the Faroe Islands, we believe admixture likely occurred on a similar timescale. We see low variance in ancestry proportions estimated by HaploNet, both from the historical Faroese individuals (250BP) and the modern samples. This indicates admixture predating the settlement of the Faroe Islands, where recombination has had time to break up long ancestry tracts and the global ancestry proportions have reached an equilibrium. That is, these ancestry patterns suggest that the modern Faroese are most likely descended from already admixed founders. We mention this as a likely possibility in the main text: “This could have occurred either via a mixture of the original “West Europe” ancestry with individuals of predominantly “North Europe” ancestry, or a by replacement with individuals that were already of mixed ancestry at the time of arrival in the islands (the latter are not uncommon in Viking Age mainland Europe).” And, as with the case of the British population, the closely-related ancestral sources for the Faroese founders were likely not so diverged as to have differences in allele frequencies and long-range haplotypes that would disrupt signals of selection from iHS or XP-EHH.

      (3) Time scale: It is certainly possible, and in fact likely, that iHS measures selection older than the settlement of the Faroe Islands. In our manuscript, we calculated iHS in both the Faroese and the closely related British cohort, and we highlight in the Main Text that the top signals, with the exception of LCT, are shared between the two cohorts, indicative of selection that began prior to the population split. iHS is a commonly calculated statistic, and it is often calculated in a single population without comparing to others, so we feel it is important to show our result demonstrating these shared selection signals. In future revisions, we will emphasize in the main text that we are not claiming to have identified selection that occurred in the Faroese population post-settlement with the iHS statistic. As for XP-EHH, it is a statistic designed to identify differentiated variants that are fixed or approaching fixation in one population but not others. The time-scale of selection that XP-EHH can detect would therefore be dependent on the populations used for comparison. As XP-EHH has the best power to identify alleles that are fixed or approaching fixation in one population but not others, it is less likely to detect older selection events / incomplete sweeps from the source populations.

      In our next revision, we will more clearly state limitations of these statistics under various population histories, and clarify the time-scale at which we are detecting selection for iHS vs XP-EHH.

      (4) Similarly, for the discussion of LCT, I am not convinced that the haplotypes depicted here are on the right scale to reflect processes happening on the Faroes. Given the admixture/population history, it at the very least should be discussed in the context of whether the 13910 allele frequency on the Faroes is at odds with what would be expected based on the admixture sources.

      We agree that more investigation into the LCT allele frequency in the other ancient samples may provide some insight into the selection history, particularly in light of ancient admixture. Please note, we did look at the allele frequency of the LCT allele rs4988235 and stated in the main text that it was present at high frequencies in the historical (250BP) Faroese samples. The frequency of this allele in the imputed historical Faroese samples is 82% while the allele is present at ~74% frequency in modern samples. We did not report the exact percentage in the main text because the sample size of the historical samples (11 individuals) is small and coverage of ancient samples is low, leading to potential errors in imputation. However, we can try to calculate the LCT allele frequency in other ancient samples, and assuming that we have good proxies for the sources at the time of admixture, we may calculate the expected allele frequency in the admixed ancestors of the Faroese founders in the next revision.
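
      The two calculations mentioned above can be sketched as follows. The first function gives the allele frequency expected in an admixed population as the ancestry-weighted mean of the source frequencies. The second gives a Wilson score interval, which illustrates how much uncertainty a small sample carries. All inputs are assumptions for illustration: the source frequencies and ancestry proportions are hypothetical placeholders, and treating 82% in 11 diploid individuals as 18 of 22 chromosomes ignores that imputed dosages need not be whole counts.

```python
from math import sqrt


def expected_admixed_frequency(source_freqs, ancestry_props):
    """Ancestry-weighted mean allele frequency, assuming no drift or selection."""
    if len(source_freqs) != len(ancestry_props):
        raise ValueError("need one ancestry proportion per source")
    if abs(sum(ancestry_props) - 1.0) > 1e-6:
        raise ValueError("ancestry proportions must sum to 1")
    return sum(p * a for p, a in zip(source_freqs, ancestry_props))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical sources: frequencies 0.5 and 0.9, mixed 25% / 75%.
print(expected_admixed_frequency([0.5, 0.9], [0.25, 0.75]))

# Assumed count: 18 of 22 chromosomes carrying the allele (about 82%).
low, high = wilson_interval(18, 22)
print(f"approximate 95% interval: {low:.2f} to {high:.2f}")
```

      With a sample this small the interval is wide, which is consistent with the authors' caution about reporting an exact historical frequency.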

      (5) I am lacking information to evaluate the procedure for turning the outliers into p-values. Both iHS and XP-EHH are ratio statistics, meaning they might be heavy-tailed if one is not careful, and the central limit theorem may not apply. It would be much easier (and probably sufficient for the points being made here) to reframe this analysis in terms of empirical outliers.

      Given that there are disagreements on the best approach to reporting selection scan results from the reviewers, in our revision, we can additionally supply both the standardized iHS / XP-EHH values in the supplementary information as well as these values transformed to p-values. As the p-values are derived from the empirical distribution, the “significant” p-values are also empirical outliers from the empirical distribution, so the conclusions of the manuscript do not change. We found that the p-value approach and controlling for FDR is more conservative, with fewer signals reaching “significance” than are considered empirical outliers based on common approaches such as IQR or arbitrary percentile cutoffs.
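
      To make the comparison between the two reporting styles concrete, here is a minimal sketch. It assumes one common recipe, which is not necessarily the authors': standardized scores are converted to two-sided p-values under a normal approximation and then filtered with Benjamini-Hochberg. The alternative simply flags the top fraction of absolute scores as empirical outliers. The normal approximation is exactly the step the reviewer questions for heavy-tailed ratio statistics. All names and the toy scores are hypothetical.

```python
from math import erfc, sqrt


def normal_two_sided_p(z_score):
    """Two-sided p-value for a standardized score under a normal approximation."""
    return erfc(abs(z_score) / sqrt(2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


def percentile_outliers(scores, top_fraction=0.01):
    """Flag the top fraction of |score| as empirical outliers (no p-values)."""
    n_top = max(1, int(len(scores) * top_fraction))
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    flagged = [False] * len(scores)
    for i in order[:n_top]:
        flagged[i] = True
    return flagged


# Toy scores (hypothetical): mostly background, two extreme values.
scores = [0.1] * 97 + [5.0, -6.0, 0.2]
p_vals = [normal_two_sided_p(s) for s in scores]
print(sum(benjamini_hochberg(p_vals)), sum(percentile_outliers(scores)))
```

      A percentile cutoff always flags a fixed share of sites, whereas the FDR route can flag none. Which of the two is more conservative therefore depends on the data and on the thresholds chosen.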

      (6) Oldest individual predating gene flow: It seems impossible to make any statements based on a single individual. Why is it implausible that this person (or their parents), e.g., moved to the Faroes within their lifetime and died there?

      We agree with the reviewer that this is a plausible explanation, and in future revisions we will update the main text to acknowledge this possibility.

    1. Author response:

      Reviewer #1 (Public review):

      Wang et al., recorded concurrent EEG-fMRI in 107 participants during nocturnal NREM sleep to investigate brain activity and connectivity related to slow oscillations (SO), sleep spindles, and in particular their co-occurrence. The authors found SO-spindle coupling to be correlated with increased thalamic and hippocampal activity, and with increased functional connectivity from the hippocampus to the thalamus and from the thalamus to the neocortex, especially the medial prefrontal cortex (mPFC). They concluded the brain-wide activation pattern to resemble episodic memory processing, but to be dissociated from task-related processing and suggest that the thalamus plays a crucial role in coordinating the hippocampal-cortical dialogue during sleep.

      The paper offers an impressively large and highly valuable dataset that provides the opportunity for gaining important new insights into the network substrate involved in SOs, spindles, and their coupling. However, the paper does unfortunately not exploit the full potential of this dataset with the analyses currently provided, and the interpretation of the results is often not backed up by the results presented. I have the following specific comments.

      Thank you for your thoughtful and constructive feedback. We greatly appreciate your recognition of the strengths of our dataset and findings. Below, we address your specific comments and provide responses to each point you raised to ensure our methods and results are as transparent and comprehensible as possible. We hope these revisions address your comments and further strengthen our manuscript. Thank you again for the constructive feedback.

      (1) The introduction is lacking sufficient review of the already existing literature on EEG-fMRI during sleep and the BOLD-correlates of slow oscillations and spindles in particular (Laufs et al., 2007; Schabus et al., 2007; Horovitz et al., 2008; Laufs, 2008; Czisch et al., 2009; Picchioni et al., 2010; Spoormaker et al., 2010; Caporro et al., 2011; Bergmann et al., 2012; Hale et al., 2016; Fogel et al., 2017; Moehlman et al., 2018; Ilhan-Bayrakci et al., 2022). The few studies mentioned are not discussed in terms of the methods used or insights gained.

      We acknowledge the need for a more comprehensive review of prior EEG-fMRI studies investigating BOLD correlates of slow oscillations and spindles. However, not all of the cited articles concern sleep SOs or spindles. Several of them (Hale et al., 2016; Horovitz et al., 2008; Laufs, 2008; Laufs, Walker, & Lund, 2007; Spoormaker et al., 2010) mainly focus on EEG-fMRI methodology, sleep stages, or brain networks, which are not the focus of our study. Thank you again for your attention to the comprehensiveness of our literature review. We will expand the introduction to include a more detailed discussion of the existing literature, ensuring that the contributions of previous EEG-fMRI sleep studies are adequately acknowledged.

      Introduction, Page 4 Lines 62-76

      “Investigating these sleep-related neural processes in humans is challenging because it requires tracking transient sleep rhythms while simultaneously assessing their widespread brain activation. Recent advances in simultaneous EEG-fMRI techniques provide a unique opportunity to explore these processes. EEG allows for precise event-based detection of neural signal, while fMRI provides insight into the broader spatial patterns of brain activation and functional connectivity (Horovitz et al., 2008; Huang et al., 2024; Laufs, 2008; Laufs, Walker, & Lund, 2007; Schabus et al., 2007; Spoormaker et al., 2010). Previous EEG-fMRI studies on sleep have focused on classifying sleep stages or examining the neural correlates of specific waves (Bergmann et al., 2012; Caporro et al., 2012; Czisch et al., 2009; Fogel et al., 2017; Hale et al., 2016; Ilhan-Bayrakcı et al., 2022; Moehlman et al., 2019; Picchioni et al., 2011). These studies have generally reported that slow oscillations are associated with widespread cortical and subcortical BOLD changes, whereas spindles elicit activation in the thalamus, as well as in several cortical and paralimbic regions. Although these findings provide valuable insights into the BOLD correlates of sleep rhythms, they often do not employ sophisticated temporal modeling (Huang et al., 2024), to capture the dynamic interactions between different oscillatory events, e.g., the coupling between SOs and spindles.”

      (2) The paper falls short in discussing the specific insights gained into the neurobiological substrate of the investigated slow oscillations, spindles, and their interactions. The validity of the inverse inference approach ("Open ended cognitive state decoding"), assuming certain cognitive functions to be related to these oscillations because of the brain regions/networks activated in temporal association with these events, is debatable at best. It is also unclear why eventually only episodic memory processing-like brain-wide activation is discussed further, despite the activity of 16 of 50 feature terms from the NeuroSynth v3 dataset were significant (episodic memory, declarative memory, working memory, task representation, language, learning, faces, visuospatial processing, category recognition, cognitive control, reading, cued attention, inhibition, and action).

      Thank you for pointing this out, particularly regarding the use of inverse inference approaches such as “open-ended cognitive state decoding.” Given the concerns about the indirectness of this approach, we decided to remove its related content and results from Figure 3 in the main text and include it in Supplementary Figure 7. We will refocus the main text on direct neurobiological insights gained from our EEG-fMRI analyses, particularly emphasizing the hippocampal-thalamocortical network dynamics underlying SO-spindle coupling, and we will acknowledge the exploratory nature of these findings and highlight their limitations.

      Discussion, Page 17-18 Lines 323-332

      “To explore functional relevance, we employed an open-ended cognitive state decoding approach using meta-analytic data (NeuroSynth: Yarkoni et al. (2011)). Although this method usefully generates hypotheses about potential cognitive processes, particularly in the absence of a pre- and post-sleep memory task, it is inherently indirect. Many cognitive terms showed significant associations (16 of 50), such as “episodic memory,” “declarative memory,” and “working memory.” We focused on episodic/declarative memory given the known link with hippocampal reactivation (Diekelmann & Born, 2010; Staresina et al., 2015; Staresina et al., 2023). Nonetheless, these inferences regarding memory reactivation should be interpreted cautiously without direct behavioral measures. Future research incorporating explicit tasks before and after sleep would more rigorously validate these potential functional claims.”

      (3) Hippocampal activation during SO-spindles is stated as a main hypothesis of the paper - for good reasons - however, other regions (e.g., several cortical as well as thalamic) would be equally expected given the known origin of both oscillations and the existing sleep-EEG-fMRI literature. However, this focus on the hippocampus contrasts with the focus on investigating the key role of the thalamus instead in the Results section.

      We appreciate your insight regarding the relative emphasis on hippocampal and thalamic activation in our study. We recognize that the manuscript may currently present an inconsistency between our initial hypothesis and the main focus of the results. To address this concern, we will ensure that our Introduction and Discussion sections explicitly discuss both regions, highlighting the complementary roles of the hippocampus (memory processing and reactivation) and the thalamus (spindle generation and cortico-hippocampal coordination) in SO-spindle dynamics.

      Introduction, Page 5 Lines 87-103

      “To address this gap, our study investigates brain-wide activation and functional connectivity patterns associated with SO-spindle coupling, and employs a cognitive state decoding approach (Margulies et al., 2016; Yarkoni et al., 2011)—albeit indirectly—to infer potential cognitive functions. In the current study, we used simultaneous EEG-fMRI recordings during nocturnal naps (detailed sleep staging results are provided in the Methods and Table S1) in 107 participants. Although directly detecting hippocampal ripples using scalp EEG or fMRI is challenging, we expected that hippocampal activation in fMRI would coincide with SO-spindle coupling detected by EEG, given that SOs, spindles, and ripples frequently co-occur during NREM sleep. We also anticipated a critical role of the thalamus, particularly thalamic spindles, in coordinating hippocampal-cortical communication.

      We found significant coupling between SOs and spindles during NREM sleep (N2/3), with spindle peaks occurring slightly before the SO peak. This coupling was associated with increased activation in both the thalamus and hippocampus, with functional connectivity patterns suggesting thalamic coordination of hippocampal-cortical communication. These findings highlight the key role of the thalamus in coordinating hippocampal-cortical interactions during human sleep and provide new insights into the neural mechanisms underlying sleep-dependent brain communication. A deeper understanding of these mechanisms may contribute to future neuromodulation approaches aimed at enhancing sleep-dependent cognitive function and treating sleep-related disorders.”

      Discussion, Page 16-17 Lines 292-307

      “When modeling the timing of these sleep rhythms in the fMRI, we observed hippocampal activation selectively during SO-spindle events. This suggests the possibility of triple coupling (SOs–spindles–ripples), even though our scalp EEG was not sufficiently sensitive to detect hippocampal ripples—key markers of memory replay (Buzsáki, 2015). Recent iEEG evidence indicates that ripples often co-occur with both spindles (Ngo, Fell, & Staresina, 2020) and SOs (Staresina et al., 2015; Staresina et al., 2023). Therefore, the hippocampal involvement during SO-spindle events in our study may reflect memory replay from the hippocampus, propagated via thalamic spindles to distributed cortical regions.

      The thalamus, known to generate spindles (Halassa et al., 2011), plays a key role in producing and coordinating sleep rhythms (Coulon, Budde, & Pape, 2012; Crunelli et al., 2018), while the hippocampus is found essential for memory consolidation (Buzsáki, 2015; Diba & Buzsáki, 2007; Singh, Norman, & Schapiro, 2022). The increased hippocampal and thalamic activity, along with strengthened connectivity between these regions and the mPFC during SO-spindle events, underscores a hippocampal-thalamic-neocortical information flow. This aligns with recent findings suggesting the thalamus orchestrates neocortical oscillations during sleep (Schreiner et al., 2022). The thalamus and hippocampus thus appear central to memory consolidation during sleep, guiding information transfer to the neocortex, e.g., mPFC.”

      (4) The study included an impressive number of 107 subjects. It is surprising though that only 31 subjects had to be excluded under these difficult recording conditions, especially since no adaptation night was performed. Since only subjects were excluded who slept less than 10 min (or had excessive head movements) there are likely several datasets included with comparably short durations and only a small number of SOs and spindles and even less combined SO-spindle events. A comprehensive table should be provided (supplement) including for each subject (included and excluded) the duration of included NREM sleep, number of SOs, spindles, and SO+spindle events. Also, some descriptive statistics (mean/SD/range) would be helpful.

      We appreciate your recognition of our sample size and the challenges associated with simultaneous EEG-fMRI sleep recordings. We acknowledge the importance of transparently reporting individual subject data, particularly regarding sleep duration and the number of detected SOs, spindles, and SO-spindle events. To address this, we will provide comprehensive tables in the supplementary materials. These contain descriptive information about sleep-related characteristics (Table S1), as well as detailed information about sleep waves at each sleep stage for all 107 subjects (Table S2-S4), listing for each subject: (1) duration of each sleep stage; (2) number of detected SOs; (3) number of detected spindles; (4) number of detected SO-spindle coupling events; (5) density of detected SOs; (6) density of detected spindles; (7) density of detected SO-spindle coupling events.

      However, most of the excluded participants were unable to fall asleep or slept too briefly, so they had essentially no NREM sleep. For these participants it was therefore not possible to report NREM sleep duration or the numbers of SOs, spindles, and coupling events.

      Supplementary Materials, Page 42-54, Table S1-S4

      (Given their length, we do not list all the tables here. Please refer to the revised manuscript.)

      (5) Was the 20-channel head coil dedicated for EEG-fMRI measurements? How were the electrode cables guided through/out of the head coil? Usually, the 64-channel head coil is used for EEG-fMRI measurements in a Siemens PRISMA 3T scanner, which has a cable duct at the back that allows to guide the cables straight out of the head coil (to minimize MR-related artifacts). The choice for the 20-channel head coil should be motivated. Photos of the recording setup would also be helpful.

      Thank you for your comment regarding our choice of the 20-channel head coil for EEG-fMRI measurements. We acknowledge that the 64-channel head coil is commonly used in Siemens PRISMA 3T scanners; however, the 20-channel coil was selected due to specific practical and technical considerations in our study. In particular, the 20-channel head coil was compatible with our EEG system and ensured sufficient signal-to-noise ratio (SNR) for both EEG and fMRI acquisition. The EEG electrode cables were guided through the lateral and posterior openings of the head coil, secured with foam padding to reduce motion and minimize MR-related artifacts. Moreover, given the extended nature of nocturnal sleep recordings, the 20-channel coil allowed us to maintain participant comfort while still achieving high-quality simultaneous EEG-fMRI data.

      We have made this clearer in the revised manuscript.

      Methods, Page 20 Lines 385-392

      “All MRI data were acquired using a 20-channel head coil on a research-dedicated 3-Tesla Siemens Magnetom Prisma MRI scanner. Earplugs and cushions were provided for noise protection and head motion restriction. We chose the 20-channel head coil because it was compatible with our EEG system and ensured sufficient signal-to-noise ratio (SNR) for both EEG and fMRI acquisition. The EEG electrode cables were guided through the lateral and posterior openings of the head coil, secured with foam padding to reduce motion and minimize MR-related artifacts. Moreover, given the extended nature of nocturnal sleep recordings, the 20-channel coil helped maintain participant comfort while still achieving high-quality simultaneous EEG-fMRI data.”

      (6) Was the EEG sampling synchronized to the MR scanner (gradient system) clock (the 10 MHz signal; not referring to the volume TTL triggers here)? This is a requirement for stable gradient artifact shape over time and thus accurate gradient noise removal.

      Thank you for raising this important point. We confirm that the EEG sampling was synchronized to the MR scanner’s 10 MHz gradient system clock, ensuring a stable gradient artifact shape over time and enabling accurate artifact removal. This synchronization was achieved using the standard clock synchronization interface of the EEG amplifier, minimizing timing jitter and drift. As a result, the gradient artifact waveform remained stable across volumes, allowing for more effective artifact correction during preprocessing. We appreciate your attention to this critical aspect of EEG-fMRI data acquisition.

      We have made this clearer in the revised manuscript.

      Methods, Page 19-20 Lines 371-383

      “EEG was recorded simultaneously with fMRI data using an MR-compatible EEG amplifier system (BrainAmps MR-Plus, Brain Products, Germany), along with a specialized electrode cap. The recording was done using 64 channels in the international 10/20 system, with the reference channel positioned at FCz. In order to adhere to polysomnography (PSG) recording standards, six electrodes were removed from the EEG cap: one for electrocardiogram (ECG) recording, two for electrooculogram (EOG) recording, and three for electromyogram (EMG) recording. EEG data was recorded at a sample rate of 5000 Hz, the resistance of the reference and ground channels was kept below 10 kΩ, and the resistance of the other channels was kept below 20 kΩ. To synchronize the EEG and fMRI recordings, the BrainVision recording software (BrainProducts, Germany) was utilized to capture triggers from the MRI scanner. The EEG sampling was synchronized to the MR scanner’s 10 MHz gradient system clock, ensuring a stable gradient artifact shape over time and enabling accurate artifact removal. This was achieved via the standard clock synchronization interface of the EEG amplifier, minimizing timing jitter and drift.”

      (7) The TR is quite long and the voxel size is quite large in comparison to state-of-the-art EPI sequences. What was the rationale behind choosing a sequence with relatively low temporal and spatial resolution?

      We acknowledge that our chosen TR and voxel size are relatively long and large compared to state-of-the-art EPI sequences. This decision was made to optimize the signal-to-noise ratio (SNR) and reduce susceptibility-related distortions, which are particularly critical in EEG-fMRI sleep studies where head motion and physiological noise can be substantial. A longer TR allowed us to sample whole-brain activity with sufficient coverage, while a larger voxel size helped enhance BOLD sensitivity and minimize partial volume effects in deep brain structures such as the thalamus and hippocampus, which are key regions of interest in our study. We appreciate your concern and hope this clarification provides sufficient rationale for our sequence parameters.

      We have made this clearer in the revised manuscript.

      Methods, Page 20-21 Lines 398-408

      “Then, the “sleep” session began after the participants were instructed to try and fall asleep. For the functional scans, whole-brain images were acquired using k-space and steady-state T2*-weighted gradient echo-planar imaging (EPI) sequence that is sensitive to the BOLD contrast. This measures local magnetic changes caused by changes in blood oxygenation that accompany neural activity (sequence specification: 33 slices in interleaved ascending order, TR = 2000 ms, TE = 30 ms, voxel size = 3.5 × 3.5 × 4.2 mm³, FA = 90°, matrix = 64 × 64, gap = 0.7 mm). A relatively long TR and larger voxel size were chosen to optimize SNR and reduce susceptibility-related distortions, which are critical in EEG-fMRI sleep studies where head motion and physiological noise can be substantial. The longer TR allowed whole-brain coverage with sufficient temporal resolution, while the larger voxel size helped enhance BOLD sensitivity and minimize partial volume effects in deep brain structures (e.g., the thalamus and hippocampus), which are key regions of interest in this study.”

      (8) The anatomically defined ROIs are quite large. It should be elaborated on how this might reduce sensitivity to sleep rhythm-specific activity within sub-regions, especially for the thalamus, which has distinct nuclei involved in sleep functions.

      We appreciate your insight regarding the use of anatomically defined ROIs and their potential limitations in detecting sleep rhythm-specific activity within sub-regions, particularly in the thalamus. Given the distinct functional roles of thalamic nuclei in sleep processes, we acknowledge that using a single, large thalamic ROI may reduce sensitivity to localized activity patterns. To address this, we will discuss this limitation in the revised manuscript, acknowledging that our approach prioritizes whole-structure effects but may not fully capture nucleus-specific contributions.

      Discussion, Page 18 Lines 333-341

      “Despite providing new insights, our study has several limitations. First, our scalp EEG did not directly capture hippocampal ripples, preventing us from conclusively demonstrating triple coupling. Second, the combination of EEG-fMRI and the lack of a memory task limit our ability to parse fine-grained BOLD responses at the DOWN- vs. UP-states of SOs and link observed activations to behavioral outcomes. Third, the use of large anatomical ROIs may mask subregional contributions of specific thalamic nuclei or hippocampal subfields. Finally, without a memory task, we cannot establish a direct behavioral link between sleep-rhythm-locked activation and memory consolidation. Future studies combining techniques such as ultra-high-field fMRI or iEEG with cognitive tasks may refine our understanding of subregional network dynamics and functional significance during sleep.”

      (9) The study reports SO & spindle amplitudes & densities, as well as SO+spindle coupling, to be larger during N2/3 sleep compared to N1 and REM sleep, which is trivial but can be seen as a sanity check of the data. However, the amount of SOs and spindles reported for N1 and REM sleep is concerning, as per definition there should be hardly any (if SOs or spindles occur in N1 it becomes by definition N2, and the interval between spindles has to be considerably large in REM to still be scored as such). Thus, on the one hand, the report of these comparisons takes too much space in the main manuscript as it is trivial, but on the other hand, it raises concerns about the validity of the scoring.

      We appreciate your concern regarding the reported presence of SOs and spindles in N1 and REM sleep and its potential implications. Our methods for detecting SOs, spindles, and their coupling were originally designed only for N2 and N3 sleep data, based on the characteristics of those data, and they are widely recognized and used in sleep research (Hahn et al., 2020; Helfrich et al., 2019; Helfrich et al., 2018; Ngo, Fell, & Staresina, 2020; Schreiner et al., 2022; Schreiner et al., 2021; Staresina et al., 2015; Staresina et al., 2023). However, because the detection of SOs and spindles is based on percentiles, these methods will always detect a certain number of events when applied to data from other stages (N1 and REM), and it remains unclear how those events differ from the ones detected in N2/N3. We will explain the reasons for these results in the Methods section and emphasize that they are used only as sanity checks.

      Methods, Page 25 Lines 515-524

      “We note that the above methods for detecting SOs, spindles, and their couplings were originally developed for N2 and N3 sleep data, based on the specific characteristics of these stages. These methods are widely recognized in sleep research (Hahn et al., 2020; Helfrich et al., 2019; Helfrich et al., 2018; Ngo, Fell, & Staresina, 2020; Schreiner et al., 2022; Schreiner et al., 2021; Staresina et al., 2015; Staresina et al., 2023). However, because this percentile-based detection approach will inherently identify a certain number of events if applied to other stages (e.g., N1 and REM), the nature of these events in those stages remains unclear compared to N2/N3. We nevertheless identified and reported the detailed descriptive statistics of these sleep rhythms in all sleep stages, under the same operational definitions, both for completeness and as a sanity check. Within the same subject, there should be more SOs, spindles, and their couplings in N2/N3 than in N1 or REM (see also Figure S2-S4, Table S1-S4).”
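
      The point about percentile-based detection can be illustrated with a minimal sketch. It assumes that the amplitude threshold is computed as a percentile of the candidate waves within the data being analyzed; the 75th percentile, the function names, and the simulated amplitudes are all hypothetical and are not taken from the manuscript's pipeline.

```python
import random


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def detect_fraction(amplitudes, q=75.0):
    """Fraction of candidate waves above a data-driven percentile threshold."""
    threshold = percentile(amplitudes, q)
    n_detected = sum(a > threshold for a in amplitudes)
    return n_detected / len(amplitudes)


rng = random.Random(0)
# Hypothetical candidate-wave amplitudes (µV): large waves vs. small waves.
n23_like = [abs(rng.gauss(60, 20)) for _ in range(1000)]
rem_like = [abs(rng.gauss(10, 3)) for _ in range(1000)]

frac_n23 = detect_fraction(n23_like)
frac_rem = detect_fraction(rem_like)
print(frac_n23, frac_rem)
```

      Under these assumptions, roughly the same fraction of candidate waves passes the threshold in both simulated stages even though their absolute amplitudes differ several-fold, which is why event counts in N1 and REM cannot by themselves establish that the detected events are physiologically equivalent to those in N2/N3.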

      (10) Why was electrode F3 used to quantify the occurrence of SOs and spindles? Why not a midline frontal electrode like Fz (or a number of frontal electrodes for SOs) and Cz (or a number of centroparietal electrodes) for spindles to be closer to their maximum topography?

      We appreciate your suggestion regarding electrode selection for SO and spindle quantification. Our choice of F3 was primarily based on previous studies (Massimini et al., 2004; Molle et al., 2011), where bilateral frontal electrodes are commonly used for detecting SOs and spindles. Additionally, we considered the impact of MRI-related noise and, after a comprehensive evaluation, determined that F3 provided an optimal balance between signal quality and artifact minimization. We also acknowledge that alternative electrode choices, such as Fz for SOs and Cz for spindles, could provide additional insights into their topographical distributions.

      (11) Functional connectivity (hippocampus -> thalamus -> cortex (mPFC)) is reported to be increased during SO-spindle coupling and interpreted as evidence for coordination of hippocampo-neocortical communication likely by thalamic spindles. However, functional connectivity was only analysed during coupled SO+spindle events, not during isolated SOs or isolated spindles. Without the direct comparison of the connectivity patterns between these three events, it remains unclear whether this is specific for coupled SO+spindle events or rather associated with one or both of the other isolated events. The PPIs need to be conducted for those isolated events as well and compared statistically to the coupled events.

      We appreciate your critical perspective on our functional connectivity analysis and the interpretation of hippocampus-thalamus-cortex (mPFC) interactions during SO-spindle coupling. We acknowledge that, in the original analysis, functional connectivity was only examined during coupled SO-spindle events, without direct comparison to isolated SOs or isolated spindles. To address this concern, we have conducted PPI analyses for all three ROIs (hippocampus, thalamus, mPFC) and all three event types (SO-spindle couplings, isolated SOs, and isolated spindles). Our results indicate that neither isolated SOs nor isolated spindles yielded significant connectivity changes in any of the three ROIs, as none survived multiple comparison correction. This suggests that the observed connectivity increase is specific to SO-spindle coupling, rather than being independently driven by either SOs or spindles alone.

      Results, Page 14 Lines 248-255

      “Crucially, the interaction between FC and SO-spindle coupling revealed that only the functional connectivity of hippocampus -> thalamus (ROI analysis, t<sub>(106)</sub> = 1.86, p = 0.0328) and thalamus -> mPFC (ROI analysis, t<sub>(106)</sub> = 1.98, p = 0.0251) significantly increased during SO-spindle coupling, with no significant changes in all other pathways (Fig. 4e). We also conducted PPI analyses for the other two events (SOs and spindles), and neither yielded significant connectivity changes in the three ROIs, as all failed to survive whole-brain FWE correction at the cluster level (p < 0.05). Together, these findings suggest that the thalamus, likely via spindles, coordinates hippocampal-cortical communication selectively during SO-spindle coupling, but not isolated SOs or spindle events alone.”

      (12) The limited temporal resolution of fMRI does indeed not allow for easily distinguishing between fMRI activation patterns related to SO-up- vs. SO-down-states. For this, one could try to extract the amplitudes of SO-up- and SO-down-states separately for each SO event and model them as two separate parametric modulators (with the risk of collinearity as they are likely correlated).

      We appreciate your insightful comment regarding the challenge of distinguishing fMRI activation patterns related to SO-up vs. SO-down states due to the limited temporal resolution of fMRI. While our current analysis does not differentiate between these two phases, we acknowledge that separately modeling SO-up and SO-down states using parametric modulators could provide a more refined understanding of their distinct neural correlates. However, as you note, this approach carries the risk of collinearity, and the two amplitudes are indeed highly correlated across all subjects in our data (r = 0.98). Future studies could address this by leveraging high-temporal-resolution techniques. Because implementing this is beyond the scope of the current study, we will acknowledge this limitation in the Discussion section.
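
      To give a sense of how severe this collinearity is, the following sketch computes the variance inflation factor (VIF) for a model with two correlated regressors, using the reported r = 0.98. The helper names and the toy amplitudes are hypothetical; the calculation is the standard two-regressor VIF and is not an analysis from the manuscript.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_from_r(r):
    """VIF for either regressor in a two-regressor model: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r * r)


# Hypothetical per-event amplitudes (µV) for UP- and DOWN-states.
up_amp = [18.0, 22.0, 25.0, 30.0, 35.0]
down_amp = [20.0, 23.5, 27.0, 31.0, 37.5]
r_toy = pearson_r(up_amp, down_amp)

vif_reported = vif_from_r(0.98)         # correlation reported in the response
se_inflation = math.sqrt(vif_reported)  # factor by which standard errors grow
print(round(vif_reported, 2), round(se_inflation, 2))
```

      With r = 0.98 the VIF is about 25, meaning the standard error of each modulator's estimate would be inflated roughly fivefold relative to uncorrelated regressors, which supports treating separate UP- and DOWN-state modulators with caution.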

      Discussion, Page 17 Lines 308-322

      “An intriguing aspect of our findings is the reduced DMN activity during SOs when modeled at the SO trough (DOWN-state). This reduced DMN activity may reflect large-scale neural inhibition characteristic of the SO trough. The DMN is typically active during internally oriented cognition (e.g., self-referential processing or mind-wandering) and is suppressed during external stimuli processing (Yeshurun, Nguyen, & Hasson, 2021). It is unlikely, however, that this suppression of DMN during SO events is related to a shift from internal cognition to external responses given it is during deep sleep time. Instead, it could be driven by the inherent rhythmic pattern of SOs, which makes it difficult to separate UP- from DOWN-states (the two temporal regressors were highly correlated, and similar brain activation during SOs events was obtained if modelled at the SO peak instead, Fig. S5). Since the amplitude at the SO trough is consistently larger than that at the SO peak, the neural activation we detected may primarily capture the large-scale inhibition from DOWN-state. Interestingly, no such DMN reduction was found during SO-spindle coupling, implying that coupling may involve distinct neural dynamics that partially re-engage DMN-related processes, possibly reflecting memory-related reactivation. Future research using high-temporal-resolution techniques like iEEG could clarify these possibilities.”

      Discussion, Page 18 Lines 333-341

      “Despite providing new insights, our study has several limitations. First, our scalp EEG did not directly capture hippocampal ripples, preventing us from conclusively demonstrating triple coupling. Second, the combination of EEG-fMRI and the lack of a memory task limit our ability to parse fine-grained BOLD responses at the DOWN- vs. UP-states of SOs and link observed activations to behavioral outcomes. Third, the use of large anatomical ROIs may mask subregional contributions of specific thalamic nuclei or hippocampal subfields. Finally, without a memory task, we cannot establish a direct behavioral link between sleep-rhythm-locked activation and memory consolidation. Future studies combining techniques such as ultra-high-field fMRI or iEEG with cognitive tasks may refine our understanding of subregional network dynamics and functional significance during sleep.”

      (13) L327: "It is likely that our findings of diminished DMN activity reflect brain activity during the SO DOWN-state, as this state consistently shows higher amplitude compared to the UP-state within subjects, which is why we modelled the SO trough as its onset in the fMRI analysis." This conclusion is not justified as the fact that SO down-states are larger in amplitude does not mean their impact on the BOLD response is larger.

      We appreciate your concern regarding our interpretation of diminished DMN activity as reflecting the SO DOWN-state. We acknowledge that the original wording was somewhat misleading. Our interpretation is that the reduced DMN activity could be driven by the inherent rhythmic pattern of SOs, which makes it difficult to separate UP- from DOWN-states (the two temporal regressors were highly correlated, and similar brain activation during SO events was obtained if modelled at the SO peak instead). Since the amplitude at the SO trough is consistently larger than that at the SO peak, the neural activation we detected may primarily capture the large-scale inhibition from the DOWN-state. We will make this clear in the Discussion section.

      Discussion, Page 17 Lines 308-322

      “An intriguing aspect of our findings is the reduced DMN activity during SOs when modeled at the SO trough (DOWN-state). This reduced DMN activity may reflect large-scale neural inhibition characteristic of the SO trough. The DMN is typically active during internally oriented cognition (e.g., self-referential processing or mind-wandering) and is suppressed during external stimuli processing (Yeshurun, Nguyen, & Hasson, 2021). It is unlikely, however, that this suppression of DMN during SO events is related to a shift from internal cognition to external responses given it is during deep sleep time. Instead, it could be driven by the inherent rhythmic pattern of SOs, which makes it difficult to separate UP- from DOWN-states (the two temporal regressors were highly correlated, and similar brain activation during SOs events was obtained if modelled at the SO peak instead, Fig. S5). Since the amplitude at the SO trough is consistently larger than that at the SO peak, the neural activation we detected may primarily capture the large-scale inhibition from DOWN-state. Interestingly, no such DMN reduction was found during SO-spindle coupling, implying that coupling may involve distinct neural dynamics that partially re-engage DMN-related processes, possibly reflecting memory-related reactivation. Future research using high-temporal-resolution techniques like iEEG could clarify these possibilities.”

      (14) Line 77: "In the current study, while directly capturing hippocampal ripples with scalp EEG or fMRI is difficult, we expect to observe hippocampal activation in fMRI whenever SOs-spindles coupling is detected by EEG, if SOs- spindles-ripples triple coupling occurs during human NREM sleep". Not all SO-spindle events are associated with ripples (Staresina et al., 2015), but hippocampal activation may also be expected based on the occurrence of spindles alone (Bergmann et al., 2012).

      We appreciate your clarification regarding the relationship between SO-spindle coupling and hippocampal ripples. We acknowledge that not all SO-spindle events are necessarily accompanied by ripples (Staresina et al., 2015). However, previous research indicates that hippocampal ripples are significantly more likely to occur during SO-spindle coupling events. This suggests that while ripple occurrence is not guaranteed, SO-spindle coupling creates a favorable network state for ripple generation and potential hippocampal activation. To ensure accuracy, we will delete this misleading sentence from the Introduction and acknowledge in the Discussion that our data cannot conclusively demonstrate the triple coupling of SOs, spindles, and hippocampal ripples.

      Discussion, Page 18 Lines 333-341

      “Despite providing new insights, our study has several limitations. First, our scalp EEG did not directly capture hippocampal ripples, preventing us from conclusively demonstrating triple coupling. Second, the combination of EEG-fMRI and the lack of a memory task limit our ability to parse fine-grained BOLD responses at the DOWN- vs. UP-states of SOs and link observed activations to behavioral outcomes. Third, the use of large anatomical ROIs may mask subregional contributions of specific thalamic nuclei or hippocampal subfields. Finally, without a memory task, we cannot establish a direct behavioral link between sleep-rhythm-locked activation and memory consolidation. Future studies combining techniques such as ultra-high-field fMRI or iEEG with cognitive tasks may refine our understanding of subregional network dynamics and functional significance during sleep.”

      Reviewer #2 (Public review):

      In this study, Wang and colleagues aimed to explore brain-wide activation patterns associated with NREM sleep oscillations, including slow oscillations (SOs), spindles, and SO-spindle coupling events. Their findings reveal that SO-spindle events corresponded with increased activation in both the thalamus and hippocampus. Additionally, they observed that SO-spindle coupling was linked to heightened functional connectivity from the hippocampus to the thalamus, and from the thalamus to the medial prefrontal cortex, three key regions involved in memory consolidation and episodic memory processes.

      This study's findings are timely and highly relevant to the field. The authors' extensive data collection, involving 107 participants sleeping in an fMRI scanner while undergoing simultaneous EEG recording, deserves special recognition. If shared, this unique dataset could lead to further valuable insights. While the conclusions seem overall well supported by the data, some aspects with regard to the detection of sleep oscillations need clarification.

      The authors report that coupled SO-spindle events were most frequent during NREM sleep (2.46 ± 0.06 events/min), but they also observed a surprisingly high occurrence of these events during N1 and REM sleep (2.23 ± 0.09 and 2.32 ± 0.09 events/min, respectively), where SO-spindle coupling would not typically be expected. Combined with the relatively modest SO amplitudes reported (~25 µV, whereas >75 µV would be expected when using mastoids as reference electrodes), this raises the possibility that the parameters used for event detection may not have been conservative enough - or that sleep staging was inaccurately performed. This issue could present a significant challenge, as the fMRI findings are largely dependent on the reliability of these detected events.

      Thank you very much for your thorough and encouraging review. We appreciate your recognition of the significance and relevance of our study and dataset, particularly in highlighting how simultaneous EEG-fMRI recordings can provide complementary insights into the temporal dynamics of neural oscillations and their associated spatial activation patterns during sleep. In the sections that follow, we address each of your comments in detail. We have revised the text and conducted additional analyses wherever possible to strengthen our argument and clarify our methodological choices. We believe these revisions improve the clarity and rigor of our work, and we thank you for helping us refine it.

      We appreciate your insightful comments regarding the detection of sleep oscillations. Our methods for detecting SOs, spindles, and their couplings were originally developed for N2 and N3 sleep data, based on the specific characteristics of these stages. These methods are widely recognized in sleep research (Hahn et al., 2020; Helfrich et al., 2019; Helfrich et al., 2018; Ngo, Fell, & Staresina, 2020; Schreiner et al., 2022; Schreiner et al., 2021; Staresina et al., 2015; Staresina et al., 2023). However, because this percentile-based detection approach will inherently identify a certain number of events if applied to other stages (e.g., N1 and REM), the nature of these events in those stages remains unclear compared to N2/N3. We nevertheless identified and reported the detailed descriptive statistics of these sleep rhythms in all sleep stages, under the same operational definitions, both for completeness and as a sanity check. Within the same subject, there should be more SOs, spindles, and their couplings in N2/N3 than in N1 or REM. We will acknowledge the reasons for these results in the Methods section and emphasize that they are used only for sanity checks.

      Regarding the reported SO amplitudes (~25 µV), during preprocessing we applied the Signal Space Projection (SSP) method to more effectively remove MRI gradient artifacts and cardiac pulse noise. While this approach enhances data quality, it also reduces overall signal power, leading to systematically lower reported amplitudes. Despite this, the SOs we detected in NREM sleep (especially N2/N3) remain physiologically meaningful and are consistent with previous fMRI studies using similar artifact removal techniques. We appreciate your careful evaluation and valuable suggestions.

      In addition, we will provide comprehensive tables in the supplementary materials containing descriptive information about sleep-related characteristics (Table S1), as well as detailed information about sleep waves at each sleep stage for all 107 subjects (Table S2-S4), listing for each subject: (1) duration of each sleep stage; (2) number of detected SOs; (3) number of detected spindles; (4) number of detected SO-spindle coupling events; (5) density of detected SOs; (6) density of detected spindles; (7) density of detected SO-spindle coupling events.

      Methods, Page 25 Lines 515-524

      “We note that the above methods for detecting SOs, spindles, and their couplings were originally developed for N2 and N3 sleep data, based on the specific characteristics of these stages. These methods are widely recognized in sleep research (Hahn et al., 2020; Helfrich et al., 2019; Helfrich et al., 2018; Ngo, Fell, & Staresina, 2020; Schreiner et al., 2022; Schreiner et al., 2021; Staresina et al., 2015; Staresina et al., 2023). However, because this percentile-based detection approach will inherently identify a certain number of events if applied to other stages (e.g., N1 and REM), the nature of these events in those stages remains unclear compared to N2/N3. We nevertheless identified and reported the detailed descriptive statistics of these sleep rhythms in all sleep stages, under the same operational definitions, both for completeness and as a sanity check. Within the same subject, there should be more SOs, spindles, and their couplings in N2/N3 than in N1 or REM (see also Figure S2-S4, Table S1-S4).”

      Supplementary Materials, Page 42-54, Table S1-S4

      (Given their length, we do not list all the tables here. Please refer to the revised manuscript.)

      Reviewer #3 (Public review):

      Summary:

      Wang et al., examined the brain activity patterns during sleep, especially when locked to those canonical sleep rhythms such as SO, spindle, and their coupling. Analyzing data from a large sample, the authors found significant coupling between spindles and SOs, particularly during the upstate of the SO. Moreover, the authors examined the patterns of whole-brain activity locked to these sleep rhythms. To understand the functional significance of these brain activities, the authors further conducted open-ended cognitive state decoding and found a variety of cognitive processing may be involved during SO-spindle coupling and during other sleep events. The authors next investigated the functional connectivity analyses and found enhanced connectivity between the hippocampus, the thalamus, and the medial PFC. These results reinforced the theoretical model of sleep-dependent memory consolidation, such that SO-spindle coupling is conducive to systems-level memory reactivation and consolidation.

      Strengths:

      There are obvious strengths in this work, including the large sample size, state-of-the-art neuroimaging and neural oscillation analyses, and the richness of results.

      Weaknesses:

      Despite these strengths and the insights gained, there are weaknesses in the design, the analyses, and inferences.

      Thank you for your detailed and thoughtful review of our manuscript. We are delighted that you recognize the large sample size, the advanced neuroimaging and neural oscillation analyses, and the richness of the results. In the following sections, we provide detailed responses to each of your comments. We have revised the text and conducted additional analyses to strengthen our arguments and clarify our methodological choices. We believe these revisions enhance the clarity and rigor of our work, and we sincerely appreciate your thoughtful feedback in helping us refine the manuscript.

      (1) A repeating statement in the manuscript is that brain activity could indicate memory reactivation and thus consolidation. This is indeed a highly relevant question that could be informed by the current data/results. However, an inherent weakness of the design is that there is no memory task before and after sleep. Thus, it is difficult (if not impossible) to make a strong argument linking SO/spindle/coupling-locked brain activity with memory reactivation or consolidation.

      We appreciate your suggestion regarding the lack of a pre- and post-sleep memory task in our study design. We acknowledge that, in the absence of behavioral measures, it is hard to directly link SO-spindle coupling to memory consolidation in an outcome-driven manner. Our interpretation is instead based on the well-established role of these oscillations in memory processes, as demonstrated in previous studies. We sincerely appreciate this feedback and will adjust our Discussion accordingly to reflect a more precise interpretation of our findings.

      Discussion, Page 18 Lines 333-341

      “Despite providing new insights, our study has several limitations. First, our scalp EEG did not directly capture hippocampal ripples, preventing us from conclusively demonstrating triple coupling. Second, the combination of EEG-fMRI and the lack of a memory task limit our ability to parse fine-grained BOLD responses at the DOWN- vs. UP-states of SOs and link observed activations to behavioral outcomes. Third, the use of large anatomical ROIs may mask subregional contributions of specific thalamic nuclei or hippocampal subfields. Finally, without a memory task, we cannot establish a direct behavioral link between sleep-rhythm-locked activation and memory consolidation. Future studies combining techniques such as ultra-high-field fMRI or iEEG with cognitive tasks may refine our understanding of subregional network dynamics and functional significance during sleep.”

      (2) Relatedly, to understand the functional implications of the sleep rhythm-locked brain activity, the authors employed the "open-ended cognitive state decoding" method. While this method is interesting, it is rather indirect given that there were no behavioral indices in the manuscript. Thus, discussions based on these analyses are speculative at best. Please either tone down the language or find additional evidence to support these claims.

      Moreover, the results from this method are difficult to understand. Figure 3e showed that for all three types of sleep events (SO, spindle, SO-spindle), the same mental states (e.g., working memory, episodic memory, declarative memory) showed opposite directions of activation (left and right panels showed negative and positive activation, respectively). How to interpret these conflicting results? This ambiguity is also reflected by the term used: declarative memory and episodic memories are both indexed in the results. Yet these two processes can be largely overlapped. So which specific memory processes do these brain activity patterns reflect? The Discussion shall discuss these results and the limitations of this method.

      We appreciate your critical assessment of the open-ended cognitive state decoding method and its interpretational challenges. Given the concerns about the indirectness of this approach, we decided to remove its related content and results from Figure 3 in the main text and include it in Supplementary Figure 7.

      Due to the complexity of memory-related processes, we acknowledge that distinguishing between episodic and declarative memory based solely on this approach is not straightforward. We will revise the Supplementary Materials to explicitly discuss these limitations and clarify that our findings do not isolate specific cognitive processes but rather suggest general associations with memory-related networks.

      Discussion, Page 17-18 Lines 323-332

      “To explore functional relevance, we employed an open-ended cognitive state decoding approach using meta-analytic data (NeuroSynth: Yarkoni et al. (2011)). Although this method usefully generates hypotheses about potential cognitive processes, particularly in the absence of a pre- and post-sleep memory task, it is inherently indirect. Many cognitive terms showed significant associations (16 of 50), such as “episodic memory,” “declarative memory,” and “working memory.” We focused on episodic/declarative memory given the known link with hippocampal reactivation (Diekelmann & Born, 2010; Staresina et al., 2015; Staresina et al., 2023). Nonetheless, these inferences regarding memory reactivation should be interpreted cautiously without direct behavioral measures. Future research incorporating explicit tasks before and after sleep would more rigorously validate these potential functional claims.”

      (3) The coupling strength is somehow inconsistent with prior results (Hahn et al., 2020, eLife, Helfrich et al., 2018, Neuron). Specifically, Helfrich et al. showed that among young adults, the spindle is coupled to the peak of the SO. Here, the authors reported that the spindles were coupled to down-to-up transitions of SO and before the SO peak. It is possible that participants' age may influence the coupling (see Helfrich et al., 2018). Please discuss the findings in the context of previous research on SO-spindle coupling.

      We appreciate your concern regarding the temporal characteristics of SO-spindle coupling. We acknowledge that the SO-spindle coupling phase results in our study are not identical to those reported by Hahn et al. (2020) and Helfrich et al. (2018). However, these differences may arise from slight variations in event detection parameters, which can influence the precise phase estimation of coupling. Notably, Hahn et al. (2020) also reported slight discrepancies in their group-level coupling phase results, highlighting that methodological differences can contribute to variability across studies. Furthermore, our findings are consistent with those of Schreiner et al. (2021), further supporting the robustness of our observations.

      That said, we acknowledge that our original description of SO-spindle coupling as occurring at the "transition from the lower state to the upper state" was not entirely precise. The -π/2 phase represents the true transition point, while our observed coupling phase is actually closer to the SO peak rather than strictly at the transition. We will revise this statement in the manuscript to ensure clarity and accuracy in describing the coupling phase.
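
      The phase convention referred to here can be made concrete with a short sketch. It assumes that SO phase is defined on a cosine, so that phase 0 is the UP-state peak, ±π is the trough, and -π/2 is the rising (DOWN-to-UP) zero crossing; the per-subject preferred phases below are hypothetical values chosen for illustration, not data from the study.

```python
import math


def so_waveform(phase):
    """Idealized SO amplitude under a cosine phase convention (peak at 0)."""
    return math.cos(phase)


def so_slope(phase):
    """Derivative of the idealized waveform with respect to phase."""
    return -math.sin(phase)


def circular_mean(phases):
    """Mean direction of a set of angles (radians), in the range (-pi, pi]."""
    s = sum(math.sin(p) for p in phases)
    c = sum(math.cos(p) for p in phases)
    return math.atan2(s, c)


# At -pi/2 the idealized SO crosses zero while rising: the DOWN-to-UP transition.
transition = -math.pi / 2
at_zero = abs(so_waveform(transition)) < 1e-12
rising = so_slope(transition) > 0

# Hypothetical per-subject preferred spindle phases (radians).
preferred = [-0.5, -0.3, -0.4, -0.2]
group_phase = circular_mean(preferred)

# Compare the group phase with the two reference points.
dist_to_peak = abs(group_phase)
dist_to_transition = abs(group_phase - transition)
print(round(group_phase, 3), dist_to_peak < dist_to_transition)
```

      In this toy case the group phase is negative, so it precedes the peak, but it lies much closer to 0 than to -π/2, which corresponds to the revised description of coupling shortly before the UP-state peak rather than at the transition itself.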

      Discussion, Page 16 Lines 283-291

      “Our data provide insights into the neurobiological underpinnings of these sleep rhythms. SOs, originating mainly in neocortical areas such as the mPFC, alternate between DOWN- and UP-states. The thalamus generates sleep spindles, which in turn couple with SOs. Our finding that spindle peaks consistently occurred slightly before the UP-state peak of SOs (in 83 out of 107 participants), concurs with prior studies, including Schreiner et al. (2021). Yet it differs from some results suggesting spindles might peak right at the SO UP-state (Hahn et al., 2020; Helfrich et al., 2018). Such discrepancies could arise from differences in detection algorithms, participant age (Helfrich et al., 2018), or subtle variations in cortical-thalamic timing. Nonetheless, these results underscore the importance of coordinated SO-spindle interplay in supporting sleep-dependent processes.”

      (4) The discussion is rather superficial with only two pages, without delving into many important arguments regarding the possible functional significance of these results. For example, the author wrote, "This internal processing contrasts with the brain patterns associated with external tasks, such as working memory." Without any references to working memory, and without delineating why WM is considered as an external task even working memory operations can be internal. Similarly, for the interesting results on SO and reduced DMN activity, the authors wrote "The DMN is typically active during wakeful rest and is associated with self-referential processes like mind-wandering, daydreaming, and task representation (Yeshurun, Nguyen, & Hasson, 2021). Its reduced activity during SOs may signal a shift towards endogenous processes such as memory consolidation." This argument is flawed. DMN is active during self-referential processing and mind-wandering, i.e., when the brain shifts from external stimuli processing to internal mental processing. During sleep, endogenous memory reactivation and consolidation are also part of the internal mental processing given the lack of external environmental stimulation. So why during SO or during memory consolidation, the DMN activity would be reduced? Were there differences in DMN activity between SO and SO-spindle coupling events?

      We appreciate your concerns regarding the brevity of the discussion and the need for clearer theoretical arguments. We will expand this section to provide more in-depth interpretations of our findings in the context of prior literature. Regarding working memory (WM), we acknowledge that our phrasing was ambiguous. We will modify this statement in the Discussion section.

      For the SO-related reduction in DMN activity, we recognize the need for a more precise explanation. This reduced DMN activity may reflect large-scale neural inhibition characteristic of the SO trough. The DMN is typically active during internally oriented cognition (e.g., self-referential processing or mind-wandering) and is suppressed during external stimuli processing (Yeshurun, Nguyen, & Hasson, 2021). It is unlikely, however, that this suppression of DMN during SO events is related to a shift from internal cognition to external responses given it is during deep sleep time. Instead, it could be driven by the inherent rhythmic pattern of SOs, which makes it difficult to separate UP- from DOWN-states (the two temporal regressors were highly correlated, and similar brain activation during SOs events was obtained if modelled at the SO peak instead). Since the amplitude at the SO trough is consistently larger than that at the SO peak, the neural activation we detected may primarily capture the large-scale inhibition from DOWN-state.

      To address your final question, we have conducted an additional post hoc comparison of DMN activity between isolated SOs and SO-spindle coupling events. Our results indicate that DMN activation during SOs was significantly lower than during SO-spindle coupling (t<sub>(106)</sub> = -4.17, p < 1e-4). This suggests that SO-spindle coupling may involve distinct neural dynamics that partially re-engage DMN-related processes, possibly reflecting memory-related reactivation. We appreciate your constructive feedback and will integrate these expanded analyses and discussions into our revised manuscript.

      Results, Page 11 Lines 199-208

      “Spindles were correlated with positive activation in the thalamus (ROI analysis, t<sub>(106)</sub> = 15.39, p < 1e-4), the anterior cingulate cortex (ACC), and the putamen, alongside deactivation in the DMN (Fig. 3c). Notably, SO-spindle coupling was linked to significant activation in both the thalamus (ROI analysis, t<sub>(106)</sub> = 3.38, p = 0.0005) and the hippocampus (ROI analysis, t<sub>(106)</sub> = 2.50, p = 0.0070, Fig. 3d). However, no decrease in DMN activity was found during SO-spindle coupling, and DMN activity during SO was significantly lower than during coupling (ROI analysis, t<sub>(106)</sub> = -4.17, p < 1e-4). For more detailed activation patterns, see Table S5-S7. We also varied the threshold used to detect SO events to assess its effect on hippocampal activation during SO-spindle coupling and observed that hippocampal activation remained significant when the percentile thresholds for SO detection ranged between 71% and 80% (see Fig. S6).”

      Discussion, Page 17-18 Lines 308-332

      “An intriguing aspect of our findings is the reduced DMN activity during SOs when modeled at the SO trough (DOWN-state). This reduced DMN activity may reflect large-scale neural inhibition characteristic of the SO trough. The DMN is typically active during internally oriented cognition (e.g., self-referential processing or mind-wandering) and is suppressed during external stimuli processing (Yeshurun, Nguyen, & Hasson, 2021). It is unlikely, however, that this suppression of DMN during SO events is related to a shift from internal cognition to external responses given it is during deep sleep time. Instead, it could be driven by the inherent rhythmic pattern of SOs, which makes it difficult to separate UP- from DOWN-states (the two temporal regressors were highly correlated, and similar brain activation during SOs events was obtained if modelled at the SO peak instead, Fig. S5). Since the amplitude at the SO trough is consistently larger than that at the SO peak, the neural activation we detected may primarily capture the large-scale inhibition from DOWN-state. Interestingly, no such DMN reduction was found during SO-spindle coupling, implying that coupling may involve distinct neural dynamics that partially re-engage DMN-related processes, possibly reflecting memory-related reactivation. Future research using high-temporal-resolution techniques like iEEG could clarify these possibilities.

      To explore functional relevance, we employed an open-ended cognitive state decoding approach using meta-analytic data (NeuroSynth: Yarkoni et al. (2011)). Although this method usefully generates hypotheses about potential cognitive processes, particularly in the absence of a pre- and post-sleep memory task, it is inherently indirect. Many cognitive terms showed significant associations (16 of 50), such as “episodic memory,” “declarative memory,” and “working memory.” We focused on episodic/declarative memory given the known link with hippocampal reactivation (Diekelmann & Born, 2010; Staresina et al., 2015; Staresina et al., 2023). Nonetheless, these inferences regarding memory reactivation should be interpreted cautiously without direct behavioral measures. Future research incorporating explicit tasks before and after sleep would more rigorously validate these potential functional claims.”

      Reviewing Editor Comment:

      The reviewers think that you are working on a relevant and important topic. They are praising the large sample size used in the study. The reviewers are not all in line regarding the overall significance of the findings, but they all agree the paper would strongly benefit from some extra work, as all reviewers raise various critical points that need serious consideration.

      We appreciate your recognition of the relevance and importance of our study, as well as your acknowledgment of the large sample size as a strength of our work. We understand that there are differing perspectives regarding the overall significance of our findings, and we value the constructive critiques provided. We are committed to addressing the key concerns raised by all reviewers, including refining our analyses, clarifying our interpretations, and incorporating additional discussions to strengthen the manuscript. Below, we address your specific recommendations and provide responses to each point you raised to ensure our methods and results are as transparent and comprehensible as possible. We believe that these revisions will significantly enhance the rigor and impact of our study, and we sincerely appreciate your thoughtful feedback in helping us improve our work.

      Reviewer #1 (Recommendations for the authors):

      (1) The phrase "overnight sleep" suggests an entire night, while these were rather "nocturnal naps". Please rephrase.

      Thank you for pointing this out. We have revised the phrasing in our manuscript to "nocturnal naps" instead of "overnight sleep" to more accurately reflect the duration of the sleep recordings.

      (2) Sleep staging results (macroscopic sleep architecture) should be provided in more detail (at least min and % of the different sleep stages, sleep onset latency, total sleep duration, total recording duration), at least mean/SD/range.

      Thank you for this suggestion. We will provide comprehensive tables in the supplementary materials containing descriptive information about sleep-related characteristics. This information will help provide a clearer overview of the macroscopic sleep architecture in our dataset.

      Supplementary Materials, Page 42, Table S1

      Author response table 1.

      Descriptive results of demographic information and sleep characteristics. Note: The total recorded time is equal to the awake time plus the total sleep time. The sleep onset latency is the time taken to reach the first sleep epoch. The Sleep Efficiency is the ratio of actual sleep time to total recording time.

      Reviewer #2 (Recommendations for the authors):

      In order to allow for a better estimation of the reliability of the detected sleep events, please:

      (1) Provide densities and absolute numbers of all detected SOs and spindles (N1, NREM, and REM sleep).

      Thank you for pointing this out. We will provide comprehensive tables in the supplementary materials containing detailed information about sleep waves at each sleep stage for all 107 subjects (Table S2-S4), listing for each subject: 1) Duration of each sleep stage; 2) Number of detected SOs; 3) Number of detected spindles; 4) Number of detected SO-spindle coupling events; 5) Density of detected SOs; 6) Density of detected spindles; 7) Density of detected SO-spindle coupling events.

      Supplementary Materials, Page 43-54, Table S2-S4

      (Given their length, we do not list all the tables here. Please refer to the revised manuscript.)

      (2) Show ERPs for all detected SOs and spindles (per sleep stage).

      Thank you for the suggestion. We will provide ERPs for all detected SOs and spindles, separated by sleep stage (N1, N2&N3, and REM) in supplementary Fig. S2-S4. These ERP waveforms will help illustrate the characteristic temporal profiles of SOs and spindles across different sleep stages.

      Methods, Page 25, Line 525-532

      “Event-related potentials (ERP) analysis. After completing the detection of each sleep rhythm event, we performed ERP analyses for SOs, spindles, and coupling events in different sleep stages. Specifically, for SO events, we took the trough of the DOWN-state of each SO as the zero-time point, then extracted data in a [-2 s to 2 s] window from the broadband (0.1–30 Hz) EEG and used [-2 s to -0.5 s] for baseline correction; the results were then averaged across 107 subjects (see Fig. S2a). For spindle events, we used the peak of each spindle as the zero-time point and applied the same data extraction window and baseline correction before averaging across 107 subjects (see Fig. S2b). Finally, for SO-spindle coupling events, we followed the same procedure used for SO events (see Fig. 2a, Figs. S3–S4).”
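The epoching procedure quoted above (time-lock to the event, cut a [-2 s, 2 s] window, subtract the [-2 s, -0.5 s] baseline mean, then average) can be sketched for a single subject as follows. This is a minimal illustration and not the authors' code: the function name `event_locked_average`, its arguments, and the choice to skip events too close to the recording edges are assumptions made here, and the EEG is assumed to be already band-pass filtered (0.1–30 Hz) and stored as a one-dimensional NumPy array.

```python
import numpy as np

def event_locked_average(signal, event_samples, sfreq,
                         window=(-2.0, 2.0), baseline=(-2.0, -0.5)):
    """Average baseline-corrected epochs time-locked to events (one subject).

    signal        : 1-D array of filtered EEG samples (assumed input)
    event_samples : sample indices of the zero-time points
                    (e.g. SO troughs or spindle peaks)
    sfreq         : sampling rate in Hz
    """
    start = int(round(window[0] * sfreq))
    stop = int(round(window[1] * sfreq))
    times = np.arange(start, stop + 1) / sfreq
    base_mask = (times >= baseline[0]) & (times <= baseline[1])

    epochs = []
    for s in event_samples:
        lo, hi = s + start, s + stop + 1
        if lo < 0 or hi > len(signal):
            continue  # assumption: drop events whose window leaves the recording
        epoch = np.asarray(signal[lo:hi], dtype=float)
        # Baseline correction: subtract the mean of the pre-event interval.
        epochs.append(epoch - epoch[base_mask].mean())

    if not epochs:
        return times, np.full(times.shape, np.nan)
    return times, np.mean(epochs, axis=0)
```

A grand average across subjects would then be the mean of the per-subject outputs, for example `np.nanmean(np.stack(per_subject_erps), axis=0)`, where `per_subject_erps` is a hypothetical list of the arrays returned above.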

      Supplementary Materials, Page 36-38, Fig. S2-S4

      Author response image 1.

      ERPs of SOs and spindles during different sleep stages across all 107 subjects. a. ERP of SOs in different sleep stages using the broadband (0.1–30 Hz) EEG data. We align the trough of the DOWN-state of each SO at time zero (see Methods for details). The orange line represents the SO ERP in the N1 stage, the black line represents the SO ERP in the N2&N3 stage, and the green line represents the SO ERP in the REM stage. b. ERP of spindles in different sleep stages using the broadband (0.1–30 Hz) EEG data. We align the peak of each spindle at time zero (see Methods for details). The color scheme is the same as in panel a.

      Author response image 2.

      ERP and time-frequency patterns of SO-spindle coupling in the N1 stage. The averaged temporal frequency pattern and ERP across all instances of SO-spindle coupling, computed over all subjects, following the same procedure as in Fig. 2a, but for N1 stage.

      Author response image 3.

      ERP and time-frequency patterns of SO-spindle coupling in the REM stage. The averaged temporal frequency pattern and ERP across all instances of SO-spindle coupling, computed over all subjects, again following the same procedure as in Fig. 2a, but for REM stage.

      (3) Provide detailed info concerning sleep characteristics (time spent in each sleep stage etc.).

      Thank you for this suggestion. As in the response above, we will provide comprehensive tables in the supplementary materials containing descriptive information about sleep-related characteristics.

      Supplementary Materials, Page 42, Table S1 (same as above)

      (4) What would happen if more stringent parameters were used for event detection? Would the authors still observe a significant number of SO spindles during N1 and REM? Would this affect the fMRI-related results?

      Thank you for this suggestion. Our methods for detecting SOs, spindles, and their couplings were originally developed for N2 and N3 sleep data, based on the specific characteristics of these stages. These methods are widely recognized in sleep research (Hahn et al., 2020; Helfrich et al., 2019; Helfrich et al., 2018; Ngo, Fell, & Staresina, 2020; Schreiner et al., 2022; Schreiner et al., 2021; Staresina et al., 2015; Staresina et al., 2023). However, because this percentile-based detection approach will inherently identify a certain number of events if applied to other stages (e.g., N1 and REM), the nature of these events in those stages remains unclear compared to N2/N3. We nevertheless identified and reported the detailed descriptive statistics of these sleep rhythms in all sleep stages, under the same operational definitions, both for completeness and as a sanity check. Within the same subject, there should be more SOs, spindles, and their couplings in N2/N3 than in N1 or REM (see also Figure S2-S4, Table S1-S4).

      Furthermore, in order to explore the impact of this on our fMRI results, we conducted an additional sensitivity analysis by applying different detection parameters for SOs. Specifically, we adjusted amplitude percentile thresholds for SO detection (the parameter that has the greatest impact on the results). We used the hippocampal activation value during N2&N3 stage SO-spindle coupling as an anchor value and found that when the parameters gradually became stricter, the results were similar to or even better than the current results. However, when we continued to increase the threshold, the results began to gradually decrease until the threshold was increased to 80%, and the results were no longer significant. This indicates that our results are robust within a specific range of parameters, but as the threshold increases, the number of trials decreases, ultimately weakening the statistical power of the fMRI analysis.
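The trade-off described above, where a stricter percentile threshold leaves fewer events for the fMRI model, can be illustrated with a short sketch. It is not the detection pipeline used in the study: it assumes that candidate SO amplitudes have already been extracted into an array, and the function names `select_by_amplitude_percentile` and `count_events_per_threshold` are hypothetical.

```python
import numpy as np

def select_by_amplitude_percentile(amplitudes, percentile):
    """Return indices of candidate events whose amplitude exceeds
    the given percentile of all candidate amplitudes."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    cutoff = np.percentile(amplitudes, percentile)
    return np.flatnonzero(amplitudes > cutoff)

def count_events_per_threshold(amplitudes, percentiles):
    """Count the events retained at each percentile threshold."""
    return {p: int(select_by_amplitude_percentile(amplitudes, p).size)
            for p in percentiles}
```

In a sensitivity analysis of this kind, the retained events at each threshold would be used to rebuild the GLM design matrix before re-estimating the ROI statistic, so the event count per threshold gives a rough indication of how much statistical power is being traded away.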

      Thank you again for your suggestions on sleep rhythm event detection. We will add the results in Supplementary and revise our manuscript accordingly.

      Results, Page 11, Line 199-208

      “Spindles were correlated with positive activation in the thalamus (ROI analysis, t<sub>(106)</sub> = 15.39, p < 1e-4), the anterior cingulate cortex (ACC), and the putamen, alongside deactivation in the DMN (Fig. 3c). Notably, SO-spindle coupling was linked to significant activation in both the thalamus (ROI analysis, t<sub>(106)</sub> = 3.38, p = 0.0005) and the hippocampus (ROI analysis, t<sub>(106)</sub> = 2.50, p = 0.0070, Fig. 3d). However, no decrease in DMN activity was found during SO-spindle coupling, and DMN activity during SO was significantly lower than during coupling (ROI analysis, t<sub>(106)</sub> = -4.17, p < 1e-4). For more detailed activation patterns, see Table S5-S7. We also varied the threshold used to detect SO events to assess its effect on hippocampal activation during SO-spindle coupling and observed that hippocampal activation remained significant when the percentile thresholds for SO detection ranged between 71% and 80% (see Fig. S6).”

      Supplementary Materials, Page 40, Fig. S6

      Author response image 4.

      Influence of the percentile threshold for SO detection on hippocampal activation (ROI) during SO-spindle coupling. We changed the percentile threshold for SO event detection in the EEG data analysis and then reconstructed the GLM design matrix based on the SO events detected at each threshold. The brain-wide activation pattern of SO-spindle couplings in the N2/3 stage was extracted using the same method as shown in Fig. 3. The gray horizontal line represents the significant range (71%–80%). * p < 0.05.

      Finally, we sincerely thank you all again for your thoughtful and constructive feedback. Your insights have been invaluable in refining our analyses, strengthening our interpretations, and improving the clarity and rigor of our manuscript. We appreciate the time and effort you have dedicated to reviewing our work, and we are grateful for the opportunity to enhance our study based on your recommendations.

      References:

      Bergmann, T. O., Mölle, M., Diedrichs, J., Born, J., & Siebner, H. R. (2012). Sleep spindle-related reactivation of category-specific cortical regions after learning face-scene associations. NeuroImage, 59(3), 2733-2742.

      Buzsáki, G. (2015). Hippocampal sharp wave‐ripple: A cognitive biomarker for episodic memory and planning. Hippocampus, 25(10), 1073-1188.

      Caporro, M., Haneef, Z., Yeh, H. J., Lenartowicz, A., Buttinelli, C., Parvizi, J., & Stern, J. M. (2012). Functional MRI of sleep spindles and K-complexes. Clinical neurophysiology, 123(2), 303-309.

      Coulon, P., Budde, T., & Pape, H.-C. (2012). The sleep relay—the role of the thalamus in central and decentral sleep regulation. Pflügers Archiv-European Journal of Physiology, 463, 53-71.

      Crunelli, V., Lőrincz, M. L., Connelly, W. M., David, F., Hughes, S. W., Lambert, R. C., Leresche, N., & Errington, A. C. (2018). Dual function of thalamic low-vigilance state oscillations: rhythm-regulation and plasticity. Nature Reviews Neuroscience, 19(2), 107-118.

      Czisch, M., Wehrle, R., Stiegler, A., Peters, H., Andrade, K., Holsboer, F., & Sämann, P. G. (2009). Acoustic oddball during NREM sleep: a combined EEG/fMRI study. PloS one, 4(8), e6749.

      Diba, K., & Buzsáki, G. (2007). Forward and reverse hippocampal place-cell sequences during ripples. Nature Neuroscience, 10(10), 1241.

      Diekelmann, S., & Born, J. (2010). The memory function of sleep. Nature Reviews Neuroscience, 11(2), 114-126.

      Fogel, S., Albouy, G., King, B. R., Lungu, O., Vien, C., Bore, A., Pinsard, B., Benali, H., Carrier, J., & Doyon, J. (2017). Reactivation or transformation? Motor memory consolidation associated with cerebral activation time-locked to sleep spindles. PloS one, 12(4), e0174755.

      Hahn, M. A., Heib, D., Schabus, M., Hoedlmoser, K., & Helfrich, R. F. (2020). Slow oscillation-spindle coupling predicts enhanced memory formation from childhood to adolescence. Elife, 9, e53730.

      Halassa, M. M., Siegle, J. H., Ritt, J. T., Ting, J. T., Feng, G., & Moore, C. I. (2011). Selective optical drive of thalamic reticular nucleus generates thalamic bursts and cortical spindles. Nature Neuroscience, 14(9), 1118-1120.

      Hale, J. R., White, T. P., Mayhew, S. D., Wilson, R. S., Rollings, D. T., Khalsa, S., Arvanitis, T. N., & Bagshaw, A. P. (2016). Altered thalamocortical and intra-thalamic functional connectivity during light sleep compared with wake. NeuroImage, 125, 657-667.

      Helfrich, R. F., Lendner, J. D., Mander, B. A., Guillen, H., Paff, M., Mnatsakanyan, L., Vadera, S., Walker, M. P., Lin, J. J., & Knight, R. T. (2019). Bidirectional prefrontal-hippocampal dynamics organize information transfer during sleep in humans. Nature Communications, 10(1), 3572.

      Helfrich, R. F., Mander, B. A., Jagust, W. J., Knight, R. T., & Walker, M. P. (2018). Old brains come uncoupled in sleep: slow wave-spindle synchrony, brain atrophy, and forgetting. Neuron, 97(1), 221-230.e224.

      Horovitz, S. G., Fukunaga, M., de Zwart, J. A., van Gelderen, P., Fulton, S. C., Balkin, T. J., & Duyn, J. H. (2008). Low frequency BOLD fluctuations during resting wakefulness and light sleep: A simultaneous EEG‐fMRI study. Human brain mapping, 29(6), 671-682.

      Huang, Q., Xiao, Z., Yu, Q., Luo, Y., Xu, J., Qu, Y., Dolan, R., Behrens, T., & Liu, Y. (2024). Replay-triggered brain-wide activation in humans. Nature Communications, 15(1), 7185.

      Ilhan-Bayrakcı, M., Cabral-Calderin, Y., Bergmann, T. O., Tüscher, O., & Stroh, A. (2022). Individual slow wave events give rise to macroscopic fMRI signatures and drive the strength of the BOLD signal in human resting-state EEG-fMRI recordings. Cerebral Cortex, 32(21), 4782-4796.

      Laufs, H. (2008). Endogenous brain oscillations and related networks detected by surface EEG‐combined fMRI. Human brain mapping, 29(7), 762-769.

      Laufs, H., Walker, M. C., & Lund, T. E. (2007). ‘Brain activation and hypothalamic functional connectivity during human non-rapid eye movement sleep: an EEG/fMRI study’—its limitations and an alternative approach. Brain, 130(7), e75.

      Margulies, D. S., Ghosh, S. S., Goulas, A., Falkiewicz, M., Huntenburg, J. M., Langs, G., Bezgin, G., Eickhoff, S. B., Castellanos, F. X., & Petrides, M. (2016). Situating the default-mode network along a principal gradient of macroscale cortical organization. Proceedings of the National Academy of Sciences, 113(44), 12574-12579.

      Massimini, M., Huber, R., Ferrarelli, F., Hill, S., & Tononi, G. (2004). The sleep slow oscillation as a traveling wave. Journal of Neuroscience, 24(31), 6862-6870.

      Moehlman, T. M., de Zwart, J. A., Chappel-Farley, M. G., Liu, X., McClain, I. B., Chang, C., Mandelkow, H., Özbay, P. S., Johnson, N. L., & Bieber, R. E. (2019). All-night functional magnetic resonance imaging sleep studies. Journal of neuroscience methods, 316, 83-98.

      Molle, M., Bergmann, T. O., Marshall, L., & Born, J. (2011). Fast and slow spindles during the sleep slow oscillation: disparate coalescence and engagement in memory processing. Sleep, 34(10), 1411-1421.

      Ngo, H.-V., Fell, J., & Staresina, B. (2020). Sleep spindles mediate hippocampal-neocortical coupling during long-duration ripples. Elife, 9, e57011.

      Picchioni, D., Horovitz, S. G., Fukunaga, M., Carr, W. S., Meltzer, J. A., Balkin, T. J., Duyn, J. H., & Braun, A. R. (2011). Infraslow EEG oscillations organize large-scale cortical–subcortical interactions during sleep: a combined EEG/fMRI study. Brain research, 1374, 63-72.

      Schabus, M., Dang-Vu, T. T., Albouy, G., Balteau, E., Boly, M., Carrier, J., Darsaud, A., Degueldre, C., Desseilles, M., & Gais, S. (2007). Hemodynamic cerebral correlates of sleep spindles during human non-rapid eye movement sleep. Proceedings of the National Academy of Sciences, 104(32), 13164-13169.

      Schreiner, T., Kaufmann, E., Noachtar, S., Mehrkens, J.-H., & Staudigl, T. (2022). The human thalamus orchestrates neocortical oscillations during NREM sleep. Nature communications, 13(1), 5231.

      Schreiner, T., Petzka, M., Staudigl, T., & Staresina, B. P. (2021). Endogenous memory reactivation during sleep in humans is clocked by slow oscillation-spindle complexes. Nature Communications, 12(1), 3112.

      Singh, D., Norman, K. A., & Schapiro, A. C. (2022). A model of autonomous interactions between hippocampus and neocortex driving sleep-dependent memory consolidation. Proceedings of the National Academy of Sciences, 119(44), e2123432119.

      Spoormaker, V. I., Schröter, M. S., Gleiser, P. M., Andrade, K. C., Dresler, M., Wehrle, R., Sämann, P. G., & Czisch, M. (2010). Development of a large-scale functional brain network during human non-rapid eye movement sleep. Journal of Neuroscience, 30(34), 11379-11387.

      Staresina, B. P., Bergmann, T. O., Bonnefond, M., van der Meij, R., Jensen, O., Deuker, L., Elger, C. E., Axmacher, N., & Fell, J. (2015). Hierarchical nesting of slow oscillations, spindles and ripples in the human hippocampus during sleep. Nature Neuroscience, 18(11), 1679-1686.

      Staresina, B. P., Niediek, J., Borger, V., Surges, R., & Mormann, F. (2023). How coupled slow oscillations, spindles and ripples coordinate neuronal processing and communication during human sleep. Nature Neuroscience, 1-9.

      Yarkoni, T., Poldrack, R. A., Nichols, T. E., Van Essen, D. C., & Wager, T. D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Nature methods, 8(8), 665-670.

      Yeshurun, Y., Nguyen, M., & Hasson, U. (2021). The default mode network: where the idiosyncratic self meets the shared social world. Nature Reviews Neuroscience, 1-12.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      The authors describe a method for gastruloid formation using mouse embryonic stem cells (mESCs) to study YS and AGM-like hematopoietic differentiation. They characterise the gastruloids during nine days of differentiation using a number of techniques including flow cytometry and single-cell RNA sequencing. They compare their findings to a published data set derived from E10-11.5 mouse AGM. At d9, gastruloids were transplanted under the adrenal gland capsule of immunocompromised mice to look for the development of cells capable of engrafting the mouse bone marrow. The authors then applied the gastruloid protocol to study overexpression of Mnx1 which causes infant AML in humans.

      In the introduction, the authors define their interpretation of the different waves of hematopoiesis that occur during development. 'The subsequent wave, known as definitive, produces: first, oligopotent erythro-myeloid progenitors (EMPs) in the YS (E8-E8.5); and later myelo-lymphoid progenitors (MLPs - E9.5-E10), multipotent progenitors (MPPs - E10-E11.5), and hematopoietic stem cells (HSCs - E10.5-E11.5), in the aorta-gonad-mesonephros (AGM) region of the embryo proper.' Herein they designate the yolk sac-derived wave of EMP hematopoiesis as definitive, according to convention, although paradoxically it does not develop from intraembryonic mesoderm or give rise to HSCs.

      The apparent perplexity of the Reviewer with our definition of primitive and definitive waves is somewhat surprising, as it is widely used in the field (e.g. PMID: 18204427; PMID: 28299650; PMID: 33681211). Definitive haematopoiesis, encompassing EMP, MLP, MPP and HSC, highlights their origin from haemogenic endothelium, generation of mature cells with adult characteristics from progenitors with multilineage potential and direct and indirect developmental contributions to the intra-embryonic and time-restricted generation of HSCs.

      General comments

      The authors make the following claims in the paper:

      (1) The development of a protocol for hemogenic gastruloids (hGx) that recapitulates YS and AGM-like waves of blood from HE.

      (2) The protocol recapitulates both YS and EMP-MPP embryonic blood development 'with spatial and temporal accuracy'.

      (3) The protocol generates HSC precursors capable of short-term engraftment in an adrenal niche.

      (4) Overexpression of MNX1 in hGx transforms YS EMP to 'recapitulate patient transcriptional signatures'.

      (5) hGx is a model to study normal and leukaemic embryonic hematopoiesis.

      There are major concerns with the manuscript. The statements and claims made by the authors are not supported by the data presented, data is overinterpreted, and the conclusions cannot be justified. Furthermore, the data is presented in a way that makes it difficult for the reader to follow the narrative, causing confusion. The authors have not discussed how their hGx compares to the previously published mouse embryoid body protocols used to model early development and hematopoiesis.

      Specific points

      (1) It is claimed that HGxs capture cellularity and topography of developmental blood formation. The hGx protocol described in the manuscript is a modification of a previously published gastruloid protocol (Rossi et al 2022). The rationale for the protocol modifications is not fully explained or justified. There is a lack of novelty in the presented protocol as the only modifications appear to be the inclusion of Activin A and an extension of the differentiation period from 7 to 9 days of culture. No direct comparison has been made between the two versions of gastruloid differentiation to justify the changes.

      The Reviewer paradoxically claims that the protocol is not novel and that it differs from a previous publication in at least 2 ways – the patterning pulse and the length of the protocol. Of these, the patterning pulse is key. As documented in Fig. S1, we cannot obtain Flk1-GFP expression in the absence of Activin A. Expression of Flk1 is a fundamental step in haemato-endothelial specification and, accordingly, we do not see CD41 or CD45+ cells in the absence of Activin A. Also, in our hands, there is a clear time-dependent progression of marker expression, with sequential acquisition of CD41 and CD45, with the latter not detectable until 192h (Fig. 1C-D), another key difference relative to the Rossi et al (2022) protocol. The 192h-timepoint, we argue in the manuscript, and present further evidence for in this rebuttal, corresponds to the onset of AGM-like haematopoiesis. We have empirically extended the protocol to maximise the CD45+ cell output (Fig. S1B-D).

      The inclusion of Activin A at high concentration at the beginning of differentiation would be expected to pattern endoderm rather than mesoderm. BMP signaling is required to induce Flk1+ mesoderm, even in the presence of Wnt.

      Again, we call the Reviewer’s attention to Fig. S1 which clearly shows that Activin A (with no BMP added) is required for induction of Flk1 expression, in the presence of Wnt. Activin A in combination with Wnt, is used in other protocols of haemato-endothelial differentiation from pluripotent cells, with no BMP added in the same step of patterning and differentiation (PMID: 39227582; PMID: 39223325). In the latter protocol, we also call the Reviewer’s attention to the fact that a higher concentration of Activin A precludes the need for BMP4 addition. Finally, one of us has recently reported that Activin A, on its own, will induce FLK1, as well as other anterior mesodermal progenitors (https://www.biorxiv.org/content/10.1101/2025.01.11.632562v1). In addressing the Reviewer’s concerns with the dose of Activin A used, we titrated its concentration against activation of Flk1, confirming optimal Flk1-GFP expression at the 100ng/ml dose used in the manuscript.

      Author response image 1.

      Dose-dependent requirement of Activin A for induction of Flk1 expression in haemogenic gastruloids. Composite GFP and brightfield live imaging of Flk1-GFP haemogenic gastruloids at 96h. Images were acquired using a Cytation5 instrument (Thermo). Images are representative of 12 gastruloids per condition.

      FACS analysis of the hGx during differentiation is needed to demonstrate the co-expression of Flk1-GFP and lineage markers such as CD34 to indicate patterning of endothelium from Flk1+ mesoderm. The FACS plots in Fig. 1 show c-Kit expression but very little VE-cadherin which suggests that CD34 is not induced. Early endoderm expresses c-Kit, CXCR4, and Epcam, but not CD34 which could account for the lack of vascular structures within the hGx as shown in Fig. 1E.

      We were surprised by the Reviewer’s comment that there are no endothelial structures in our gastruloids. The presence of a Flk1-GFP+ network is visible in the GFP images in Fig.1B, from 144h onwards, also shown in Author response image 2A. In addition, our single-cell RNA-seq data, included in the manuscript, confirms the presence of endothelial cells with a developing endothelial, including arterial, programme. This can be seen in Fig. 2B, F of the manuscript and is represented in Author response image 2B. In contrast with the Reviewer’s claims that no endothelial cells are formed, the data show that Kdr (Flk1)+ cells co-express Cdh5/VE-Cadherin and indeed Cd34, attesting to the presence of an endothelial programme. Arterial markers Efnb2, Flt1, and Dll4 are present. A full-blown programme, which also includes haemogenic markers including Sox17, Esam, Cd44 and Mecom is clear at early (144h) and, particularly at late (192h) timepoints in cells sorted on detection of surface c-Kit (Author response image 2B). Further to the data shown in B, already present in the manuscript, we also document co-expression of Flk1-GFP and CD34 by flow cytometry (Author response image 2C).

      Author response image 2.

      Haemogenic gastruloids have a branched vascular network. A. Whole-mount confocal imaging of 144h-haemogenic gastruloids. B. Differentiation of an arterial endothelial programme in haemogenic gastruloids; single-cell RNA-seq data of differentiating haemogenic gastruloids, sorted on cell surface expression of c-Kit at 144 and 192h; gene expression colour scale from yellow (low) to orange (high); grey = no detectable expression. C. Flow cytometry plots of 216h-haemogenic gastruloids showing detection of haemato-endothelial marker CD34.

      (2) The protocol has been incompletely characterised, and the authors have not shown how they can distinguish between either wave of Yolk Sac (YS) hematopoiesis (primitive erythroid/macrophage and erythro-myeloid EMP) or between YS and intraembryonic Aorta-Gonad-Mesonephros (AGM) hematopoiesis. No evidence of germ layer specification has been presented to confirm gastruloid formation, organisation, and functional ability to mimic early development. Furthermore, differentiation of YS primitive and YS EMP stages of development in vitro should result in the efficient generation of CD34+ endothelial and hematopoietic cells. There is no flow cytometry analysis showing the kinetics of CD34 cell generation during differentiation. Benchmarking the hGx against developing mouse YS and embryo data sets would be an important verification.

      The Reviewer is correct that we have not provided detailed characterisation of the different germ layers, as this was not the focus of the study. In that context, we were surprised by the earlier comment assuming co-expression of c-kit, Cxcr4 and Epcam, which we did not show, while overlooking the endothelial programme reiterated above, which we have presented.

      Given our focus on haemato-endothelial specification, we have started the single-cell RNA-seq characterisation of the haemogenic gastruloid at 120h and have not looked specifically at earlier timepoints of embryo patterning.

      This said, we show the presence of neuroectodermal cells in cluster 9; on the other hand, cluster 7 includes hepatoblast-like cells, denoting endodermal specification. We are happy to include this characterisation, to the extent that it is present, in a revised version of the manuscript. However, in the absence of earlier timepoints and given the bias towards mesodermal specification, we expect that specification of ectodermal and endodermal programmes may be incomplete.

      In respect of the contention regarding the capture of YS-like and AGM-like haematopoiesis, we have presented evidence in the manuscript that haemogenic cells generated during gastruloid differentiation, particularly at the late 192h and 216h timepoints, project onto highly purified c-Kit+ CD31+ Gfi1-expressing cells from mouse AGM (PMID: 38383534), providing support for the recapitulation of the corresponding developmental stage. In distinguishing between YS-like and AGM-like haematopoiesis, we call the Reviewer’s attention to the replotting of the single-cell RNA-seq data already in the manuscript, which we provided in response to point 1 (Author response image 2B). It highlights an increase in Sox17, but not Sox18, expression in the 192h haemogenic endothelium, which suggests an association with AGM haematopoiesis (PMID: 20228271). A significant association of Cd44 and Procr expression with the same timepoint (Fig. 2F in the manuscript) further supports an AGM-like endothelial-to-haematopoietic transition at the 192h timepoint.

      Following on the Reviewer’s comments about CD34, we also inspected co-expression of CD34 with CD41 and CD45, the latter co-expression present in, although not necessarily exclusive to, AGM haematopoiesis.

      Reassuringly, we observed clear co-expression with both markers (Author response image 3), in addition to a CD41+CD34- population, which likely reflects YS EMP-independent erythropoiesis. Interestingly, marker expression is responsive to the levels of Activin A used in the patterning pulse, with the 100ng/ml Activin A used in our protocol superior to 75ng/ml.

      Author response image 3.

      Association of CD34 with CD41 and CD45 expression is Activin A-responsive and supports the presence of definitive haematopoiesis. A. Flow cytometry analysis of CD34 and CD41 expression in 216h-haemogenic gastruloids; two doses of Activin A were used in the patterning pulse with CHIR99021 between 48-72h. FMO controls shown. B. Flow cytometry analysis of CD34 and CD45 at 216h in the same experimental conditions.

      We agree that it remains challenging to identify markers exclusive to AGM haematopoiesis, which is operationally equated with generation of transplantable haematopoietic stem cells. While HSC generation is a key event characteristic of the AGM, not all AGM haematopoiesis corresponds to HSCs, an important point in evaluating the data presented in the manuscript, and indeed acknowledged by us.

      Author response image 4.

      Clustering of haemogenic gastruloid cells sorted on the basis of haemato-endothelial surface markers CD41, c-Kit and CD45. A. Leiden clustering of single-cell RNA-seq data. B. Time stamps of sorted haemogenic gastruloid cells in A. C. Surface marker stamps of cells in A.

      Given the centrality of this point in comments by all the Reviewers, we have conducted projections of our single-cell RNA-seq data against two studies which (1) capture arterial and haemogenic specification in the para-splanchnopleura (pSP) and AGM region between E8.0 and E11 (Hou et al, PMID: 32203131), and (2) uniquely capture YS, AGM and FL progenitors and the AGM endothelial-to-haematopoietic transition (EHT) in the same scRNA-seq dataset (Zhu et al, PMID: 32392346).

      Focusing the analysis on the subsets of haemogenic gastruloid cells sorted as CD41+ (144h), c-Kit+ (144h and 192h) and CD45+ (192h and 216h) (Author response image 4A-C), we show:

      (1) That a subset of haemato-endothelial cells from haemogenic gastruloids at 144h to 216h project onto intra-embryonic cells spanning E8.25 to E10 (Author response image 5A-B). This is in agreement with our interpretation that 216h haemogenic gastruloids are no later than the MPP/pre-HSC state of embryonic development, requiring further maturation to generate long-term engrafting HSC.

      (2) That haemogenic gastruloids contain YS-like (including EMP-like) and AGM-like haematopoietic cells (Author response image 6A-B). Significantly, some of the cells, particularly c-Kit-sorted cells with a candidate endothelial and HE-like signature, project onto AGM pre-HE and HE, as well as IAHC, and later, predominantly 216h, cells have characteristics of MPP/LMPP-like cells from the FL.

      Altogether, the data support the notion that haemogenic gastruloids capture YS and AGM haematopoiesis until E10, as suggested by us in the manuscript. We thought it was important to share this preliminary data with the Editors at an early stage, and we will incorporate a deeper analysis in a revised version of the manuscript.

      Single-cell RNA sequencing was used to compare hGx with mouse AGM. The authors incorrectly conclude that ' ..specification of endothelial and HE cells in hGx follows with time-dependent developmental progression into putative AGM-like HE..' And, '...HE-projected hGx cells.......expressed Gata2 but not Runx1, Myb, or Gfi1b..' Hemogenic endothelium is defined by the expression of Runx1 and Gfi1b is downstream of Runx1.

      As a hierarchy of regulation, Gata2 precedes and drives Runx1 expression at the specification of HE (PMID: 17823307; PMID: 24297996), while Runx1 drives the EHT, upstream of Gfi1b in haematopoietic clusters (PMID: 34517413).

      Author response image 5.

      Projection of sorted haemogenic gastruloid cells onto Hou et al dataset (PMID: 32203131) analysing development of mouse intra-embryonic haematopoiesis. A. Time signatures of Hou et al data. B. Projection of Leiden clusters in Author response image 4A. Methodology as described in our manuscript; 68% gastruloid cells projected.

      Author response image 6.

      Projection of sorted haemogenic gastruloid cells onto Zhu et al dataset (PMID: 32392346), capturing arterial endothelial and haemogenic endothelial development, in reference to YS, AGM and FL haematopoietic progenitors. A. Functional cluster classification as per Zhu et al. B. Projection of Leiden clusters in Author response image 4A. Methodology as detailed in our manuscript; 58% gastruloid cells projected. Haematopoietic clusters annotated as in A.
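      The "percentage of gastruloid cells projected" quoted in the legends of Author response images 5 and 6 implies that some query cells are left unassigned instead of being forced onto a reference cluster. The following is a minimal sketch of that idea only, assuming a nearest-neighbour projection with a distance cutoff. It is not the projection method of the manuscript, and the function `project`, the `max_dist` parameter, the coordinates and the labels are all hypothetical.

```python
import math

def project(query, reference, max_dist):
    """Give each query cell the label of its nearest reference cell.

    A query cell whose nearest reference cell lies farther than max_dist
    is left unassigned (None), so it does not count as projected.
    Assumption: cells are points in a shared low-dimensional embedding.
    """
    assigned = []
    for q in query:
        label, dist = min(
            ((lab, math.dist(q, r)) for r, lab in reference),
            key=lambda pair: pair[1],
        )
        assigned.append(label if dist <= max_dist else None)
    return assigned

# Hypothetical 2-D embedding coordinates, for illustration only
reference = [
    ((0.0, 0.0), "YS EMP"),
    ((0.2, 0.1), "YS EMP"),
    ((5.0, 5.0), "AGM HE"),
    ((5.2, 4.9), "AGM HE"),
]
query = [(0.1, 0.0), (5.1, 5.0), (2.5, 2.5), (9.0, 0.0)]

labels = project(query, reference, max_dist=1.0)
fraction_projected = sum(lab is not None for lab in labels) / len(labels)
print(labels, f"{100 * fraction_projected:.0f}% projected")
```

      In this toy example, query cells that sit far from every reference cell stay unassigned, which is why a projected fraction below 100% is informative about how well a query population is represented in the reference.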

      (3) The hGx protocol 'generates hematopoietic SC precursors capable of short-term engraftment' is not supported by the data presented. Short-term engraftment would be confirmed by flow cytometric detection of hematopoietic cells within the recipient bone marrow, spleen, thymus, and peripheral blood that expressed the BFP transgene. This analysis was not provided. PCR detection of transcripts, following an unspecified number of amplification cycles, as shown in Figure 3G (incorrectly referred to as Figure 3F in the legend) is not acceptable evidence for engraftment.

      We provide the full flow cytometry analysis of spleen engraftment in the 5 mice which received implantation of 216h-haemogenic gastruloids in the adrenal gland; an additional (control) animal received adrenal injection of PBS (Author response image 7). The animals were analysed at 4 weeks. In this experiment, the bone marrow collection was limiting, and material was prioritised for PCR.

      We had previously provided only representative plots of flow cytometry analysis of bone marrow and spleen in Fig. S4E, which we described as low-level engraftment. The analysis was complemented with genomic DNA PCR, where detection was present in only some of the replicates tested per animal. We confirm that PCR analysis used conventional 40 cycles; the sensitivity was shown in Fig. S4F. As shown in Fig. 3 A-C, no more than 7 CD45+CD144+ multipotent cells are present per haemogenic gastruloid, with 3 haemogenic gastruloids implanted in the adrenal gland of each transplanted animal. We argue that the low level of cytometric and molecular engraftment at 4 weeks, from haemogenic gastruloid-derived progenitors that have not progressed beyond a stage equivalent to E10 (Author response image 5A-B) and that we have described as requiring additional maturation in vivo, is not surprising.

      Author response image 7.

      BFP engraftment of Nude recipient mice 4 weeks after unilateral adrenal implantation of 216h-haemogenic gastruloids. Flow cytometry analysis of spleen engraftment. Genomic PCR analysis is shown in Fig. 3G of the manuscript.

      Transplanted hGx formed teratoma-like structures, with hematopoietic cells present at the site of transplant only analysed histologically. Indeed, the quality of the images provided does not provide convincing validation that donor-derived hematopoietic cells were present in the grafts.

      As stated in the text, the images are meant to illustrate that the haemogenic gastruloids developed in situ. The observation of donor-derived blood cells in the implanted haemogenic gastruloids would not correspond to engraftment, as we have amply demonstrated that they have generated blood cells in vitro. There is no evidence that there are remaining pluripotent cells in the haemogenic gastruloid after 9 days of differentiation, and it is therefore not clear that these are teratomas.

      There is no justification for the authors' conclusion that '... the data suggest that 216h hGx generate AGM-like pre-HSC capable of at least short-term multilineage engraftment upon maturation...'. Indeed, this statement is in conflict with previous studies demonstrating that pre-HSCs in the dorsal aorta of the mouse embryo are immature and actually incapable of engraftment.

      We have clearly stated that we do not see haematopoietic engraftment through transplantation of dissociated haemogenic gastruloids, which reach the E10 state containing pre-HSC (Author response image 5). Instead, we observed rare myelo-erythroid (in the manuscript) and myelo-lymphoid (Author response image 9 below, in response to Reviewer 2) engraftment upon in vivo maturation of haemogenic gastruloids with preserved 3D organisation. These statements are not contradictory.

      The statement '...low-level production of engrafting cells recapitulates their rarity in vivo, in agreement with the embryo-like qualities of the gastruloid system....' is incorrect. Firstly, no evidence has been provided to show the hGx has formed a dorsal aorta facsimile capable of generating cells with engrafting capacity. Secondly, although engrafting cells are rare in the AGM, approximately one per embryo, they are capable of robust and extensive engraftment upon transplantation.

      We are happy to rephrase the statement to simply say that “…the data suggest that 216h haemogenic gastruloids contain candidate AGM-like progenitors with some short-term engraftment potential but incomplete functional maturation.” To be clear, with our existing statement we meant to highlight that the production of definitive AGM-like haematopoietic progenitors (not all of which are engrafting) in haemogenic gastruloids does not correspond to non-physiological single-lineage programming. We did not claim that we achieved production of HSC, which would be long-term engrafting.

      (4) Expression of MNX1 transcript and protein in hematopoietic cells in MNX1 rearranged acute myeloid leukaemia (AML) is one cause of AML in infants. In the hGX model of this disease, Mnx1 is overexpressed in the mESCs that are used to form gastruloids. Mnx1 overexpression seems to confer an overall growth advantage on the hGx and increase the serial replating capacity of the small number of hematopoietic cells that are generated. The inefficiency with which the hGx model generates hematopoietic cells makes it difficult to model this disease. The poor quality of the cytospin images prevents accurate identification of cells. The statement that the kit-expressing cells represent leukemic blast cells is not sufficiently validated to support this conclusion. What other stem cell genes are expressed? Surface kit expression also marks mast cells, frequently seen in clonogenic assays of blood cells. Flow cytometric and gene expression analyses using known markers would be required.

      The haemogenic gastruloid model generates haematopoietic and haemato-endothelial cells. MNX1 expands Kit+ cells at 144h, which we show to have a haemato-endothelial signature (manuscript Fig. 2B, which we replotted in Author response image 2B).

      Serial replating of CFC assays is a conventional in vitro assay of leukaemia transformation. Critically, colony replating is not maintained in EV control cells, attesting to the transformation potential of MNX1.

      Although we have not fully traced the cellular hierarchy of MNX1-driven transformation in the haemogenic gastruloid system, the in vitro replating expands a Kit+ cell (Fig. 5E), which reflects the surface phenotype of the leukaemia, also recapitulated in the mouse model initiated by MNX1-overexpressing FL cells. Importantly, it recapitulates the transcriptional profile of MNX1-leukaemia patients (Fig. 6C), which is uniquely expressed by MNX1 144h and replated colony cells, but not by MNX1 216h gastruloid cells, arguing against a generic signature of MNX1 overexpression (Fig. 6B). Importantly, the MNX1-transformation of haemogenic gastruloid cells is superior to the FL leukaemia model at capturing the unique transcriptional features of MNX1-driven leukaemia, distinct from other forms of AML in the same age group (Fig. S7). It is possible that this corresponds to a preleukaemia event, and we will explore this in future studies, which are beyond the proof-of-principle nature of this paper.

      (5) In human infant MNX1 AML, the mutation is thought to arise at the fetal liver stage of development. There is no evidence that this developmental stage is mimicked in the hGx model.

      We never claim that the haemogenic gastruloid model mimics the foetal liver. We propose that susceptibility to MNX1 is at the HE-to-EMP transition. Moreover, and importantly, contrary to the Reviewer’s statement, there is no evidence in the literature that the mutation arises in the foetal liver stage, just that the mutation arises before birth (PMID: 38806630), which is different. In a mouse model of MNX1 overexpression, the authors achieve leukaemia engraftment upon MNX1 overexpression in foetal liver, but not in bone marrow cells (PMID: 37317878). This is in agreement with a vulnerability of embryonic / foetal, but not adult cells to the MNX1 expression caused by the translocation. However, haematopoietic cells in the foetal liver originate from YS and AGM precursors, so the origin of the MNX1-susceptible cells can be in those locations, rather than the foetal liver itself.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors develop an exciting new hemogenic gastruloid (hGX) system, which they claim reproduces the sequential generation of various blood cell types. The key advantage of this cellular system would be its potential to more accurately recapitulate the spatiotemporal emergence of hematopoietic progenitors within their physiological niche compared to other available in vitro systems. The authors present a large set of data and also validate their new system in the context of investigating infant leukemia.

      Strengths:

      The development of this new in vitro system for generating hematopoietic cells is innovative and addresses a significant drawback of current in vitro models. The authors present a substantial dataset to characterize this system, and they also validate its application in the context of investigating infant leukemia.

      Weaknesses:

      The thorough characterization and full demonstration that the cells produced truly represent distinct waves of hematopoietic progenitors are incomplete. The data presented to support the generation of late yolk sac (YS) progenitors, such as lymphoid cells, and aortic-gonad-mesonephros (AGM)-like progenitors, including pre-hematopoietic stem cells (pre-HSCs), by this system are not entirely convincing. Given that this is likely the manuscript's most crucial claim, it warrants further scrutiny and direct experimental validation. Ideally, the identity of these progenitors should be further demonstrated by directly assessing their ability to differentiate into lymphoid cells or fully functional HSCs. Instead, the authors primarily rely on scRNA-seq data and a very limited set of markers (e.g., Ikzf1 and Mllt3) to infer the identity and functionality of these cells. Many of these markers are shared among various types of blood progenitors, and only a well-defined combination of markers could offer some assurance of the lymphoid and pre-HSC nature of these cells, although this would still be limited in the absence of functional assays.

      The identification of a pre-HSC-like CD45⁺CD41⁻/lo c-Kit⁺VE-Cadherin⁺ cell population is presented as evidence supporting the generation of pre-HSCs by this system, but this claim is questionable. This FACS profile may also be present in progenitors generated in the yolk sac such as early erythro-myeloid progenitors (EMPs). It is only within the AGM context, and in conjunction with further functional assays demonstrating the ability of these cells to differentiate into HSCs and contribute to long-term repopulation, that this profile could be strongly associated with pre-HSCs. In the absence of such data, the cells exhibiting this profile in the current system cannot be conclusively identified as true pre-HSCs.

      At this preliminary response stage, we present 2 additional pieces of evidence to support our claims that we capture YS and AGM stages of haematopoietic development. In future experiments, we can complement these with functional assays, including co-culture with OP9 and OP9-DL stroma.

      Author response image 8.

      EZH2 inhibition affects CD41+ cellular output in haemogenic gastruloids at 144h, but not 216h. A. Flow cytometry analysis of CD41 expression in 144h-haemogenic gastruloids treated with 0.5μM EZH2 inhibitor GSK126 from 120h. DMSO (0.05%), vehicle. 1 of 2 independent experiments (average CD41+: DMSO, 21.20%; GSK126, 12.10%; CD45 not detected). B. Flow cytometry analysis of CD41 and CD45 expression in 216h gastruloids, treated with DMSO or GSK126. (DMSO: average CD41+, 15.28%; average CD45+ 0.46%. GSK126: average CD41+, 23.78%; average CD45+, 2.08%).

      In Author response images 5 and 6, we project our single-cell RNA-seq data onto (1) developing intra-embryonic pSP and AGM between E8 and E11 (Author response image 5) and (2) a single-cell RNA-seq study of HE development which combines haemogenic and haematopoietic cells from the YS, the developing HE and IAHC in the AGM, and FL (Author response image 6). Our data maps E8.25-E10 (Author response image 5) and captures YS EMP and erythroid and myeloid progenitors, as well as AGM pre-HE, HE and IAHC, with some cells matching HSPC and LMPP (Author response image 6), as suggested by the projection onto the Thambyrajah et al data set (Fig. S3 in the manuscript).

      Given the difficulty in finding markers that specifically associate with AGM haematopoiesis, we inspected the possibility of capturing different regulatory requirements at different stages of gastruloid development mirroring differential effects in the embryo. Polycomb EZH2 is specifically required for EMP differentiation in the YS, but does not affect AGM-derived haematopoiesis; it is also not required for primitive erythroid cells (PMID: 29555646; PMID: 34857757). We treated haemogenic gastruloids from 120h onwards with either DMSO (0.05%) or GSK126 (0.5μM), and inspected the cellularity of gastruloids at 144h, which we equate with YS-EMP, and 216h – putatively AGM haematopoiesis (Author response image 8). We show that EZH2 inhibition / GSK126 treatment specifically reduces %CD41+ cells at 144h (Author response image 8A), but does not reduce %CD41+ or %CD45+ cells at 216h (Author response image 8B).

      Although preliminary, these data, together with the scRNA-seq projections described, provide evidence for our claim that 144h haemogenic gastruloids capture YS EMPs, while CD41+ and CD45+ cells isolated at 216h reflect AGM progenitors. We cannot conclude as to the functional nature of the AGM cells from this experiment.

      The engraftment data presented are also not fully convincing, as the observed repopulation is very limited and evaluated only at 4 weeks post-transplantation. The cells detected after 4 weeks could represent the progeny of EMPs that have been shown to provide transient repopulation rather than true HSCs.

      We clearly state that there is low-level engraftment and do not claim to have generated HSC. We describe cells with short-term engraftment potential. Although the cells we show in the manuscript at 4 weeks could be EMPs (Author response image 7 and Fig. 3 and S3), we now have 8-week analysis of implant recipients, in which we observed, again at low level, engraftment of the recipient bone marrow in 1 of 3 animals (Author response image 9). This engraftment is myeloid-lymphoid and therefore likely to have originated in a later progenitor. To be clear, we do not claim that this corresponds to the presence of HSC. It nevertheless supports the maturation of progenitors with engraftment potential.

      Author response image 9.

      Flow cytometry BFP engraftment of recipient bone marrow 8 weeks post-implantation of 216h-haemogenic gastruloids in the adrenal gland of Nude mice. 1 of 3 animals shows BFP CD45+ engraftment in the myeloid (Mac1+) and B-lymphoid (B220+) lineages. 3 haemogenic gastruloids were implanted unilaterally in the adrenal gland of each animal. A. Engrafted animal, showing CD45+ BFP cells of myeloid (CD11b) and B-lymphoid affiliation (B220). B. Non-engrafted mouse recipient of haemogenic gastruloid implants.

      Reviewer #3 (Public review):

      In this study, the authors employ a mouse ES-derived "hemogenic gastruloid" model which they generated and which they claim to be able to deconvolute YS and AGM stages of blood production in vitro. This work could represent a valuable resource for the field. However, in general, I find the conclusions in this manuscript poorly supported by the data presented. Importantly, it isn't clear what exactly are the "YS" and the "AGM"-like stages identified in the culture and where is the data that backs up this claim. In my opinion, the data in this manuscript lack convincing evidence that can enable us to identify what kind of hematopoietic progenitor cells are generated in this system. Therefore, the statement that "our study has positioned the MNX1-OE target cell within the YS-EMP stage (line 540)" is not supported by the evidence presented in this study. Overall, the system seems to be very preliminary and requires further optimization before those claims can be made.

      Specific comments below:

      (1) The flow cytometric analysis of gastruloids presented in Figure 1 C-D is puzzling. There is a large % of c-Kit+ cells generated, but few VE-Cad+ Kit+ double positive cells. Similarly, there are many CD41+ cells, but very few CD45+ cells, which one would expect to appear toward the end of the differentiation process if blood cells are actually generated. It would be useful to present this analysis as consecutive gating (i.e. evaluating CD41 and CD45 within VE-Cad+ Kit+ cells, especially if the authors think that the presence of VE-Cad+ Kit+ cells is suggestive of EHT). The quantification presented in D is misleading as the scale of each graph is different.

      Fig. 1C-D provides an overview of haemogenic markers during the timecourse of haemogenic gastruloid differentiation, and does indeed show a late up-regulation of CD45, as the Reviewer points out would be expected. The percentage of CD45+ cells is indeed low. However, we should point out that the haemogenic gastruloid protocol, although biased towards mesodermal outputs, does not aim to achieve pure haematopoietic specification, but rather to place it in its embryo-like context. Consecutive gating at the 216h-timepoint is shown and quantified in Fig. 3A-B. We refute that the scale is misleading. It is a necessity to represent the data in a way that is interpretable by the reader: the gates (in C) are truly representative and annotated, as are the plot axes (in D).

      (2) The imaging presented in Figure 1E is very unconvincing. C-Kit and CD45 signals appear as speckles and not as membrane/cell surfaces as they should. This experiment should be repeated and nuclear stain (i.e. DAPI) should be included.

      We include the requested images below (Author response image 10).

      Author response image 10.

      Confocal images of haematopoietic production in haemogenic gastruloids. Wholemount, cleared haemogenic gastruloids were stained for CD45 (pseudo-coloured red) and c-Kit antigens (pseudo-coloured yellow) with indirect staining, as described in the manuscript. Flk1-GFP signal is shown in green. Nuclei are contrasted with DAPI. (A) 192h. (B) 216h.

      (3) Overall, I am not convinced that hematopoietic cells are consistently generated in these organoids. The authors should sort hematopoietic cells and perform May-Grunwald Giemsa stainings as they did in Figure 6 to confirm the nature of the blood cells generated.

      It is factual that the data are reproducible and complemented by functional assays shown in Fig. 3, which clearly demonstrate haematopoietic output. The single-cell RNA-seq data also show expression of a haematopoietic programme. Nevertheless, we include Giemsa-Wright’s stained cytospins obtained at 216h to illustrate haematopoietic output (Author response image 11). Inevitably, the cytospins will be inconclusive as to the presence of endothelial-to-haematopoietic transition or the generation of haematopoietic stem/progenitor cells, as these cells do not have a distinctive morphology.

      Author response image 11.

      Cytospin of dissociated haemogenic gastruloids at 216h. Cytospins were stained with Giemsa-Wright’s stain and are visualised with a 40x objective. Annotated are cells in the monocytic (dashed open arrow), granulocytic (solid open arrow), megakaryocytic (solid arrow) and erythroid (asterisk) lineages; arrowheads indicate cells with a non-specific blast-like morphology. Representative image.

      (4) The scRNAseq in Figure 2 is very difficult to interpret. Specific points related to this:

      - Cluster annotation in Figure 2a is missing and should be included.

      - Why do the heatmaps show the expression of genes within sorted cells? Couldn't the authors show expression within clusters of hematopoietic cells as identified transcriptionally (which ones are they? See previous point)? Gene names are illegible.

      - I see no expression of Hlf or Myb in CD45+ cells (Figure 2G). Hlf is not expressed by any of the populations examined (panels E, F, G). This suggests no MPP or pre-HSC are generated in the culture, contrary to what is stated in lines 242-245. (PMID 31076455 and 34589491).

      Later on, it is again stated that "hGx cells... lacked detection of HSC genes like Hlf, Gfi1, or Hoxa9" (lines 281-283). To me, this is proof of the absence of AGM-like hematopoiesis generated in those gastruloids.

      Author response image 12.

      Expression of endothelial, haemogenic and haematopoietic genes in haemogenic gastruloid cells sorted at 144h, 192h and 216h. UMAP as in Author response image 4. Pecam (CD31) and CD34 represent endothelial genes also detected in haemogenic endothelium. CD44 is specifically enriched at the endothelial-to-haemogenic transition. Mecom is detected in haemogenic endothelium and haematopoietic progenitors. Mllt3 and Runx1 are haematopoietic markers. Hoxa9 and Hlf are associated with haematopoietic stem and progenitor cells and their detection is rare in haemogenic gastruloids at 216h.

      For a combination of logistic and technical reasons, we performed single-cell RNA-seq using the Smart-Seq2 platform, which is inherently low throughput. We overcame the issue of cell coverage by complementing whole-gastruloid transcriptional profiling at successive time-points with sorting of subpopulations of cells based on individual markers documented in Fig. 1. We clearly stated which platform was used as well as the number and type of cells profiled (Fig. S2A and lines 172-179 of the manuscript), and our approach is standard. We will review our representation of the data in a revised manuscript. Nevertheless, at this stage, we provide plots of the expression of key haematopoietic markers over UMAPs of haemogenic gastruloid timecourse (Author response image 12). We also show preliminary qRT-PCR data with increased Hlf expression upon extension of the protocol to 264h (Author response image 13), further confirming haematopoietic specification, including of candidate definitive progenitor cells, in the haemogenic gastruloid model.

      Author response image 13.

      Hlf expression is up-regulated in late stage haemogenic gastruloids. Quantitative RT-PCR analysis of Hlf expression in unfractionated haemogenic gastruloids cultured for 264h. From 168h onwards, haemogenic gastruloids were cultured in N2B27 in the presence of VEGF, SCF, FLT3L and TPO, all recombinant mouse cytokines, as described in the manuscript. Shown are mean±standard deviation of n=5 replicates from 2 mouse ES cell lines, respectively Flk1-GFP and Rosa26-BFP::Flk1-GFP, reported in the manuscript; 2-tailed unpaired t-test with Welch correction.
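      For readers less familiar with the test named in the legend above, the following is a minimal sketch of how the statistic of an unpaired t-test with Welch correction is computed. The two input lists are hypothetical placeholder values, not the qRT-PCR measurements of this experiment, and `welch_t` is an illustrative helper. Obtaining the two-tailed p-value additionally requires the Student t distribution, which is not implemented here.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Welch's test does not assume equal variances in the two groups.
    """
    na, nb = len(a), len(b)
    va = statistics.variance(a)  # sample variance (n - 1 denominator)
    vb = statistics.variance(b)
    se2 = va / na + vb / nb      # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical relative expression values (n=5 per group), illustration only
reference_group = [1.0, 1.2, 0.9, 1.1, 0.8]
extended_culture = [2.0, 3.5, 1.5, 4.0, 2.5]

t, df = welch_t(extended_culture, reference_group)
print(f"t = {t:.3f}, df = {df:.2f}")
```

      The Welch correction matters when the two groups have visibly different spread, as it lowers the degrees of freedom below the pooled value of n1 + n2 - 2.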

      (5) Mapping of scRNA-Seq data onto the dataset by Thambyrajah et al. is not proof of the generation of AGM HE. The dataset they are mapping to only contains AGM cells, therefore cells do not have the option to map onto something that is not AGM. The authors should try mapping to other publicly available datasets also including YS cells.

      We have done this and the data are presented in Author response images 5 and 6. As detailed in response to Reviewer 1, we have conducted projections of our single-cell RNA-seq data against two studies which (1) capture arterial and haemogenic specification in the para-splanchnopleura (pSP) and AGM region between E8.0 and E11 (Hou et al, PMID: 32203131) (Author response image 5), and (2) uniquely capture YS, AGM and FL progenitors and the AGM endothelial-to-haematopoietic transition (EHT) in the same scRNA-seq dataset (Zhu et al, PMID: 32392346) (Author response image 6). Specifically in answering the Reviewer’s point, we show that different subsets of haemogenic gastruloid cells sorted on haemogenic surface markers c-Kit, CD41 and CD45 cluster onto pre-HE and HE, intra-aortic clusters and FL progenitor compartments, and onto YS EMP and erythroid and myeloid progenitors. This lends support to our claim that the haemogenic gastruloid system specifies both YS-like and AGM-like cells.

      (6) Conclusions in Figure 3, named "hGx specify cells with preHSC characteristics" are not supported by the data presented here. Again, I am not convinced that hematopoietic cells can be efficiently generated in this system, and certainly not HSCs or pre-HSCs.

      We have provided evidence, both in the manuscript and in this response to Reviewers, that there is haematopoietic specification, including of progenitor cells, in the haemogenic gastruloid system (Fig. 3 and Author response image 7,9). We have added data in this response that supports the specification of YS-like and AGM-like cells (Author response image 5-6, 8). Importantly, we have never claimed that haemogenic gastruloids generate HSC. We accept the Reviewer’s comment that we have not provided sufficient evidence for the specification of pre-HSC-like cells. We will re-phrase Fig. 3 conclusion as “Haemogenic gastruloids specify cells with characteristics of definitive haematopoietic progenitors”.

      - FACS analysis in 3A is again very unconvincing. I do not think the population identified as c-Kit+ CD144+ is real. Also, why not try gating the other way around, as commonly done (e.g. VE-Cad+ Kit+ and then CD41/CD45)?

      There is nothing unconventional about our gating strategy, which was done from a more populated gate onto the less abundant one to ensure that the results are numerically more robust. In the case of haemogenic gastruloids, unlike the AGM preparations the Reviewer may be referring to, CD41+ and CD45+ cells are more abundant as there is no circulation of more differentiated haematopoietic cells away from the endothelial structures. This said, we did perform the gating as suggested (Author response image 14), indeed confirming that most VE-cad+ Kit+ cells are CD45+. Interestingly, VE-cad+ Kit- cells are predominantly CD41+, reinforcing the true haematopoietic nature of these cells.

      Author response image 14.

      Flow cytometry analysis of VE-cadherin+ cells in haemogenic gastruloids at 216h of the differentiation protocol, probing co-expression of CD45, CD41 and c-Kit.

      - The authors must have tried really hard, but the lack of short- or long-engraftment in a number of immunodeficient mouse models (lines 305-313) really suggests that no blood progenitors are generated in their system. I am not familiar with the adrenal gland transplant system, but it seems like a very non-physiological system for trying to assess the maturation of putative pre-HSCs. The data supporting the engraftment of these mice, essentially seen only by PCR and in some cases with a very low threshold for detection, are very weak, and again unconvincing. It is stated that "BFP engraftment of the Spl and BM by flow cytometry was very low level albeit consistently above control (Fig. S4E)" (lines 337-338). I do not think that two dots in a dot plot can be presented as evidence of engraftment.

      We have presented the data with full disclosure and do not deny that the engraftment achieved is low-level and short-term, indicating incomplete maturation of definitive haematopoietic progenitors in the current haemogenic gastruloid system. However, we call the Reviewer’s attention to the fact that detection of BFP+ cells by PCR and flow cytometry in the recipient animals at 4 weeks is consistent between the 2 methods (Author response image 7).

      Furthermore, we have now also been able to detect low-level myelo-lymphoid engraftment in the bone marrow 8 weeks after adrenal implantation, again suggesting the presence of a small number of definitive haematopoietic progenitors that potentially mature from the 3 haemogenic gastruloids implanted (Author response image 9).

      (7) Given the above, I find that the foundations needed for extracting meaningful data from the system when perturbed are very shaky at best. Nevertheless, the authors proceed to overexpress MNX1 by LV transduction, a system previously shown to transform fetal liver cells, mimicking the effect of the t(7;12) AML-associated translocation. Comments on this section:

      - The increase in the size of the organoid when MNX1 is expressed is a very unspecific finding and not necessarily an indication of any hematopoietic effect of MNX1 OE.

      We agree with the Reviewer on this point; it is nevertheless a reproducible observation which we thought relevant to describe for completeness and data reproducibility.

      - The mild increase of cKit+ cells (Figure 4E) at the 144hr timepoint and the lack of any changes in CD41+ or CD45+ cells suggests that the increase in Kit+ cells % is not due to any hematopoietic effect of MNX1 OE. No hematopoietic GO categories are seen in RNA seq analysis, which supports this interpretation. Could it be that just endothelial cells are being generated?

      The Reviewer is correct that the MNX1-overexpressing cells have a strong endothelial signature, which is present in the patients (Fig. 4A). We investigated a potential link with c-Kit by staining cells from the replating colonies during the process of in vitro transformation with CD31. We observed that 40-50% of c-Kit+ cells (20-30% total colony cells) co-expressed CD31 (Author response image 15), at least at early plating. These cells co-exist with haematopoietic cells, namely Ter119+ cells, as expected from the YS-like erythroid and EMP-like affiliation of haematopoietic output from 144h-haemogenic gastruloids (Fig. 5F).

      Author response image 15.

      Endothelial affiliation of MNX1-oe replating cells from haemogenic gastruloid. A. Representative flow cytometry plot of plate 1 CFC from MNX1-overexpressing haemogenic gastruloids at 144h. B. Quantification of the proportion of CD31+c-Kit+ cells in plates 1 and 2 of MNX1-oe-driven in vitro transformation.

      (8) There seems to be a relatively convincing increase in replating potential upon MNX1-OE, but this experiment has been poorly characterized. What type of colonies are generated? What exactly is the "proportion of colony forming cells" in Figures 5B-D? The colony increase is accompanied by an increase in Kit+ cells; however, the flow cytometry analysis has not been quantified.

      Given the inability to replate control EV cells, there is not a population to compare with in terms of quantification. The level of c-Kit+ represented in Fig. 5E is achieved at plate 2 or 3 (depending on the experiment), both of which are significantly enriched for colony-forming cells relative to control (Fig. 5B, D).

      (9) Do hGx cells engraft upon MNX1-OE? This experiment, which appears not to have been performed, is essential to conclude that leukemic transformation has occurred.

      For the purpose of this study, we are satisfied with confirmation of in vitro transformation potential of MNX1 haemogenic gastruloids, which can be used for screening purposes. Although interesting, in vivo leukaemia engraftment from haemogenic gastruloids is beyond the scope of this study.

    1. Author response:

      We would like to thank the three reviewers for the careful review and thoughtful comments on our manuscript. In addition to providing useful suggestions, they uncovered some embarrassing oversights on our part, related to experimental details including number of embryos, and quantification of variance in the observed changes for some of the experiments, which were inadvertently omitted in the submission. We provide below an initial response to the reviewers’ public reviews and expect to submit a revised manuscript comprehensively addressing all their concerns.

      I would like to start by addressing some of their most critical comments related to validation of the tools used to reduce soxB1 gene family function in the embryo. In the absence of the critical supplementary data that we inadvertently failed to include, the reviewers were left with an understandable but, we feel, erroneous impression that there was insufficient validation of mutant and knockdown tools.

      Reviewer #2 says “The sox2y589 mutant line is not properly verified in this manuscript, which could be done by examining ant-Sox2 antibody labeling, Western blot analysis or…”

      This validation, which had been performed previously both with antibody staining and with western blot analysis, was inadvertently omitted from the supplementary data submitted with the paper. The western blot data is shown here.

      Author response image 1.

      Validation of sox2 mutant phenotype with Western blot.

      Lysates were prepared from 25 embryos selected as wild type or potentially mutant based on the “loss of L1” phenotype at 6 dpf. This polyclonal antibody recognizes an epitope within the last 16 amino acids of the C-terminus.

      Author response image 2.

      Validation of sox2 mutant phenotype with antibody staining.

      Though in this experiment there was considerable background in the red channel, and it shows the lateral line nerve, loss of nuclear Sox2 expression is evident in the deposited neuromast of an embryo identified as a mutant based on its delayed deposition of the L1 neuromast.

      This data and a repeat of the antibody staining showing the primordium with loss of Sox2 will be included in a revised manuscript.

      Furthermore, Reviewer #2 comments “the authors show that the anti-Sox2 and antiSox3 antibody labeling is reduced but not absent in sox2 MO1 and sox3 MO-injected embryos, but do not show antibody labeling of the sox2 MO and sox3 MO-double injected embryos to determine if there is an additional knockdown”

      This will be included in a revised manuscript.

      Reviewer #2:

      The authors acknowledge that the sox2 MO1 used in this manuscript also alters sox3 function, but do not redo the experiments with a specific sox2 MO

      This is not exactly true. Having discovered that sox2 MO1 simultaneously reduces sox2 and sox3 function, we obtained three new morpholinos based on another paper (Kamachi et al 2008), which had quantitatively assessed the efficacy of three sox2-specific morpholinos (sox2 MO2, sox2 MO3, and sox2 MO4). The effects of these morpholinos on the pattern of L1 deposition were compared to that of sox2 MO1. This comparison was shown in supplementary Figure 2 and is included below. It shows that the sox2-specific morpholinos resulted in a poorly penetrant delay in deposition of L1, comparable to that of a sox2 mutant, which was quantified in supplementary Figure 3B. The observations with these three sox2-specific morpholinos independently supported the observations made with the sox2 mutant that reduction of sox2 on its own results in a delay in deposition of the first neuromast with low penetrance and that, to effectively examine the role of these SoxB1 genes in the primordium, their function needs to be compromised in a combinatorial manner. This conclusion was independently supported by observations made by crossing sox1a, sox2 and sox3 mutants (Figure 3 and Supplementary Figure 3). Therefore, even though the initial use of a sox2 morpholino that simultaneously knocks down sox3 was unintentional, it turned out to be useful: it allowed us to examine the effects of knocking down sox2 and sox3 with a single morpholino. Furthermore, though this project was initiated more than 15 years ago to specifically understand sox2 function, our focus had shifted to understanding the role of soxB1 family members sox1a, sox2 and sox3 functioning together as an interacting system that regulates Wnt activity in the primordium. Considering this broader focus, reflected in the title of the paper, it was not a priority to repeat every experiment previously done with the sox2MO1 with the new sox2-specific morpholinos. Instead, having acknowledged the “limitations” of sox2MO1, we used it to better understand effects of combinatorial reduction of SoxB1 function.

      Reviewer #1:

      It is not exactly clear what underlies the apparent redundancy. It would be helpful if the soxb gene family member expression was reported after loss of each.

      As suggested by Reviewer #1, we had previously looked at changes in expression of each of the soxB1 factors following loss of individual soxB1 factors, but had not included this in the supplementary data with the original submission. Apart from a reproducible and consistent expansion of sox1a expression into the trailing zone following loss of sox2 function, which is reported in the paper and quantified here, where 10/10 mutant embryos showed the expansion (compare region within bracket in WT and sox2<sup>-/-</sup>), no consistent changes in the expression of other soxB1 family members were observed that might account for compensation when function of a particular soxB1 factor is lost. The data shown here, together with more extensive quantification of changes, will be included in a revised version of the manuscript. At this time the only consistent change was the expansion of sox1a into the trailing zone when sox2 function is lost. This change reflects dependence of sox1a on Wnt activity and the fact that Wnt activity expands into the trailing zone when sox2 function is lost.

      Author response image 3.

      Reviewer #3:

      Given that the expression patterns of Sox1a and Sox3 are not merely different but are largely reciprocal, the mechanistic basis of their very similar double mutant phenotypes with Sox2 remains opaque.

      The simplest way to think about compensation for gene function in a network is to think of it being determined by expression of a homolog or another gene with a similar function being expressed in a similar or overlapping domain. However, it is more useful to think of Sox2 function in the primordium as part of an interacting network of SoxB1 factors whose differential regulatory mechanisms create a robust system that simultaneously regulates two key aspects of Wnt activity in the primordium: how high Wnt activity is allowed to get in the leading zone and how effectively it is shut off to facilitate protoneuromast maturation in the trailing zone. These features of Wnt activity influence both when and where nascent protoneuromasts will form in the wake of a progressively shrinking Wnt system and where they undergo effective maturation and stabilization prior to deposition. Changes in individual SoxB1 expression patterns provide some hints about how some SoxB1 factors may compensate when function of one or more of these factors is compromised. However, a deeper understanding of robustness and “compensation” will require a systems level understanding of this gene regulatory network with computational models, which we are currently working on in our group. It remains possible, for example, that how far into the trailing zone the Wnt activity has an influence is regulated at least in part by how high it is allowed to get in the leading zone by sox1a. Conversely, how high Wnt activity gets in the leading zone may be influenced by how effectively it is shut off in the trailing zone by sox2 and sox3, as this influences the size of the Wnt system, which in turn can influence the overall level of Wnt activity. In this manner Sox1a may cooperate with Sox2 and Sox3 to limit both how high Wnt activity is allowed to get in the primordium and to effectively shut it off in the trailing zone.

      Reviewer #3:

      Related to this, the authors discuss that Sox1a/Sox2 double knockdown produces a more severe phenotype than Sox2/Sox3 double knockdown, yet this difference is not obviously reflected in the data.

      The severity of the sox1a/sox2 double mutant phenotype compared to that of the sox2/sox3 double mutant is shown in Figure 3 K and N, and quantified in Supplementary Figure 3A. Simultaneous loss of sox2 and sox3 results in a small but relatively penetrant delay in where the first stable neuromast is deposited (Figure 3 N). By contrast, loss of sox2 and sox1a together consistently results in a longer delay in deposition of the first stable neuromast (Figure 3 K). A new graph, shown below, which will be incorporated in the revised paper, shows that there is a significant difference in the pattern of L1 deposition in sox1a<sup>-/-</sup>, sox2<sup>-/-</sup> and sox2<sup>-/-</sup>, sox3<sup>-/-</sup> double mutants.

      Author response image 4.

      All 3 datasets were found to be normally distributed by Shapiro-Wilk test. One-way ANOVA showed significance (p<0.0001), with Tukey’s multiple comparisons test showing a significant difference between all 3 conditions (***p=0.0008, ****p<0.0001).
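      To make the ANOVA step in this caption concrete, the following is a minimal sketch, not the analysis actually run for the graph. It computes the one-way ANOVA F statistic for three groups and, as a swapped-in simplification, estimates the p-value by permutation instead of from the F distribution. It does not perform the Shapiro-Wilk or Tukey steps. The group names and all numbers are invented placeholders, not data from the study.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_anova(groups, n_perm=2000, seed=0):
    """Return (F, p), where p is estimated by shuffling group labels."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical L1 positions (arbitrary units) for three conditions.
group_a = [5.1, 5.4, 4.9, 5.3, 5.0, 5.2]
group_b = [6.0, 6.3, 5.8, 6.1, 6.4, 5.9]
group_c = [8.2, 7.9, 8.5, 8.1, 7.8, 8.4]

f_value, p_value = permutation_anova([group_a, group_b, group_c])
print(f"F = {f_value:.1f}, permutation p = {p_value:.4f}")
```

      A permutation p-value is used here only to keep the sketch free of external dependencies; a parametric ANOVA followed by a multiple-comparisons test, as described in the caption, would normally come from a statistics package.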

      Reviewer #1:

      It would be good to more clearly state why sox3 is not regulated by Wnt given its expression is inhibited by the delta TCF construct (Figure 2M).

      The explanation for why we believe sox3 expression is determined by Fgf signaling, and not Wnt activity, requires integrating what is observed both with induction of the delta TCF construct and the dominant negative Fgf receptor (DN FgfR). Loss of sox3 expression with induced expression of the delta TCF construct could result from loss of Wnt activity or the downstream loss of Fgf activity, which is ultimately dependent on Fgfs secreted by Wnt active cells in the leading domain. Distinguishing between these possibilities is based on inhibition of FGF signaling with the DN FgfR, described in the next paragraph. Heat shock-induced expression of DN FgfR results in loss of FGF signaling and the simultaneous expansion of Wnt activity into the trailing zone. As explained in the original text, loss of sox3 expression in this context, rather than its expansion, suggests its expression is determined by Fgf signaling not Wnt activity. We will emphasize that its loss, rather than its expansion, following induction of DN FgfR, indicates its expression is determined by Fgf signaling not Wnt activity.

      Reviewer #2:

      The manuscript lacks quantification of many of the experiments, making it difficult to conclude their significance.

      One of the biggest inadvertent omissions of the paper was the inadequate quantification of some of the results. Quantification of results with considerable variation in the outcome, like the pattern of L1 deposition, was provided following manipulations where various combinations of sox1a, sox2, and sox3 function were lost (Figure 3, supplementary Figures 2 and 3) or where sox2MO1/sox3MO was used with or without IWR (Figure 5 and Figure 6). However, numbers for the experiments in Figure 2 were omitted in the Figure legend, where typically about 10 embryos for each manipulation were photographed and scored, and a representative image was used to make the figure. In these experiments there was a very consistent result with 100% of the embryos showing changes represented by each panel in Figure 2. The only exception was Figure 2Y where 9/10 embryos showed the described change. Similarly in Figure 4 there was a consistent result and 100% of embryos showed the change shown. Numbers and statistics for these results will be included in a revised manuscript.

      Reviewer #2:

      The statistical analysis in Figure 5 and Supplementary Figures 2 and 3 should be one-way ANOVA or Kruskal-Wallis with a Dunn's multiple comparisons test rather than pair-wise comparisons.

      The analysis has been re-done following the reviewer’s suggestions. The analysis confirms the primary conclusions of the original submission, and this analysis will be incorporated in a revised manuscript. However, to improve the power of the analysis, experiments with low numbers of embryos will be repeated.

      See redone graphs in Figure 5 and supplementary Figure 2 and 3.

    1. Author response:

      Reviewer #1 (Public Review):

      Summary:

      For many years, there has been extensive electrophysiological research investigating the relationship between local field potential patterns and individual cell spike patterns in the hippocampus. In this study, using state-of-the-art imaging techniques, they examined spike synchrony of hippocampal cells during locomotion and immobility states. In contrast to conventional understanding of the hippocampus, the authors demonstrated that hippocampal place cells exhibit prominent synchronous spikes locked to theta oscillations.

      Strengths:

      The voltage imaging used in this study is a highly novel method that allows recording not only suprathreshold-level spikes but also subthreshold-level activity. With its high frame rate, it offers time resolution comparable to electrophysiological recordings. Moreover, it enables the visualization of actual cell locations, allowing for the examination of spatial properties (e.g., Figure 4G).

      We thank the reviewer for pointing out the technical novelty of this work.

      Weaknesses:

      There is a notable deviation from several observations obtained through conventional electrophysiological recordings. Particularly, as mentioned below in detail, the considerable differences in baseline firing rates and no observations of ripple-triggered firing patterns raise some concerns about potential artifacts from imaging and analysis, such as cell toxicity, abnormal excitability, and false detection of spikes. While these findings are intriguing if the validity of these methods is properly proven, accepting the current results as new insights is challenging.

      We appreciate the reviewer’s insightful comments regarding the intriguing aspect of our findings. Indeed, the emergence of a novel form of CA1 population synchrony presents exciting implications for hippocampal memory research and beyond.

      While we acknowledge the deviations from conventional electrophysiological recordings, we respectfully contend that these differences do not necessarily imply methodological flaws. All experiments and analyses were conducted with meticulous adherence to established standards in the field.

      Regarding the observed variations in average firing rates, it is important to note the well-documented heterogeneity in CA1 pyramidal neuron firing rates, spanning from 0.01 to 10 Hz, with a skewed distribution toward lower frequencies (Mizuseki et al., 2013). Our exclusion criteria for neurons with low estimated firing rates may have inadvertently biased the selection towards more active neurons. Moreover, prior research has indicated that average firing rates tend to increase during exposure to novel environments (Karlsson et al., 2008), and among deep-layer CA1 pyramidal neurons (Mizuseki et al., 2011). Given our recording setup in a highly novel environment and the predominance of deep CA1 pyramidal neurons in our sample, the observed higher average firing rates could be influenced by these factors. Considering these points, our mean firing rates (3.2 Hz) are reasonable estimations compared to previously reported values obtained from electrophysiological recordings (2.1 Hz in McHugh et al., 1996 and 2.4-2.6 Hz in Buzsaki et al., 2003).

      Regarding concerns about potential cell toxicity, previous studies have shown that Voltron expression and illumination do not significantly alter membrane resistance, membrane capacitance, resting membrane potentials, spike amplitudes, and spike width (see Abdelfattah 2019, Science, Supplementary Figure 11 and 12). In our recordings, imaged neurons exhibit preserved membrane and dendritic morphology during and after experiments (Author response image 1), supporting the absence of significant toxicity.

      Author response image 1.

      Voltron-expressing neurons exhibit preserved membrane and dendritic morphology. (A) Images of two-photon z-stack maximum intensity projection showing Voltron-expressing neurons taken after voltage imaging experiments in vivo. (B) Post-hoc histological images of neurons being voltage-imaged.

      Regarding spike detection, we use validated algorithms (Abdelfattah et al., 2019 and 2023) to ensure robust and reliable detection of spikes. Spiking activity was first separated from slower subthreshold potentials using high-pass filtering. This way, a slow fluorescence increase will not be detected as a spike, even if its amplitude is large. We benchmarked the detection algorithm in computer simulation. The sensitivity and specificity of the algorithm exceed 98% at the level of signal-to-noise ratio of our recordings. While we acknowledge that a small number of spikes, particularly those occurring later in a burst, might be missed due to their smaller amplitudes (as illustrated in Figure 1 and 2 of the manuscript), we anticipate that any missed spikes would lead to a decrease rather than an increase in synchrony between neurons. Overall, we are confident that spike detection is performed in a rigorous and robust manner.
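      The reasoning above (separate fast spiking from slow subthreshold changes by high-pass filtering, then detect spikes) can be illustrated with a toy sketch. This is not the validated algorithm of Abdelfattah et al.; it is a simplified stand-in under stated assumptions: the high-pass step is approximated by subtracting a centred moving average, the threshold is a multiple of a robust noise estimate, and the function names, window length, threshold factor and synthetic trace are all hypothetical.

```python
from statistics import median


def high_pass(trace, window=21):
    """Crude high-pass filter: subtract a centred moving average."""
    half = window // 2
    filtered = []
    for i in range(len(trace)):
        lo = max(0, i - half)
        hi = min(len(trace), i + half + 1)
        local_mean = sum(trace[lo:hi]) / (hi - lo)
        filtered.append(trace[i] - local_mean)
    return filtered


def detect_spikes(trace, window=21, k=5.0):
    """Return indices of local maxima exceeding k robust standard deviations."""
    filtered = high_pass(trace, window)
    centre = median(filtered)
    # Median absolute deviation, scaled to approximate a standard deviation.
    mad = median([abs(x - centre) for x in filtered])
    threshold = centre + k * 1.4826 * mad
    return [
        i for i in range(1, len(filtered) - 1)
        if filtered[i] > threshold
        and filtered[i] > filtered[i - 1]
        and filtered[i] >= filtered[i + 1]
    ]


# Synthetic trace: a large slow ramp, small periodic "noise", three fast spikes.
true_spikes = [50, 150, 250]
slow_only = [0.02 * i + 0.05 * ((i * 7) % 5 - 2) for i in range(300)]
trace = list(slow_only)
for t in true_spikes:
    trace[t] += 3.0

detected = detect_spikes(trace)
print("detected spike indices:", detected)
print("spikes found in slow-only trace:", detect_spikes(slow_only))
```

      In this toy trace the slow ramp is twice the amplitude of the spikes, yet only the fast transients survive the filtering step, which is the property the response relies on.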

      To further strengthen these points, we will include the following in the revision:

      (1) Histological images of recorded neurons during and after experiments.

      (2) Further details regarding the validation of spike detection algorithms.

      (3) Analysis of publicly available electrophysiological datasets.

      (4) Discussion regarding the reasons behind the novelty of some of our findings compared to previous observations.

      In conclusion, we assert that our experimental and analysis approach upholds rigorous standards. We remain committed to reconciling our findings with previous observations and welcome further scrutiny and engagement from the scientific community to explore the intriguing implications of our findings.

      Reviewer #2 (Public Review):

      Summary:

      This study employed voltage imaging in the CA1 region of the mouse hippocampus during the exploration of a novel environment. The authors report synchronous activity, involving almost half of the imaged neurons, occurred during periods of immobility. These events did not correlate with SWRs, but instead, occurred during theta oscillations and were phased-locked to the trough of theta. Moreover, pairs of neurons with high synchronization tended to display non-overlapping place fields, leading the authors to suggest these events may play a role in binding a distributed representation of the context.

      We thank the reviewer for a thorough and thoughtful review of our paper.

      Strengths:

      Technically this is an impressive study, using an emerging approach that allows single-cell resolution voltage imaging in animals, that while head-fixed, can move through a real environment. The paper is written clearly and suggests novel observations about population-level activity in CA1.

      We thank the reviewer for pointing out the technical strength and the novelty of our observations.

      Weaknesses:

      The evidence provided is weak, with the authors making surprising population-level claims based on a very sparse data set (5 data sets, each with less than 20 neurons simultaneously recorded) acquired with exciting, but less tested technology. Further, while the authors link these observations to the novelty of the context, both in the title and text, they do not include data from subsequent visits to support this. Detailed comments are below:

      We understand the reviewer’s concerns regarding the size of the dataset. Despite this limitation, it is important to note that synchronous ensembles beyond what could be expected from chance (jittering) were detected in all examined data. In the revision, we plan to add more data, including data from subsequent visits, to further strengthen our findings.

      (1) My first question for the authors, which is not addressed in the discussion, is why these events have not been observed in the countless extracellular recording experiments conducted in rodent CA1 during the exploration of novel environments. Those data sets often have 10x the neurons simultaneously recording compared to these present data, thus the highly synchronous firing should be very hard to miss. Ideally, the authors could confirm their claims via the analysis of publicly available electrophysiology data sets. Further, the claim of high extra-SWR synchrony is complicated by the observation that their recorded neurons fail to spike during the limited number of SWRs recorded during behavior- again, not agreeing with much of the previous electrophysiological recordings.

      We understand the reviewer’s concern. We will examine publicly available electrophysiology datasets to gain further insights into any similarities and differences to our findings. Based on these results, we will discuss why these events have not been previously observed/reported.

      (2) The authors posit that these events are linked to the novelty of the context, both in the text, as well as in the title and abstract. However, they do not include any imaging data from subsequent days to demonstrate the failure to see this synchrony in a familiar environment. If these data are available it would strengthen the proposed link to novelty if they were included.

      We thank the reviewer for the constructive suggestion. We will acquire more datasets from subsequent visits to gain further insights into these synchronous events.

      (3) In the discussion the authors begin by speculating the theta present during these synchronous events may be slower type II or attentional theta. This can be supported by demonstrating a frequency shift in the theta recording during these events/immobility versus the theta recording during movement.

      We thank the reviewer for the constructive suggestion. We did demonstrate a frequency shift to a lower frequency in the synchrony-associated theta during immobility than during locomotion (see Fig. 4B, the red vs. blue curves). We will enlarge this panel and specifically refer to it in the corresponding discussion paragraph.

      (4) The authors mention in the discussion that they image deep-layer PCs in CA1, however, this is not mentioned in the text or methods. They should include data, such as imaging of a slice of a brain post-recording with immunohistochemistry for a layer-specific gene to support this.

      We thank the reviewer for the constructive suggestion. We do have images of brain slices post-recordings (Author response image 2). Imaged neurons are clearly located in the deep CA1 pyramidal layer. We will add these images and quantification in the revised manuscript.

      Author response image 2.

      Imaged neurons are located in the deep pyramidal layer of the dorsal hippocampal CA1 region.

      Reviewer #3 (Public Review):

      Summary:

      In the present manuscript, the authors use a few minutes of voltage imaging of CA1 pyramidal cells in head-fixed mice running on a track while local field potentials (LFPs) are recorded. The authors suggest that synchronous ensembles of neurons are differentially associated with different types of LFP patterns, theta and ripples. The experiments are flawed in that the LFP is not "local" but rather collected in the other side of the brain, and the investigation is flawed due to multiple problems with the point process analyses. The synchrony terminology refers to dozens of milliseconds as opposed to the millisecond timescale referred to in prior work, and the interpretations do not take into account theta phase locking as a simple alternative explanation.

      We genuinely appreciate the reviewer’s feedback and acknowledge the concerns raised. However, we believe these concerns can be effectively addressed without undermining the validity of our conclusions. With this in mind, we respectfully disagree with the assessment that our experiments and investigation are flawed. Please allow us to address these concerns and offer additional context to support the validity of our study.

      Weaknesses:

      The two main messages of the manuscript indicated in the title are not supported by the data. The title gives two messages that relate to CA1 pyramidal neurons in behaving head-fixed mice: (1) synchronous ensembles are associated with theta (2) synchronous ensembles are not associated with ripples.

      There are two main methodological problems with the work:

      (1) Experimentally, the theta and ripple signals were recorded using electrophysiology from the opposite hemisphere to the one in which the spiking was monitored. However, both signals exhibit profound differences as a function of location: theta phase changes with the precise location along the proximo-distal and dorso-ventral axes, and importantly, even reverses with depth. And ripples are often a local phenomenon - independent ripples occur within a fraction of a millimeter within the same hemisphere, let alone different hemispheres. Ripples are very sensitive to the precise depth - 100 micrometers up or down, and only a positive deflection/sharp wave is evident.

      We appreciate the reviewer’s consideration regarding the collection of LFP from the contralateral hemisphere. While we acknowledge the limitation of this design, we believe that our findings still offer valuable insights into the dynamics of synchronous ensembles. Despite potential variations in theta phases with recording locations and depth, we find that the occurrence and amplitudes of theta oscillations are generally coordinated across hemispheres (Buzsaki et al., Neurosci., 2003). Therefore, the presence of prominent contralateral LFP theta around the times of synchronous ensembles in our study (see Figure 4A of the manuscript) strongly supports our conclusion regarding their association with theta oscillations, despite the collection of LFP from the opposite hemisphere.

      In addition, in our manuscript, we specifically mentioned that the “preferred phases” varied from session to session, likely due to the variability of recording locations (see Line 254-256). Therefore, we think that the reviewer’s concern regarding theta phase variability has already been addressed in the present manuscript.

      Regarding ripple oscillations, while we recognize that they can sometimes occur locally, the majority of ripples occur synchronously in both hemispheres (up to 70%, see Szabo et al., Neuron, 2022; Buzsaki et al., Neurosci., 2003). Therefore, using contralateral LFP to infer ripple occurrence on the ipsilateral side has been a common practice in the field, employed by many studies published in respectable journals (Szabo et al., Neuron, 2022; Terada et al., Nature, 2021; Dudok et al., Neuron, 2021; Geiller et al., Neuron, 2020). Furthermore, we observed that 446 synchronous ensembles during immobility do not co-occur with contralateral ripples, and the remaining 313 ensembles, which occurred during locomotion, are not associated with ripples, as ripples rarely occur during locomotion. Therefore, our conclusion that synchronous ensembles are not associated with ripple oscillations is supported by data.

      (2) The analysis of the point process data (spike trains) is entirely flawed. There are many technical issues: complex spikes ("bursts") are not accounted for; differences in spike counts between the various conditions ("locomotion" and "immobility") are not accounted for; the pooling of multiple CCGs assumes independence, whereas even conditional independence cannot be assumed; etc.

      We acknowledge the reviewer’s concern regarding spike train analysis. Indeed, complex bursts or different behavioral conditions can lead to differences in spike counts that could potentially affect the detection of synchronous ensembles. However, our jittering procedure (see Line 121-132) is designed to control for the variation of spike counts. Importantly, while the jittered spike trains also contain the same spike count variations, we found 7.8-fold more synchronous events in our data compared to jitter controls (see Figure 1G of the manuscript), indicating that these factors cannot account for the observed synchrony.

      To explicitly demonstrate that complex bursts cannot account for the observed synchrony, we have performed additional analysis to remove all latter spikes in bursts and only count the single and the first spikes of bursts. Importantly, we found that this procedure did not change the rate and size of synchronous ensembles, nor did it significantly alter the grand-average CCG (see Author response image 3). The results of this analysis explicitly rule out a significant effect of complex spikes on the analysis of synchronous ensembles.
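
      The burst-removal control described above could be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function name and the 15 ms inter-spike-interval threshold used to define a burst are assumptions made here for the example.

```python
import numpy as np

def drop_later_burst_spikes(spike_times_s, max_burst_isi_s=0.015):
    """Keep single spikes and the first spike of each burst.

    A spike counts as a later spike in a burst when it follows the previous
    spike by less than max_burst_isi_s (an assumed, illustrative threshold).
    """
    spikes = np.sort(np.asarray(spike_times_s, dtype=float))
    if spikes.size == 0:
        return spikes
    isi = np.diff(spikes)
    # The first spike is always kept; later spikes need a long enough ISI.
    keep = np.concatenate(([True], isi >= max_burst_isi_s))
    return spikes[keep]

# Hypothetical spike train: two bursts (at 0.100 s and 0.500 s) and one single spike.
spikes = [0.100, 0.105, 0.111, 0.300, 0.500, 0.508]
first_spikes = drop_later_burst_spikes(spikes)
print(first_spikes)
```

      Synchrony detection would then be rerun on the reduced spike trains and compared with the result from the full spike trains.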

      Author response image 3.

      Population synchrony remains after the removal of spikes in bursts. (A) The grand-average cross correlogram (CCG) was calculated using spike trains without latter spikes in bursts. The gray line represents the mean grand average CCG between reference cells and randomly selected cells from different sessions. (B) Pairwise comparison of the event rates of population synchrony between spike trains containing all spikes and spike trains without latter spikes in bursts. Bar heights indicate group means (n=10 segments, p=0.036, Wilcoxon signed-rank test). (C) Histogram of the ensemble sizes as percentages of cells participating in the synchronous ensembles.

      Beyond those methodological issues, there are two main interpretational problems: (1) the "synchronous ensembles" may be completely consistent with phase locking to the intracellular theta (as even shown by the authors themselves in some of the supplementary figures).

      We agree with the reviewer that the synchronous ensembles are indeed consistent with theta phase locking. However, it is important to note that theta phase locking alone does not necessarily imply population synchrony. In fact, theta phase locking has been shown to “reduce” population synchrony in a previous study (Mizuseki et al., 2014, Phil. Trans. R. Soc. B.). Thus, the presence of theta phase locking cannot be taken as a simple alternative explanation of the synchronous ensembles.

      To directly assess the contribution of theta phase locking to synchronous ensembles, we have performed a new analysis to randomize the specific theta cycles in which neurons spike, while keeping the spike phases constant. This manipulation disrupts spike co-occurrence while preserving theta phase locking, allowing us to test whether theta phase locking alone can explain the population synchrony, or whether spike co-occurrence in specific cycles is required. The grand-average CCG shows a much smaller peak compared to the original peak (Author response image 4A). Moreover, synchronous event rates show a 4.5-fold decrease in the randomized data compared to the original event rates (Author response image 4B). Thus, the new analysis reveals theta phase locking alone cannot account for the population synchrony.
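
      One way to picture this control is the sketch below, which moves each spike to a randomly chosen theta cycle while keeping its phase within the cycle. It is an illustration under stated assumptions, not the authors' implementation: theta cycle boundaries (`cycle_edges_s`) are assumed to have been detected beforehand, phase is expressed as a fraction of the cycle, and the function names are hypothetical.

```python
import numpy as np

def phase_fraction(spike_times_s, cycle_edges_s):
    """Phase of each spike as a fraction of its theta cycle (0 = start, 1 = end)."""
    spikes = np.asarray(spike_times_s, dtype=float)
    edges = np.asarray(cycle_edges_s, dtype=float)
    cycle = np.searchsorted(edges, spikes, side="right") - 1
    return (spikes - edges[cycle]) / np.diff(edges)[cycle]

def shuffle_theta_cycles(spike_times_s, cycle_edges_s, rng):
    """Move each spike to a random theta cycle while preserving its phase."""
    spikes = np.asarray(spike_times_s, dtype=float)
    edges = np.asarray(cycle_edges_s, dtype=float)
    # Only spikes that fall inside the detected cycles can be reassigned.
    spikes = spikes[(spikes >= edges[0]) & (spikes < edges[-1])]
    phase = phase_fraction(spikes, edges)
    durations = np.diff(edges)
    new_cycle = rng.integers(0, durations.size, size=spikes.size)
    return np.sort(edges[new_cycle] + phase * durations[new_cycle])

# Hypothetical example: eight 125 ms theta cycles and four spikes.
edges = np.arange(9) * 0.125
spikes = np.array([0.03, 0.16, 0.40, 0.77])
shuffled = shuffle_theta_cycles(spikes, edges, np.random.default_rng(0))
```

      Applying the shuffle independently to each neuron breaks spike co-occurrence within specific cycles while leaving each neuron's phase distribution unchanged, which is the property the control relies on.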

      Author response image 4.

      Drastic reduction of population synchrony by randomizing spikes to other theta cycles while preserving the phases. (A) The grand-average cross correlogram (CCG) was calculated using original spike trains (black) and randomized spike trains where theta phases of the spikes are kept the same but spike timings were randomly moved to other theta cycles (red). (B) Pairwise comparison of the event rates of population synchrony between the original spike trains and randomized spike trains (n=10 segments, p=0.002, Wilcoxon signed-rank test). Bar heights indicate group means. ** p<0.01

      (2) The definition of "synchrony" in the present work is very loose and refers to timescales of 20-30 ms. In previous literature that relates to synchrony of point processes, the timescales discussed are 1-2 ms, and longer timescales are referred to as the "baseline" which is actually removed (using smoothing, jittering, etc.).

      Regarding the timescale of synchronous ensembles, we acknowledge that it varies considerably across studies and cell types. However, it is important to note that a timescale of dozens, or even hundreds of milliseconds is common for synchrony terminology in CA1 pyramidal neurons (see Csicsvari et al., Neuron, 2000; Harris et al., Science, 2003; Malvache et al., Science, 2016; Yagi et al., Cell Reports, 2023). In fact, a timescale of 20-30 ms is considered particularly important for information transmission and storage in CA1, as it matches the membrane time constant of pyramidal neurons, the period of hippocampal gamma oscillations, and the time window for synaptic plasticity. Therefore, we believe that this timescale is relevant and in line with established practices in the field.

    1. Author response:

      eLife Assessment

      This useful study integrates experimental methods from materials science with psychophysical methods to investigate how frictional stabilities influence tactile surface discrimination. The authors argue that force fluctuations arising from transitions between frictional sliding conditions facilitate the discrimination of surfaces with similar friction coefficients. However, the reliance on friction data obtained from an artificial finger, together with the ambiguous correlative analyses relating these measurements to human psychophysics, renders the findings incomplete.

      Our main goal with this paper was to show that the most common metric, i.e., the average friction coefficient, which is widely used in tactile perception and device design, is fundamentally unsound, and to offer a secondary parameter that is compatible with the fact that human motion is unconstrained, leading to dynamic interfacial mechanics. In contrast with the summary assessment, we also note that the average friction coefficients in our study were not particularly similar, ranging from differences of 0.4 – 1, a typical range seen in most studies. We believe some of the comments originate from a misinterpretation of our statistically significant, but negative, correlation between human results and friction coefficients, which leads to the spurious conclusion that nearly identical objects should be very easy to tell apart, thus supporting our central argument for the need of an alternative. We understand that the Reviewers want to see a demonstration that humans use instabilities in situ. This is seemingly reasonable, but we explain the significant challenges and fundamental unknowns in those experiments. However, we modified our title to reflect our focus on offering an alternative to the average coefficient of friction.

      We do not think it was feasible, at this stage, to demonstrate that humans use friction instabilities through direct manipulation and observation in human participants. In short, there are still several fundamental unknowns: (1) a decision-making model would need to be created, but it is unknown if tactile decision making follows other models; (2) it is further unknown what constitutes “tactile evidence”, though at our manuscript’s conclusion, we propose that friction instabilities are better suited to be tactile evidence than the averaging of friction coefficients from a narrow range of human exploration; (3) in the design of samples, from a friction mechanics and materials perspective, it is not, at this point, possible to pre-program surfaces a priori to deliver friction instabilities; these must instead be determined experimentally, especially when attempting to achieve this in controlled surfaces that do not create other overriding tactile cues, like macroscopic bumps or large differences in surface roughness. (4) Given that the basis for tactile percepts, like which object feels “rougher” or “smoother”, is not sufficiently established and, as we have seen, leads to confusion, it is necessary to use a 3-alternative forced choice task which avoids asking about objects along a preset perceptual dimension, a challenge recognized by Reviewer 3. However, this would bring in issues of memory in the decision-making model. (5) The prior points are compounded by the fact that, we believe, tactile exploration must be performed in an unconstrained manner, i.e., without an apparatus generating motion onto a stationary finger. Work by Liu et al. (IEEE ToH, 2024) showed that recreating friction obtained during free exploration onto a stationary finger was uninterpretable by the participants, hinting at the importance of efference copies(1).

      We believe that each of the above-mentioned issues constitutes a significant advance in knowledge and would require discussion and dissemination with the community. Finally, one of our overarching goals is to create a consistent method to characterize surfaces, and given individual variability in human fingers and motion, a machine-based method that can rapidly, consistently, and sufficiently replicate tactile exploration is needed.

      Finally, we also justify our use of a mock finger to provide a method to characterize surfaces in tactile studies that other researchers could reasonably recreate, without creating a standard around individual humans, considering the variability in finger shape and motion during exploration. We do not believe this is an “either-or” argument, but rather that standardized methods to characterize surfaces and devices are greatly needed in the field. From these standardized methods, like surface roughness, some tabulated values of friction coefficient, or surface energy, etc., the current metrics to parameterize results are largely incapable of capturing the dynamic changes in forces expected during human tactile exploration.

      Our changes to the manuscript (Page 1 & SI Page 1, Title)

      “Alternatives to Friction Coefficient: Role of Frictional Instabilities for Fine Touch Perception”

      Reviewer 1 (Public review):

      Summary:

      In this paper, Derkaloustian et al. look at the important topic of what affects fine touch perception. The observations that there may be some level of correlation with instabilities are intriguing. They attempted to characterize different materials by counting the frequency (occurrence #, not of vibration) of instabilities at various speeds and forces of a PDMS slab pulled lengthwise over the material. They then had humans make the same vertical motion to discriminate between these samples. They correlated the % correct in discrimination with differences in frequency of steady sliding over the design space as well as other traditional parameters such as friction coefficient and roughness. The authors pose an interesting hypothesis and make an interesting observation about the occurrences of instability regimes in different materials while in contact with PDMS, which is interesting for the community to see in the publication. It should be noted that the finger is complex, however, and there are many factors that may be quite oversimplified with the use of the PDMS finger, and the consideration and discounting of other parameters are not fully discussed in the main text or SI. Most importantly, however, the conclusions as stated do not align with the primary summary of the data in Figure 2.

      Strengths:

      The strength of this paper is in its intriguing hypothesis and important observation that instabilities may contribute to what humans are detecting as differences in these apparently similar samples.

      We thank Reviewer 1 for their time on the manuscript, recognizing the approach we took, and offering constructive feedback. We believe that our conclusions are, in fact, supported by the primary summary of the data in Figure 2, but our use of R<sup>2</sup> could have led to misinterpretation. The trend with friction coefficient and percent correct was indeed statistically significant but was spurious because the slope was negative. In the revision, we add clarifying comments throughout, change from R<sup>2</sup> to r to highlight the negative trend, and adjust the figures to better focus on friction coefficient.

      Finally, we added a new section to discuss the tradeoffs between using a real human finger versus a mock finger, and which situations may warrant the use of one or the other. In short, for our goal of characterizing surfaces to be used in tactile experiments, we believe a mock finger is more sustainable and practical than using real humans because human fingers are unique per participant, humans move their fingers at constantly changing pressures and velocities, and friction generated during free human exploration cannot be satisfactorily replicated by moving a sample onto a stationary finger. However, we do not disagree that for other types of experiments, characterizing a human participant directly may be more advantageous.

      Weaknesses:

      Comment 1 - The most important weakness is that the findings do not support the statements of findings made in the abstract. Of specific note in this regard is the primary correlation in Figure 2B between SS (steady sliding) and percent correct discrimination. While the statistical test shows significance (and is interesting!), the R-squared value is 0.38, while the R-squared value for the "Friction Coefficient vs. Percent Correct" plot has an R-squared of 0.6 and a p-value of < 0.01 (including Figure 2B). This suggests that the results do not support the claim in the abstract: "We found that participant accuracy in tactile discrimination was most strongly correlated with formations of steady sliding, and response times were negatively correlated with stiction spikes. Conversely, traditional metrics like surface roughness or average friction coefficient did not predict tactile discriminability."

      We disagree that the trend with friction coefficient suggests the results do not support the claim because the correlation was found to be negative. However, we could have made the comparison more apparent and expanded on this point, given its novelty.

      While the R<sup>2</sup> value corresponding to the “Friction Coefficient vs. Percent Correct” plot is notably higher, our results show that the slope is negative, which would be statistically spurious. This is because a negative correlation between percent correct (accuracy in discriminating surfaces) and difference in friction coefficient means that the more similar two surfaces are (by friction coefficient), the easier it would be for people to tell them apart. That is, it incorrectly concludes that two identical surfaces would be much easier to tell apart than two surfaces with greatly different friction coefficients.

      This is counterintuitive to nearly all existing results, but we believe our samples were well-positioned to uncover this trend by minimizing variability, by controlling multiple physical parameters in the samples, and that the friction coefficient — typically calculated in the field as an average friction coefficient — ignores all the dynamic changes in forces present in elastic systems undergoing mesoscale friction, i.e., human touch, as seen in Fig. 1 in a mock finger and Fig. 3 in a real finger. By demonstrating this statistically spurious trend, we believe this strongly supports our premise that an alternative to friction coefficient is needed in the design of tactile psychophysics and haptic interfaces.

      We believe that this could have been misinterpreted, so we took several steps to improve clarity, given the importance of this finding: we separated the panel on friction coefficient to its own panel, we changed from R<sup>2</sup> to r throughout, and we added clarifying text. We also added a small section focusing on this spurious trend.
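
      The reason for reporting r rather than R<sup>2</sup> can be illustrated with a small sketch. The numbers below are hypothetical and are not data from the study, and a plain Pearson correlation is used here as a simplified stand-in for the GLMM fits described in the response.

```python
import numpy as np

# Hypothetical values for illustration only; NOT data from the study.
delta_mu = np.array([0.4, 0.5, 0.6, 0.8, 1.0])        # difference in friction coefficient
accuracy = np.array([0.80, 0.74, 0.70, 0.61, 0.55])   # fraction of correct responses

# Pearson r keeps the direction of the trend; R^2 discards it.
r = np.corrcoef(delta_mu, accuracy)[0, 1]
r_squared = r ** 2

# Mirroring the trend flips the sign of r but leaves R^2 unchanged.
r_flipped = np.corrcoef(delta_mu, 1.0 - accuracy)[0, 1]

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}, flipped r = {r_flipped:.2f}")
```

      Two trends with opposite slopes therefore share the same R<sup>2</sup>, so a high R<sup>2</sup> alone cannot show that a larger difference in friction coefficient makes surfaces easier to tell apart.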

      Our changes to the manuscript (Page 10)

      “To compare the value of looking at frictional instabilities, we also performed GLMM fits on common approaches in the field, like a friction coefficient or material property typically used in tactile discrimination, shown in Fig. 2D-E. Interestingly, in Fig. 2D, we observed a spurious, negative correlation between friction coefficient (typically and often problematically simplified as across all tested conditions) and accuracy (r = -0.64, p < 0.01); that is, the more different the surfaces are by friction coefficient, the less people can tell them apart. This spurious correlation would be the opposite of intuition, and further calls into question the common practice of using friction coefficients in touch-related studies. The alternative, two-term model which includes adhesive contact area for friction coefficient(29) was even less predictive (see Fig. S6A of SI). We believe such a correlation could not have been uncovered previously as our samples are minimal in their physical variations. Yet, the dynamic changes in force even within a single sample are not considered, despite being a key feature of mesoscale friction during human touch.

      We investigate different material properties in Fig. 2E. Differences in average roughness R<sub>a</sub> (or other parameters, like root mean square roughness R<sub>rms</sub>; Fig. S6A of SI) did not show a statistically significant correlation to accuracy. Though roughness is a popular parameter, correlating any roughness parameter to human performance here could be moot: the limit of detecting roughness differences has previously been defined as 13 nm on structured surfaces(33) and much higher for randomly rough surfaces(46), all of which are magnitudes larger than the roughness differences between our surfaces. The differences in contact angle hysteresis – as an approximation of the adhesion contributions(47) – do not present any statistically significant effects on performance.”

      Comment 2, Part 1

      Along the same lines, other parameters that were considered such as the "Percent Correct vs. Difference in Sp" and "Percent Correct vs. Difference in SFW" were not plotted for consideration in the SI. It would be helpful to compare these results with the other three metrics in order to fully understand the relationships.

      We have added these plots to the SI. We note that we had checked these relationships and discussed them briefly, but did not include the plot. The plots show that the type of instability was not as helpful as its presence or absence.

      Our changes to the manuscript (Page 9)

      “Furthermore, a model accounting for slow frictional waves alone specifically shows a significant, negative effect on performance (p < 0.01, Fig. S5 of SI), suggesting that in these samples and task, the type of instability was not as important.”

      Added (SI Page 4)

      “and no correlation between accuracy and stiction spikes (Fig. S5).”

      Comment 2, Part 2

      Other parameters such as stiction magnitude and differences in friction coefficient over the test space could also be important and interesting.

      We agree these are interesting and have thought about them. We are aware that others, like Gueorguiev et al., have studied stiction magnitudes, and though there was a correlation, the physical differences in surface roughness (glass versus PMMA) investigated made it unclear if these could be generalized further(2). We are unsure how to proceed here with a satisfactory analysis of stiction magnitude, given that stiction spikes are not always generated. In fact, Fig. 1 shows that for many velocities and pressures, they do not form. However, we offer some speculation on why stiction spikes may be overrepresented in the literature:

      (1) They are prone to being created if the finger was loaded for a long time onto a surface prior to movement, thus creating adhesion by contact aging, which is unlike active human exploration. We avoid this by discarding the first pull in our measurements, which is a standard practice in mechanical characterization if contact aging needs to be avoided.

      (2) The ranges of velocities and pressures explored were small.

      (3) In an effort to generate strong tactile stimuli, highly adhesive or rough surfaces are used.

      (4) They are visually distinctive on a plot, but we are unaware of any mechanistic reason that mechanoreceptors would be extremely sensitive to this low frequency event over other signals.

      In ongoing work, however, we are always cognizant that if stiction spikes are a dominant factor, then a secondary analysis on their magnitude would be important.

      We interpret “difference in friction coefficient over the test space” to be, for a single surface, like C4, to find the highest average friction for a condition of single velocity and mass and subtract that from the lowest average friction for a condition of single velocity and mass. We calculated the difference in friction coefficient in the typical manner of the field, by averaging all data collected at all velocities and masses and assigning a single value for all of a surface, like C4. We had performed this, and have the data, but we are wary of overinterpreting secondary and tertiary metrics because they do not have any fundamental basis in traditional tribology, and this value, if used by humans, would suggest that they rapidly explore a large parameter space to find a “maximum” and “minimum” friction. Furthermore, the range in friction across the test space, after averaging, may in fact, be smaller than the range of friction in a single measurement. For example, in Fig. 1B, the friction coefficient can be calculated by dividing the data by the normal force ([applied mass + 6 g finger] × gravity). The friction coefficient in a single run varies widely, as expected.
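
      The conversion described above, dividing a force trace by the normal force ([applied mass + 6 g finger] × gravity), can be sketched as follows. The force values are hypothetical and the function name is an assumption introduced for illustration; only the 6 g finger deadweight and the form of the calculation come from the text.

```python
import numpy as np

G = 9.81              # m s^-2, standard gravity (rounded)
FINGER_MASS_G = 6.0   # deadweight of the mock finger in grams, as stated in the text

def friction_coefficient(friction_force_n, applied_mass_g):
    """Instantaneous friction coefficient: friction force / normal force."""
    normal_force_n = (applied_mass_g + FINGER_MASS_G) * 1e-3 * G
    return np.asarray(friction_force_n, dtype=float) / normal_force_n

# Hypothetical force trace in newtons for a single slide at 25 g applied mass.
trace_n = np.array([0.20, 0.35, 0.28, 0.50])
mu = friction_coefficient(trace_n, applied_mass_g=25)

# The average collapses the trace to one number; the range shows what is lost.
print(mu.mean(), mu.max() - mu.min())
```

      Comparing the mean with the within-trace range makes the point in the text concrete: a single averaged value can hide variation within one measurement that is as large as, or larger than, the variation across conditions.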

      Fig. 2D shows a GLMM fit between percent correct responses across our pairs and the differences in friction coefficient for each pair, where we see a spurious negative correlation. As we had the data of all average friction coefficients for each condition for a given material, we also looked at the difference in maximum and minimum friction coefficients. For our tested pairs, these differences also lined up on a statistically significant, negative GLMM fit (r = -0.86, p < 0.005). However, the values for a given surface can vary drastically, with an interquartile range of 1.20 to 2.09 on a single surface. We fit participant accuracy to the differences in these IQRs across pairs. This also led to a negative GLMM fit (r = -0.65, p < 0.05). However, we are hesitant to add this to the manuscript for the reasons stated previously.

      Comment 3, Part 1

      Beyond this fundamental concern, there is a weakness in the representativeness of the PDMS finger, the vertical motion, and the speed of sliding to real human exploration.

      Overall, this is a continuous debate that we think offers two solutions. There is always a tradeoff between using a synthetic model of a finger versus a real human finger, and there is a place for both models. That is, while our mock finger will be more successful the closer it is to a human finger, it is not our goal to fully replace a human finger; rather, our goal is to provide a method of characterizing surfaces that is indeed relevant on the length scale of human touch.

      The usefulness of the mock finger is in isolating the features of each surface that are independent of human variability, i.e., instabilities that form without changing loading conditions between sliding motions or even within one sliding motion. Of course, with this method, we still require confirmation that these features form during human exploration, which we show in Fig. 3.

      We believe that this method of characterizing surfaces at the mesoscale will ultimately lead to more successful human studies on tactile perception. Currently, and as shown in the paper, characterizing surfaces through traditional techniques, such as a commercial tribometer (friction coefficient, using a steel or hard metal ball), roughness (via atomic force microscopy or some other metrology), and surface energy, is less predictive. Thus, we believe this mock finger is stronger than the current state of the art for characterizing surfaces (we are also aware of a commercial mock finger company, but we were unable to purchase or obtain an evaluation model).

      One of the main – and severe – limitations of using a human finger is that all fingers are different, meaning any study focusing on a particular user may not apply to others or be recreated easily by other researchers. We cannot set a standard for replication around a real human finger as that participant may no longer be available, or willing to travel the world as a “standard”. Furthermore, the way in which each participant changes their pressures and velocities is different. We note that this is a challenge unique to touch perception – how an object is touched changes the friction generated, and thus the tactile stimulus generated, whereas a standardized stimulus is more straightforward for light or sound.

      However, we do emphasize that we have strongly considered the balance between feasibility and ecological validity in the design of a mock finger. We have a mock finger, with the three components of stiffness of a human finger (more below). Furthermore, we have also successfully used this mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were predictive of human performance(3-6).

      Our changes to the manuscript Added (Page 2-3)

      “Mock finger as a characterization tool

      In this work, we use a mechanical setup with a PDMS mock finger to derive tactile predictors from controlled friction traces alternative to average friction coefficients. While there is a tradeoff in selecting a synthetic finger over a more accurate, real human finger in modeling touch, our aim to design a method of mesoscale surface characterization for more successful studies on tactile perception cannot be fulfilled using one human participant as a standard. We believe that with sufficient replication of surface and bulk properties as well as contact geometry, and controlled friction measurements collected at loading conditions observed during a tactile discrimination task, we can isolate unique frictional features of a set of surfaces that do not arise from human-to-human variability.

      The major component of a human finger, by volume, is soft tissue (~56%)(22), resulting in an effective modulus close to 100 kPa(23,24). In order to achieve this same softness, we crosslink PDMS in a 1×1×5 cm mold at a 30:1 elastomer:crosslinker ratio. However, two more features impart increased stiffness in a human finger. Most of this added rigidity is derived from the bone at the fingertip, the distal phalanx(23–25), which we mimic with an acrylic bone within our PDMS network. The stratum corneum, the stiffer, glassier outer layer of skin(26), is replicated with the surface of the mock finger glassified, or further crosslinked, after 8 hours of UV-Ozone treatment(27). This treatment also modifies the surface properties of the native PDMS to align with those of a human finger more closely. It minimizes the viscoelastic tack at the surface, resulting in a comparable non-sticky surface. At least one day after treatment, the finger surface returns to moderate hydrophilicity (~60º), as is typically observed for a real finger(28).

      The initial contact area formed before a friction trace is collected is a rectangle of 1×1 cm. While this shape is not entirely representative of a human finger with curves and ridges, human fingers flatten out enough to reduce the effects of curvature with even very light pressures(28–30). This implies that regardless of finger pressure, the contact area is largely load-independent, which is more accurately replicated with a rectangular mock finger. It is still a challenge to control pressure distribution with this planar interface, but non-uniform pressures are also expected during human exploration.

      Lastly, we consider fingerprints vs. flat fingers. A key finding of our previous work is that while fingerprints enhanced frictional dynamics at certain conditions, key features were still maintained with a flat finger(7). Furthermore, for some loading conditions, the more amplified signals could also result in more similar friction traces for different surfaces. We have continued to use flat fingers in our mechanical experiments, and have observed good agreement between these friction traces and human experiments(7,8,21,31).”

      (Page 3-4, Materials and Methods)

      “Mock Finger Preparation

      Friction forces across all six surfaces were measured using a custom apparatus with a polydimethylsiloxane (PDMS, Dow Sylgard 184) mock finger that mimics a human finger’s mechanical properties and contact mechanics while exploring a surface relatively closely(7,8). PDMS and crosslinker were combined in a 30:1 ratio to achieve a stiffness of 100 kPa comparable to a real finger, then degassed in a vacuum desiccator for 30 minutes. We are aware that the manufacturer recommended crosslinking ratio for Sylgard 184 is 10:1 due to potential uncrosslinked liquid residues(32), but further crosslinking concentrated at the surface prevents this. The prepared PDMS was then poured into a 1×1×5 cm mold also containing an acrylic 3D-printed “bone” to attach applied masses on top of the “fingertip” area contacting a surface during friction testing. After crosslinking in the mold at 60ºC for 1 hour, the finger was treated with UV-Ozone for 8 hours out of the mold to minimize viscoelastic tack.

      Mechanical Testing

      A custom device using our PDMS mock finger was used to collect macroscopic friction force traces replicating human exploration(7,8). After placing a sample surface on a stage, the finger was lowered at a slight angle such that an initial 1×1 cm rectangle of “fingertip” contact area could be established. We considered a broad range of applied masses (M = 0, 25, 75, and 100 g) added onto the deadweight of the finger (6 g) observed during a tactile discrimination task. The other side of the sensor was connected to a motorized stage (V-508 PIMag Precision Linear Stage, Physikinstrumente) to control both displacement (4 mm across all conditions) and sliding velocity (v = 5, 10, 25, and 45 mm s<sup>-1</sup>). Forces were measured at all 16 combinations of mass and velocity via a 250 g Futek force sensor (k = 13.9 kN m<sup>-1</sup>) threaded to the bone, and recorded at an average sampling rate of 550 Hz with a Keithley 7510 DMM digitized multimeter. Force traces were collected in sets of 4 slides, discarding the first due to contact aging. Because some mass-velocity combinations were near the boundaries of instability phase transitions, not all force traces at these given conditions exhibited similar profiles.

      Thus, three sets were collected on fresh spots for each condition to observe enough occurrences of multiple instabilities, at a total of nine traces per combination for each surface.”
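      The bookkeeping in the protocol quoted above can be sketched as follows. This is a minimal illustration rather than the authors' acquisition code: the constant names and the `build_test_matrix` helper are hypothetical, and the per-slide sample count assumes the stage moves at constant velocity over the full 4 mm stroke.

```python
from itertools import product

# Values quoted from the Methods text above.
ADDED_MASSES_G = [0, 25, 75, 100]   # applied masses M
VELOCITIES_MM_S = [5, 10, 25, 45]   # sliding velocities v
DISPLACEMENT_MM = 4.0               # stroke length for all conditions
SAMPLING_RATE_HZ = 550              # average sampling rate
SLIDES_PER_SET = 4                  # the first slide of each set is discarded
SETS_PER_CONDITION = 3              # each set is collected on a fresh spot


def build_test_matrix():
    """Return one record per mass-velocity condition with derived quantities."""
    kept_per_set = SLIDES_PER_SET - 1  # drop the contact-aged first slide
    matrix = []
    for mass, velocity in product(ADDED_MASSES_G, VELOCITIES_MM_S):
        # Assumes constant sliding velocity over the whole stroke.
        duration_s = DISPLACEMENT_MM / velocity
        matrix.append({
            "mass_g": mass,
            "velocity_mm_s": velocity,
            "slide_duration_s": duration_s,
            "approx_samples_per_slide": round(duration_s * SAMPLING_RATE_HZ),
            "traces_kept": kept_per_set * SETS_PER_CONDITION,
        })
    return matrix


matrix = build_test_matrix()
total_traces = sum(row["traces_kept"] for row in matrix)
print(len(matrix), "conditions,", total_traces, "traces per surface")
```

      Under these assumptions, the slowest slides (5 mm s⁻¹) last 0.8 s and contain roughly 440 samples, while the fastest (45 mm s⁻¹) contain about 49, which gives a sense of how coarsely the fastest traces are sampled.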

      Added References (Page 13)

      M. Murai, H.-K. Lau, B. P. Pereira and R. W. H. Pho, J. Hand Surg., 1997, 22, 935–941.

      A. Abdouni, M. Djaghloul, C. Thieulin, R. Vargiolu, C. Pailler-Mattei and H. Zahouani, R. Soc. Open Sci., DOI:10.1098/rsos.170321.

      P.-H. Cornuault, L. Carpentier, M.-A. Bueno, J.-M. Cote and G. Monteil, J. R. Soc. Interface, DOI:10.1098/rsif.2015.0495.

      K. Qian, K. Traylor, S. W. Lee, B. Ellis, J. Weiss and D. Kamper, J. Biomech., 2014, 47, 3094–3099.

      Y. Yuan and R. Verma, Colloids Surf. B Biointerfaces, 2006, 48, 6–12.

      Y.-J. Fu, H. Qui, K.-S. Liao, S. J. Lue, C.-C. Hu, K.-R. Lee and J.-Y. Lai, Langmuir, 2010, 26, 4392–4399.

      Comment 3, Part 2

      “The real finger has multiple layers with different moduli. In fact, the stratum corneum cells, which are the outer layer at the interface and determine the friction, have a much higher modulus than PDMS.

      We have approximated the softness of the finger with 100 kPa crosslinked PDMS, which is close to what has been reported for the bulk of a human fingertip(8,9). However, as mentioned in the Materials and Methods, there are two additional features of the mock finger that impart greater strength. The PDMS surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus(10). Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking the surface of the finger(11), therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy(12). This technique is widely used in wearables(13), soft robotics(14), and microfluidics(15) to induce both these material changes. Additionally, the finger is used at least a day after UV-Ozone treatment is completed in order for the surface to return to moderate hydrophilicity, similar to the outermost layer of human skin(16).

      Comment 3, Part 3

      In addition, the slanted position of the finger can cause non-uniform pressures across the finger. Both can contribute to making the PDMS finger have much more stick-slip than a real finger.

      To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. Any additional stick-slip after this alignment step is caused by contact aging at the interface, but the first trace we collect is always discarded to only consider stick-slip events caused by surface chemistry. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this is also expected when humans freely explore a surface.
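      For scale, the mean contact pressure implied by the loading conditions in the Methods can be estimated with a short calculation. This is a sketch under stated assumptions: the load is taken as the applied mass plus the 6 g deadweight, standard gravity is assumed, and the pressure is treated as uniform over the 1×1 cm contact area, which the response itself notes is only approximately true. The function name is hypothetical.

```python
G_M_S2 = 9.81                    # standard gravity (assumed)
FINGER_DEADWEIGHT_G = 6.0        # deadweight of the mock finger, from the Methods
CONTACT_AREA_M2 = 0.01 * 0.01    # 1 x 1 cm initial contact area


def nominal_pressure_kpa(added_mass_g):
    """Mean contact pressure in kPa, assuming a uniform load over the contact area."""
    total_mass_kg = (added_mass_g + FINGER_DEADWEIGHT_G) / 1000.0
    normal_force_n = total_mass_kg * G_M_S2
    return normal_force_n / CONTACT_AREA_M2 / 1000.0


for added_mass in (0, 25, 75, 100):
    print(f"{added_mass:>3} g added -> {nominal_pressure_kpa(added_mass):.2f} kPa")
```

      This gives only the average value; any slant-induced non-uniformity would redistribute pressure around that mean without changing it.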

      Comment 3, Part 4

      In fact, if you look at the regime maps, there is very little space that has steady sliding. This does not represent well human exploration of surfaces. We do not tend to use a force and velocity that will cause extensive stick-slip (frequent regions of 100% stick-slip) and, in fact, the speeds used in the study are on the slow side, which also contributes to more stick-slip. At higher speeds and lower forces, all of the materials had steady sliding regions.

      We are not aware of published studies that extensively show that humans avoid stick-slip regimes. In fact, we are familiar with literature where stiction spike formation is suppressed: a recent paper by AliAbbasi, Basdogan et al. investigates electroadhesion and friction with NaCl solution-infused interfaces, resulting in significantly steadier forces(17). We also directly showed evidence of instability formation that we observed during human exploration in Fig. 3B-C. These dynamic events are common, despite the lack of control of normal forces and sliding velocities. We also note that Reviewer 1, Comment 2, was suggesting that we further explore possible trends from parameterizing the stiction spike.

      We note that many studies have not reached the velocities and masses required for stiction spikes, even though these masses and velocities would be routinely seen in free exploration; this is usually due to equipment constraints(18). Sliding events during human free exploration of surfaces can exceed 100 mm/s for rapid touches. However, for the surfaces investigated here, we observe that large regions of stick-slip can emerge at velocities as low as 5 mm/s depending on the applied load. The incidence of steady sliding appears more dependent on the applied mass, with almost no steady sliding observed at or above 75 g. Indeed, the force categorization along our transition zones is the main point of the paper.

      Comment 3, Part 5

      Further, on these very smooth surfaces, the friction and stiction are more complex and cannot dismiss considerations such as finger material property change with sweat pore occlusion and sweat capillary forces. Also, the vertical motion of both the PDMS finger and the instructed human subjects is not the motion that humans typically use to discriminate between surfaces.

      We did not describe the task sufficiently. Humans were only given the instruction to slide their finger along a single axis from top to bottom of a sample, not vertical as in azimuthal to gravity. We have updated our wording in the manuscript to reflect this.

      Our changes to the manuscript (Page 4)

      “Participants could touch for as long as they wanted, but were asked to only use their dominant index fingers along a single axis to better mimic the conditions for instability formation during mechanical testing with the mock finger.”

      (Page 11)

      “The participant was then asked to explore each sample simultaneously, and ran over each surface in strokes along a single axis until the participant could decide which of the two had “more friction”.”

      Comment 3, Part 6

      Finally, fingerprints may not affect the shape and size of the contact area, but they certainly do affect the dynamic response and detection of vibrations.

      We are aware of the nuance. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogeneous, elastic body to find corresponding trends (though there is no existing model of friction that can accurately model experiments on mesoscale friction)(7). The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions.

      This is also in the context that we are seeking to provide a reasonable and experimentally accessible method to characterize surfaces, which will always improve as we get closer to replicating a true human finger. But our goal here was to replicate the finger sufficiently for use in human studies. We believe the more appropriate metric of success is whether the mock finger is more successful than the traditional characterization experiments it replaces, such as friction coefficient, roughness, and surface energy.

      Comment 4

      This all leads to the critical question, why are friction, normal force, and velocity not measured during the measured human exploration and in a systematic study using the real human finger? The authors posed an extremely interesting hypothesis that humans may alter their speed to feel the instability transition regions. This is something that could be measured with a real finger but is not likely to be correlated accurately enough to match regime boundaries with such a simplified artificial finger.

      We are excited that our manuscript offers a tractable manner to test the hypothesis that tactile decision-making models use friction instabilities as evidence. However, we lay out the challenges and barriers, and how the scope of this paper will lead us in that direction. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments, and to raise awareness that the common methods of sample characterization in touch, by an average friction coefficient or roughness, are fundamentally unsound.

      In short, in our view, to further support our findings on instabilities would require answering:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision? (The need for a decision-making model)

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Test the hypothesis, in these models, that friction instabilities are evidence, and not some other unknown metric. This requires designing samples that vary in the amount of evidence generated, but this evidence cannot be controlled directly. Rather, the samples indirectly vary evidence by how likely it is for a human to generate different types of friction instabilities during standard exploration.

      (5) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we see cause confusion in participants, which will likely require accounting for memory effects.

      We elaborate these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we already can see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, the highest force, or the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount, and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent(1). This supports that the efference copy is important, and that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori; they must be found through experiments, especially since introducing, say, a bump or a trough would bring in confounding variables in how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of this manuscript.

      This is not our only goal, but in the context of human exploration, we believed it was important in this manuscript to identify a mechanical parameter that was consistent with how humans explore surfaces and that could also characterize some consistent property of a surface, irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish these conceptual sequences in a single manuscript.

      Reviewer 2 (Public review):

      Summary:

      In this paper, the authors want to test the hypothesis that frictional instabilities rather than friction are the main drivers for discriminating flat surfaces of different sub-nanometric roughness profiles.

      They first produced flat surfaces with 6 different coatings giving them unique and various properties in terms of roughness (picometer scale), contact angles (from hydrophilic to hydrophobic), friction coefficient (as measured against a mock finger), and Hurst exponent.

      Then, they used those surfaces in two different experiments. In the first experiment, they used a mock finger (PDMS of 100 kPa molded into a fingertip shape) and slid it over the surfaces at different normal forces and speeds. They categorized the sliding behavior as steady sliding, sticking spikes, and slow frictional waves by visual inspection, and show that the surfaces have different behaviors depending on normal force and speed. In a second experiment, participants (10) were asked to discriminate pairs of those surfaces. It is found that each of those pairs could be reliably discriminated by most participants.

      Finally, the participants' discrimination performance is correlated with differences in the physical attributes observed against the mock finger. The authors found a positive correlation between participants' performances and differences in the count of steady sliding against the mock finger and a negative correlation between participants' reaction time and differences in the count of stiction spikes against the mock finger. They interpret those correlations as evidence that participants use those differences to discriminate the surfaces.

      Strengths:

      The created surfaces are very interesting as they are flat at the nanometer scale, yet have different physical attributes and can be reliably discriminated.”

      We thank Reviewer 2 for their notes on our manuscript. The responses below address the reviewer’s comments and recommendations for revised work.

      Weaknesses:

      Comment 1

      In my opinion, the data presented in the paper do not support the conclusions. The conclusions are based on a correlation between results obtained on the mock finger and results obtained with human participants but there is no evidence that the human participants' fingertips will behave similarly to the mock finger during the experiment. Figure 3 gives a hint that the 3 sliding behaviors can be observed in a real finger, but does not prove that the human finger will behave as the mock finger, i.e., there is no evidence that the phase maps in Figure 1C are similar for human fingers and across different people that can have very different stiffness and moisture levels.

      The mechanical characterization conducted with the mock finger seeks to extract significant features of friction traces of a set of surfaces to use as predictors of tactile discriminability. The goal is to find a consistent method to characterize surfaces for use in tactile experiments that can be replicated by others and used prior to any human experiments. However, in the overall response and in a response to a similar comment by Reviewer 1, we also explain why we believe experiments on humans to establish this fact are not yet reasonable.

      Comment 2

      I believe that the authors collected the contact forces during the psychophysics experiments, so this shortcoming could be solved if the authors use the actual data, and show that the participant responses can be better predicted by the occurrence of frictional instabilities than by the usual metrics on a trial by trial basis, or at least on a subject by subject basis. I.e. Poor performers should show fewer signs of differences in the sliding behaviors than good performers.

      To fully implement this, a decision-making model is necessary because, as a counter example, a participant could have generated 10 swipes of SFW and 1 swipe of a Sp, but the Sp may have been the most important event for making a tactile decision. This type of scenario is not compatible with the analysis suggested — and similar counterpoints can be made for other types of seemingly straightforward analysis.

      While we are interested and actively working on this, the study here is critical to establish types of evidence for a future decision-making model. We know humans change their friction constantly during real exploration, so it is unclear which of these constantly changing values we should input into the decision making model, and the future challenges we anticipate are explained in Comment 1.

      Comment 3

      The sample size (10) is very small.

      We recognize that, with all factors being equal, this sample size is on the smaller end. However, we emphasize that the degree of control of samples is far above typical, with minimal variations in sample properties such as surface roughness, and every sample for every trial was pristine. Furthermore, sample preparation (> 300 individual wafers were used) and cost became limiting factors. Although not typically appropriate, and thus not included in the manuscript, a post-hoc power analysis for the 100 trials of the pair closest to chance, P4 (53% accuracy against a chance level of 33%), showed a power of 98.2%, suggesting that the study was appropriately powered.
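      One way such a post-hoc power figure can be computed is sketched below with an exact binomial test. The response does not state which test, sidedness, or significance level was used, so the one-sided exact test and alpha = 0.05 here are assumptions, as are the function names; the result therefore need not reproduce the 98.2% quoted above.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def exact_binomial_power(n, p_chance, p_observed, alpha=0.05):
    """Power of a one-sided exact binomial test of H0: p = p_chance.

    Returns (critical_count, power), where critical_count is the smallest
    number of correct responses whose upper-tail probability under H0 is
    at most alpha.
    """
    critical = next(
        k for k in range(n + 2) if upper_tail(k, n, p_chance) <= alpha
    )
    return critical, upper_tail(critical, n, p_observed)


# 100 trials, chance = 1/3 (three-alternative task), observed accuracy = 53%.
critical, power = exact_binomial_power(100, 1 / 3, 0.53)
print(f"reject H0 at >= {critical} correct; power = {power:.3f}")
```

      Because the rejection region is fixed under the chance hypothesis before the alternative is considered, the same function also reports the actual size of the test when `p_observed` is set equal to `p_chance`.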

      Reviewer 2 (Recommendations for the authors):

      Comment 1

      Differences in SS and Sp (Table 2) are NOT physical or mechanical differences but are obtained by counting differences in the number of occurrences of each sliding behavior. It is rather a weird choice.

      We disagree that differences in SS and Sp are not physical or mechanical, as these are well-established phenomena in the soft matter and tribology literature(19-21). These are known as “mechanical instabilities” and generated due to the effects of two physical phenomena: the elasticity of the finger (which is constant in our mechanical testing) and the friction forces present (which change per sample type). The motivation behind using these different shapes is that the instabilities, in some conditions, can be invariant to external factors like velocity. This would be quite advantageous for human exploration because, unlike friction coefficient, which changes with nearly any factor, including velocity and mass, the instabilities being invariant to velocity would mean that we are accurately characterizing a unique identifier of the surface even though velocity may be variable.

      This “weird choice” is the central innovation of this paper. This choice was necessary because we demonstrated that the common usage of friction coefficient is fundamentally flawed: we see that friction coefficient suggests that surfaces which are more different would feel more similar; indeed, the most distinctive surfaces would be two surfaces that are identical, which is clearly spurious. One potential explanation for why we were able to see this effect is that our surfaces have similar (< 0.6 nm variability) roughness, removing potential confounding factors, and this type of low roughness control has not been used in tactile studies to the best of our knowledge.

      Comment 2

      Figures 2B-C: why are the x-data different than Table 2?

      The x-data in Fig. 2B-C are the absolute differences in the number of occurrences measured for a given instability type or material property out of 144 pulls. Modeling the human participant results in our GLMMs required the independent variables to be in this form rather than percentages. We initially chose to list percent differences in Table 2 to highlight the ranges of differences instead of an absolute value, but have added both for clarity.

      Our changes to the manuscript (Page 7)

      “To determine if humans can detect these three different instabilities, we selected six pairs of surfaces to create a broad range of potential instabilities present across all three types. These are summarized in Table 2, where the first column for each instability is the difference in occurrence of that instability formed between each pair, and the second is the percent difference.”
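      The two forms of the predictor described above can be related with a short sketch. The counts below are hypothetical, and expressing the percent difference relative to the 144 pulls per surface is an assumption; the response does not spell out the normalization used in Table 2.

```python
TOTAL_PULLS = 144  # pulls per surface, as stated in the response


def occurrence_difference(count_a, count_b, total=TOTAL_PULLS):
    """Return (absolute difference, difference as a percentage of total pulls)."""
    absolute = abs(count_a - count_b)
    percent = 100.0 * absolute / total  # assumed normalization
    return absolute, percent


# Hypothetical counts of steady-sliding traces for the two surfaces in a pair.
abs_diff, pct_diff = occurrence_difference(90, 54)
print(abs_diff, pct_diff)
```

      The absolute count is the form used as the independent variable in the GLMMs, while the percentage conveys the range of differences across pairs.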

      Comment 3

      "We constructed a set of coated surfaces with physical differences which were imperceptible by touch but created different types of instabilities based on how quickly a finger is slid and how hard a human finger is pressed during sliding." Yet, in your experiment, participants could discriminate them, so this is incoherent.

      To clarify the point, macroscopic objects can differ in physical shape and in chemical composition. What we meant was that the physical differences, i.e., roughness, were below the limit reported by Skedung et al., so that without a coating participants would not be able to tell the surfaces apart(22). Therefore, the reason people could tell our surfaces apart was the chemical composition of the surface, and not any differences in roughness or physical effects like film stiffness (due to the molecular-scale thinness of the surface coatings, they are mechanically negligible). However, we concede that at the molecular scale, the traditional macroscopic distinction between physical and chemical is blurred.

      We have made minor revisions to the wording in the abstract. We clarify that the surface coatings had physical differences in roughness that were smaller than 0.6 nm, which, based purely on roughness, would not be expected to be distinguishable to participants. Therefore, the reason participants can tell these surfaces apart is differences in friction generated by chemical composition, and we were able to minimize contributions from physical differences in the samples in our study.

      Our changes to the manuscript (Page 1, Abstract)

      “We constructed a set of coated surfaces with minimal physical differences that by themselves, are not perceptible to people, but instead, due to modification in surface chemistry, the surfaces created different types of instabilities based on how quickly a finger is slid and how hard a human finger is pressed during sliding.”

      Reviewer 3 (Public review):

      Strengths:  

      The paper describes a new perspective on friction perception, with the hypothesis that humans are sensitive to the instabilities of the surface rather than the coefficient of friction. The paper is very well written and with a comprehensive literature survey.

      One of the central tools used by the author to characterize the frictional behavior is the frictional instabilities maps. With these maps, it becomes clear that two different surfaces can have both similar and different behavior depending on the normal force and the speed of exploration. It puts forward that friction is a complicated phenomenon, especially for soft materials.

      The psychophysics study is centered around an odd-one-out protocol, which has the advantage of avoiding any external reference to what would mean friction or texture for example. The comparisons are made only based on the texture being similar or not.

      The results show a significant relationship between the distance between frictional maps and the success rate in discriminating two kinds of surface.”

      We thank Reviewer 3 for their notes and interesting discussion points on our manuscript. Below, we address the reviewer’s feedback and comments on related works.

      Weaknesses:

      Comment 1

      The main weakness of the paper comes from the fact that the frictional maps and the extensive psychophysics study are not made at the same time, nor with the same finger. The frictional maps are produced with an artificial finger made out of PDMS which is a poor substitute for the complex tribological properties of skin.

      A similar comment was made by Reviewers 1 and 2 and parts are replicated below. We are not claiming that our PDMS fingers are superior to real fingers, but rather, we cannot establish standards in the field by using real human fingers that vary between subjects and researchers. We believe the mock finger we designed is a reasonable mimic of the human finger by matching surface energy, heterogeneous mechanical structure, and the ability to test multiple physiologically relevant pressures and sliding velocities.

      We achieve a heterogeneous mechanical structure with the 3 primary components of stiffness of a human finger. The effective modulus of ~100 kPa, from soft tissue(8,9), is obtained with a 30:1 ratio of PDMS to crosslinker. The PDMS also surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus(10). Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking, the surface of the finger(11), therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy(12). The finger is used at least a day after UV-Ozone treatment is completed in order for the surface to return to moderate hydrophilicity, similar to the outermost layer of human skin(16). We also discuss the shape of the contact formed. To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this variation is also expected when humans freely explore a surface. Finally, we consider flat vs. fingerprinted fingers. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogeneous, elastic body to find corresponding trends(7). The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions.

      We note that we have used the controlled mechanical data collected with this flat mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were predictive of human performance(3–6). Ultimately, we see from our prior work and here that, despite the drawbacks of our mock finger, it outperforms other standard characterization techniques in providing information about the mesoscale that correlates to tactile perception. We have added these details to the manuscript.

      We also note that an intermediate option, replicating real fingers, even in a mold, may also inadvertently limit trends from characterization to a specific finger. One of the main, and most severe, limitations of using a human finger is that all fingers are different, meaning any study focusing on a particular user may not apply to others or be recreated easily by other researchers. We cannot set a standard for replication around a real human finger as that participant may no longer be available, or willing to travel the world as a “standard”. Furthermore, the method in which a single person changes their pressures and velocities as they touch a surface is highly variable. As noted in the Summary Response, a study by Colgate et al. (IEEE ToH 2024) demonstrated that efference copies may be important, and thus constraining a human finger and replaying the forces recorded during free exploration will not lead to the participant identifying a surface with any consistency. Thus, it is important to allow humans to freely explore surfaces, but this creates nearly limitless variability in friction forces.

      This is also against the backdrop that we are seeking to provide a method to characterize surfaces, which will be aided as we get closer to replicating a true human finger. Indeed, the more features we replicate, the more successful the mechanical data will be in correlating to tactile distinguishability. But reasonably, our success would be in replacing traditional characterization experiments, not in recreating the forces of an arbitrary human finger.

      Our changes to the manuscript (Added, Page 2-3)

      “Mock finger as a characterization tool

      In this work, we use a mechanical setup with a PDMS mock finger to derive tactile predictors from controlled friction traces alternative to average friction coefficients. While there is a tradeoff in selecting a synthetic finger over a more accurate, real human finger in modeling touch, our aim to design a method of mesoscale surface characterization for more successful studies on tactile perception cannot be fulfilled using one human participant as a standard. We believe that with sufficient replication of surface and bulk properties as well as contact geometry, and controlled friction measurements collected at loading conditions observed during a tactile discrimination task, we can isolate unique frictional features of a set of surfaces that do not arise from human-to-human variability.

      The major component of a human finger, by volume, is soft tissue (~56%)(22), resulting in an effective modulus close to 100 kPa(23,24). In order to achieve this same softness, we crosslink PDMS in a 1×1×5 cm mold at a 30:1 elastomer:crosslinker ratio. However, two more features impart increased stiffness in a human finger. Most of this added rigidity is derived from the bone at the fingertip, the distal phalanx(23-25), which we mimic with an acrylic bone within our PDMS network. The stratum corneum, the stiffer, glassier outer layer of skin(26), is replicated with the surface of the mock finger glassified, or further crosslinked, after 8 hours of UV-Ozone treatment(27). This treatment also modifies the surface properties of the native PDMS to align with those of a human finger more closely. It minimizes the viscoelastic tack at the surface, resulting in a comparable non-sticky surface. At least one day after treatment, the finger surface returns to moderate hydrophilicity (~60º), as is typically observed for a real finger(28).

      The initial contact area formed before a friction trace is collected is a rectangle of 1×1 cm. While this shape is not entirely representative of a human finger with curves and ridges, human fingers flatten out enough to reduce the effects of curvature with even very light pressures(28-30). This implies that regardless of finger pressure, the contact area is largely load-independent, which is more accurately replicated with a rectangular mock finger. It is still a challenge to control pressure distribution with this planar interface, but non-uniform pressures are also expected during human exploration.

      Lastly, we consider fingerprints vs. flat fingers. A key finding of our previous work is that while fingerprints enhanced frictional dynamics at certain conditions, key features were still maintained with a flat finger(7). Furthermore, for some loading conditions, the more amplified signals could also result in more similar friction traces for different surfaces. We have continued to use flat fingers in our mechanical experiments, and have observed good agreement between these friction traces and human experiments(7,8,21,31).”

      (Page 3-4, Materials and Methods)

      “Mock Finger Preparation

      Friction forces across all six surfaces were measured using a custom apparatus with a polydimethylsiloxane (PDMS, Dow Sylgard 184) mock finger that mimics a human finger’s mechanical properties and contact mechanics while exploring a surface relatively closely(7,8). PDMS and crosslinker were combined in a 30:1 ratio to achieve a stiffness of 100 kPa comparable to a real finger, then degassed in a vacuum desiccator for 30 minutes. We are aware that the manufacturer recommended crosslinking ratio for Sylgard 184 is 10:1 due to potential uncrosslinked liquid residues(32), but further crosslinking concentrated at the surface prevents this. The prepared PDMS was then poured into a 1×1×5 cm mold also containing an acrylic 3D-printed “bone” to attach applied masses on top of the “fingertip” area contacting a surface during friction testing. After crosslinking in the mold at 60ºC for 1 hour, the finger was treated with UV-Ozone for 8 hours out of the mold to minimize viscoelastic tack.

      Mechanical Testing

      A custom device using our PDMS mock finger was used to collect macroscopic friction force traces replicating human exploration(7,8). After placing a sample surface on a stage, the finger was lowered at a slight angle such that an initial 1×1 cm rectangle of “fingertip” contact area could be established. We considered a broad range of applied masses (M = 0, 25, 75, and 100 g) added onto the deadweight of the finger (6 g) observed during a tactile discrimination task. The other side of the sensor was connected to a motorized stage (V-508 PIMag Precision Linear Stage, Physikinstrumente) to control both displacement (4 mm across all conditions) and sliding velocity (v = 5, 10, 25, and 45 mm s<sup>-1</sup>). Forces were measured at all 16 combinations of mass and velocity via a 250 g Futek force sensor (k = 13.9 kN m<sup>-1</sup>) threaded to the bone, and recorded at an average sampling rate of 550 Hz with a Keithley 7510 DMM digitized multimeter. Force traces were collected in sets of 4 slides, discarding the first due to contact aging. Because some mass-velocity combinations were near the boundaries of instability phase transitions, not all force traces at these given conditions exhibited similar profiles. Thus, three sets were collected on fresh spots for each condition to observe enough occurrences of multiple instabilities, at a total of nine traces per combination for each surface.”
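The bookkeeping in the Mechanical Testing description can be checked with a short sketch. The masses, velocities, set counts, and number of surfaces are taken from the text above; the function name, dictionary keys, and variable names are illustrative assumptions and are not part of the authors' analysis code.

```python
from itertools import product

# Values quoted in the Mechanical Testing description.
APPLIED_MASSES_G = [0, 25, 75, 100]   # added on top of the finger deadweight
VELOCITIES_MM_S = [5, 10, 25, 45]
FINGER_DEADWEIGHT_G = 6
SETS_PER_CONDITION = 3                # each set collected on a fresh spot
SLIDES_PER_SET = 4
DISCARDED_PER_SET = 1                 # first slide dropped due to contact aging
N_SURFACES = 6


def build_test_matrix():
    """Return one record per mass-velocity condition (hypothetical layout)."""
    kept = SETS_PER_CONDITION * (SLIDES_PER_SET - DISCARDED_PER_SET)
    return [
        {
            "applied_mass_g": mass,
            "total_mass_g": mass + FINGER_DEADWEIGHT_G,
            "velocity_mm_s": velocity,
            "kept_traces": kept,
        }
        for mass, velocity in product(APPLIED_MASSES_G, VELOCITIES_MM_S)
    ]


matrix = build_test_matrix()
traces_per_surface = sum(cond["kept_traces"] for cond in matrix)
total_traces = traces_per_surface * N_SURFACES
print(len(matrix), traces_per_surface, total_traces)  # prints "16 144 864"
```

Under these stated numbers, each surface contributes 144 retained traces, which is the denominator any per-surface instability percentage would be computed against if all conditions are pooled.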

      Added References (Page 13)

      M. Murai, H.-K. Lau, B. P. Pereira and R. W. H. Pho, J. Hand Surg., 1997, 22, 935–941.

      A. Abdouni, M. Djaghloul, C. Thieulin, R. Vargiolu, C. Pailler-Mattei and H. Zahouani, R. Soc. Open Sci., DOI:10.1098/rsos.170321.

      P.-H. Cornuault, L. Carpentier, M.-A. Bueno, J.-M. Cote and G. Monteil, J. R. Soc. Interface, DOI:10.1098/rsif.2015.0495.

      K. Qian, K. Traylor, S. W. Lee, B. Ellis, J. Weiss and D. Kamper, J. Biomech., 2014, 47, 3094–3099.

      Y. Yuan and R. Verma, Colloids Surf. B Biointerfaces, 2006, 48, 6–12.

      Y.-J. Fu, H. Qui, K.-S. Liao, S. J. Lue, C.-C. Hu, K.-R. Lee and J.-Y. Lai, Langmuir, 2010, 26, 4392–4399.

      Comment 2

      The evidence would have been much stronger if the measurement of the interaction was done during the psychophysical experiment. In addition, because of the protocol, the correlation is based on aggregates rather than on individual interactions.

      Our Response: We agree that this would have helped further establish our argument, but in the overall statement and in other reviewer responses, we describe the significant challenges to establishing this.

      To fully implement this, a decision-making model is necessary because, as a counter example, a participant could have generated 10 swipes of SFW and 1 swipe of a Sp, but the Sp may have been the most important event for making a tactile decision. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments.

      In short, in our view, to develop a decision-making model, the challenges are as follows:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision?

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Test the hypothesis, in these models, that friction instabilities are evidence, and not some other unknown metric.

      (5) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we see cause confusion in participants; such a task will likely require accounting for memory effects.

      (6) Design samples that vary in the amount of evidence generated, but this evidence cannot be controlled directly. Rather, the samples indirectly vary evidence by how likely it is for a human to generate different types of friction instabilities during standard exploration.

      We elaborate these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we can already see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, or the highest force. Or we may choose the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent(1). This supports that the efference copy is important, that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori. They must be found through experiments, especially since if we were to introduce, say, a bump or a trough, then we bring in confounding variables to how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of the current manuscript.
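The phrase "above chance" means different baselines in the two paradigms: guessing succeeds with probability 1/3 in a 3-AFC task and 1/2 in a 2-AFC task. A minimal sketch of an exact one-sided binomial check follows. The trial counts are hypothetical placeholders, not data from the study, and the authors' own statistics (GLMM fits) are not reproduced here.

```python
from math import comb


def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


CHANCE_3AFC = 1 / 3  # one odd sample among three
CHANCE_2AFC = 1 / 2  # one of two samples

# Hypothetical counts, for illustration only.
n_trials, n_correct = 30, 20

p_3afc = binomial_tail(n_correct, n_trials, CHANCE_3AFC)
p_2afc = binomial_tail(n_correct, n_trials, CHANCE_2AFC)
print(f"3-AFC p = {p_3afc:.2e}, 2-AFC p = {p_2afc:.2e}")
```

The same number of correct responses is stronger evidence against guessing in 3-AFC than in 2-AFC, because the chance baseline is lower.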

      This is not our only goal, but in the context of human exploration, in this manuscript, we believed it was important to identify a mechanical parameter that was consistent with how humans explore surfaces, but was also a parameter that could characterize some consistent property of a surface – irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish this conceptual sequence in a single manuscript.

      Comment 3

      The authors compensate with a third experiment where they used a 2AFC protocol and an online force measurement. But the results of this third study, fail to convince the relation.

      With this experiment, our central goal was to demonstrate that the instabilities we have identified with the PDMS finger also occur with a human finger. Several instances of SS, Sp, and SFW were recorded with this setup as a participant touched surfaces in real time.

      Comment 4

      No map of the real finger interaction is shown, bringing doubt to the validity of the frictional map for something as variable as human fingers.

      Real fingers change constantly during exploration, and friction is state-dependent, meaning that the friction will depend on how the person was moving the moment prior. Therefore, a map is only valid for a single human movement – even if participants all were instructed to take a single swipe and start from zero motion, humans are unable to maintain constant velocities and pressures. Clearly, this is not sustainable for any analysis, and these drawbacks apply to any measured parameter, whether instabilities suggested here, or friction coefficients used throughout. We believe the difficulty of this approach emphasizes why a standard map of characterization of a surface by a mock finger, even with its drawbacks, is a viable path forward.

      Reviewer 3 (Recommendations for the authors):

      Comment 1

      It would be interesting to comment on a potential connection between the frictional instability maps and Schallamach waves.

      Schallamach waves are a subset of slow frictional waves (SFW) and are very specifically defined. They are pockets of air that form between a soft sliding object and a rigid surface, and propagate rear-to-front (retrograde waves) as a soft object is slid and buckles due to adhesive pinning. Wrinkles form at the detached portion of the soft material, until the interface reattaches and the process repeats(23). There is typically a high burden of proof to establish a Schallamach wave over a more general slow frictional wave. We note that it would be exceedingly difficult to design samples that can reliably create subsets of SFW, but we are aware that this may be an interesting question at a future point in our work.

      Comment 2

      The force sensors look very compliant, and given the dynamic nature of the signal, it is important to characterize the frequency response of the system to make sure that the fluctuations are not amplified.

      Our Response: Thank you for noticing. We mistyped the sensor spring constant as 13.9 N m<sup>-1</sup> instead of kN m<sup>-1</sup>. However, below we show how the instabilities are derived from the mechanics at the interface due to the compliance of the finger. The “springs” of the force sensor and PDMS finger are connected in parallel. Since k<sub>sensor</sub> = 13.9 kN m<sup>-1</sup>, the spring constant of the system overall reflects the compliance of the finger, and highlights the oscillations arising solely from stick-slip. A sample calculation is shown below.

      Author response image 1.

      Fitting a line to the initial slope of the force trace for C6 gives the equation y = 25.679x – 0.2149. The slope here represents force data over time data, and is divided by the velocity (25 mm/s) to determine the spring constant of the system. This value is lower than k<sub>sensor</sub> = 13.9 kN/m, indicating that the “springs” representing the force sensor and PDMS finger are connected in parallel. The finger is the compliant component of the system, with k<sub>finger</sub> = 0.902 N/m, and of course, real human fingers are also compliant so this matches our goals with the design of the mock finger.

      Our changes to the manuscript (Page 4)

      (k = 13.9 kN m<sup>-1</sup>)

      Comment 3

      The authors should discuss about the stochastic nature of friction:

      Wiertlewski, Hudin, Hayward, IEEE WHC 2011

      Greenspon, McLellan, Lieber, Bensmaia, JRSI 2020

      We believe that, given the references, this comment on “stochastic” refers to the macroscopically-observable fluctuations (i.e., the mechanical “noise” which is not due to instrument noise) in friction arising from the discordant network of stick-slip phenomena occurring throughout the contact zone, and not the stochastic nature of nanoscale friction that arises from thermal fluctuations or from statistical distributions in bond breaking associated with soft contact.

      We first note that our small-scale fluctuations do not arise from a periodic surface texture that dominates in the frequency regime. However, even on our comparatively smooth surfaces, we do expect fluctuations due to nanoscale variation in contact, generation of stick-slip at microscale length scales that occurs either concurrently or discordantly across the contact zone, and the nonlinear dependence of friction on nearly any variation in state and composition(7).

      Perhaps the most relevant to the manuscript is that a major advantage of analysis by friction instabilities is that it sidesteps these ever-present microscale fluctuations, leading to more clearly defined classifiers or categories during analysis. Wiertlewski et al. showed that repeated measurements in their systems ultimately gave rise to consistent frequencies(24) (we think their system was in a steady sliding regime and the patterning gave rise to underlying macroscopic waves). These consistent frequencies, at least in soft systems and absent obvious macroscopic patterned features, would be expected to arise from the instability categories and we see them throughout.

      Comment 4

      It is stated that "we observed a spurious, negative correlation between friction coefficient and accuracy”.

      What makes you qualify that correlation as spurious?

      We mean this as in the statistical definition of “spurious”.

      This correlation would indicate that by the metric of friction coefficient, more different surfaces are perceived more similarly. Thus, two very different surfaces, like Teflon and sandpaper, by friction coefficient would be expected to feel very similar. Two nearly identical surfaces would be expected to feel very different – but of course, humans cannot consistently distinguish two identical surfaces. This finding is counterintuitive and refutes that friction coefficient is a reliable classifier of surfaces by touch. We do not think it is productive to determine a mechanism for a spurious correlation, but perhaps one reason we were able to observe this is because our study, to the best of our knowledge, is unique for having samples that are controlled in their physical differences in roughness and surface features.

      Our changes to the manuscript (Page 10)

      “To compare the value of looking at frictional instabilities, we also performed GLMM fits on common approaches in the field, like a friction coefficient or material property typically used in tactile discrimination, shown in Fig. 2D-E. Interestingly, in Fig. 2D, we observed a spurious, negative correlation between friction coefficient (typically and often problematically simplified as across all tested conditions) and accuracy (r = -0.64, p < 0.01); that is, the more different the surfaces are by friction coefficient, the less people can tell them apart. This spurious correlation would be the opposite of intuition, and further calls into question the common practice of using friction coefficients in touch-related studies. The alternative, two-term model which includes adhesive contact area for friction coefficient(29) was even less predictive (see Fig. S6A of SI). We believe such a correlation could not have been uncovered previously as our samples are minimal in their physical variations. Yet, the dynamic changes in force even within a single sample are not considered, despite being a key feature of mesoscale friction during human touch.

      We investigate different material properties in Fig. 2E. Differences in average roughness R<sub>a</sub> (or other parameters, like root mean square roughness R<sub>rms</sub>; Fig. S6A of SI) did not show a statistically significant correlation to accuracy. Though roughness is a popular parameter, correlating any roughness parameter to human performance here could be moot: the limit of detecting roughness differences has previously been defined as 13 nm on structured surfaces(33) and much higher for randomly rough surfaces(46), all of which are magnitudes larger than the roughness differences between our surfaces. The differences in contact angle hysteresis – as an approximation of the adhesion contributions(47) – do not present any statistically significant effects on performance.”

      Comment 5

      The authors should comment on the influence of friction on perceptual invariance. Despite inducing radically different frictional behavior for various conditions, these surfaces are stably perceived. Maybe this is a sign that humans extract a different metric?

      We agree – we are excited that frictional instabilities may offer a more stable perceptual cue because they are not prone to fluctuations (Recommendations for the authors, Comment 3) and instability formation, in many conditions, is invariant to applied pressures and velocities – thus forming large zones where a human may reasonably encounter a given instability.

      Raw friction is highly prone to variation during human exploration (in alignment with Recommendations for the authors, Comment 3), but ongoing work seeks to explain tactile constancy, or the ability to identify objects despite these large changes in force. Very recently published work by Fehlberg et al. identified the role of modulating finger speed and normal force in amplifying the differences in friction coefficient between materials in order to identify them(25), and we postulate that their work may be streamlined and consistent with the idea of friction instabilities, though we have not had a chance to discuss this in-depth with the authors yet.

      We think that the instability maps show a viable path forward to how surfaces are stably perceived, and instabilities themselves show a potential mechanism: mathematically, instabilities for given conditions can be invariant to velocity or mass, creating zones where a certain instability is encountered. This reduces the immense variability of friction to a smaller, more stable classification of surfaces (e.g., a 30% SS surface or a 60% SS surface). A given surface will typically produce the same instability at a specific condition (we found some boundaries are extremely condition sensitive, but many conditions are not), whereas a single friction trace which is highly prone to variation is not a stable metric.
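The reduction from many variable traces to a label such as "a 60% SS surface" can be sketched as a simple tally. The response does not spell out exactly how such percentages are computed, so the pooling over all conditions below is an assumption, as are the function name and the example labels. The abbreviations SS, Sp, and SFW are the instability categories used in the response.

```python
from collections import Counter


def instability_fractions(labels_by_condition):
    """Pool per-trace instability labels over all conditions.

    labels_by_condition maps (applied_mass_g, velocity_mm_s) to a list of
    labels, one per retained trace. Returns each label's share of all traces.
    """
    counts = Counter()
    for labels in labels_by_condition.values():
        counts.update(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Hypothetical labels for two conditions of one surface (nine traces each).
example_surface = {
    (0, 5): ["SS"] * 9,
    (25, 10): ["SS"] * 3 + ["SFW"] * 6,
}

fractions = instability_fractions(example_surface)
print({name: round(share, 3) for name, share in fractions.items()})
```

A categorical summary of this kind changes only when a trace crosses an instability boundary, which is why it is less sensitive to trace-to-trace force variation than an averaged friction coefficient.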

      Added References (Page 14)

      53 M. Fehlberg, E. Monfort, S. Saikumar, K. Drewing and R. Bennewitz, IEEE Trans. Haptics, 2024, 17, 957–963.

      References

      Z. Liu, J.-T. Kim, J. A. Rogers, R. L. Klatzky and J. E. Colgate, IEEE Trans. Haptics, 2024, 17, 441–450.

      D. Gueorguiev, S. Bochereau, A. Mouraux, V. Hayward and J.-L. Thonnard, Sci Rep, 2016, 6, 25553.

      C. W. Carpenter, C. Dhong, N. B. Root, D. Rodriquez, E. E. Abdo, K. Skelil, M. A. Alkhadra, J. Ramírez, V. S. Ramachandran and D. J. Lipomi, Mater. Horiz., 2018, 5, 70–77.

      A. Nolin, A. Licht, K. Pierson, C.-Y. Lo, L. V. Kayser and C. Dhong, Soft Matter, 2021, 17, 5050–5060.

      A. Nolin, K. Pierson, R. Hlibok, C.-Y. Lo, L. V. Kayser and C. Dhong, Soft Matter, 2022, 18, 3928–3940.

      Z. Swain, M. Derkaloustian, K. A. Hepler, A. Nolin, V. S. Damani, P. Bhattacharyya, T. Shrestha, J. Medina, L. Kayser and C. Dhong, J. Mater. Chem. B, DOI:10.1039/D4TB01646G.

      C. Dhong, L. V. Kayser, R. Arroyo, A. Shin, M. Finn, A. T. Kleinschmidt and D. J. Lipomi, Soft Matter, 2018, 14, 7483–7491.

      A. Abdouni, M. Djaghloul, C. Thieulin, R. Vargiolu, C. Pailler-Mattei and H. Zahouani, Royal Society Open Science, DOI:10.1098/rsos.170321.

      P.-H. Cornuault, L. Carpentier, M.-A. Bueno, J.-M. Cote and G. Monteil, Journal of The Royal Society Interface, DOI:10.1098/rsif.2015.0495.

      K. Qian, K. Traylor, S. W. Lee, B. Ellis, J. Weiss and D. Kamper, J Biomech, 2014, 47, 3094–3099.

      Y.-J. Fu, H. Qui, K.-S. Liao, S. J. Lue, C.-C. Hu, K.-R. Lee and J.-Y. Lai, Langmuir, 2010, 26, 4392–4399.

      Y. Yuan and R. Verma, Colloids Surf B Biointerfaces, 2006, 48, 6–12.

      G. Yu, J. Hu, J. Tan, Y. Gao, Y. Lu and F. Xuan, Nanotechnology, 2018, 29, 115502.

      L. Zheng, S. Dong, J. Nie, S. Li, Z. Ren, X. Ma, X. Chen, H. Li and Z. L. Wang, ACS Appl. Mater. Interfaces, 2019, 11, 42504–42511.

      K. Ma, J. Rivera, G. J. Hirasaki and S. L. Biswal, Journal of Colloid and Interface Science, 2011, 363, 371–378.

      A. Mavon, H. Zahouani, D. Redoules, P. Agache, Y. Gall and Ph. Humbert, Colloids and Surfaces B: Biointerfaces, 1997, 8, 147–155.

      E. AliAbbasi, M. Muzammil, O. Sirin, P. Lefèvre, Ø. G. Martinsen and C. Basdogan, IEEE Trans. Haptics, 2024, 17, 841–849.

      G. Corniani, Z. S. Lee, M. J. Carré, R. Lewis, B. P. Delhaye and H. P. Saal, eLife, DOI:10.7554/eLife.93554.1.

      J. N. Israelachvili, Intermolecular and Surface Forces, Academic Press, 2011.

      S. Das, N. Cadirov, S. Chary, Y. Kaufman, J. Hogan, K. L. Turner and J. N. Israelachvili, J R Soc Interface, 2015, 12, 20141346.

      B. N. J. Persson, O. Albohr, C. Creton and V. Peveri, The Journal of Chemical Physics, 2004, 120, 8779–8793.

      L. Skedung, M. Arvidsson, J. Y. Chung, C. M. Stafford, B. Berglund and M. W. Rutland, Sci Rep, 2013, 3, 2617.

      K. Viswanathan, N. K. Sundaram and S. Chandrasekar, Soft Matter, 2016, 12, 5265–5275.

      M. Wiertlewski, C. Hudin and V. Hayward, in 2011 IEEE World Haptics Conference, 2011, pp. 25–30.

      M. Fehlberg, E. Monfort, S. Saikumar, K. Drewing and R. Bennewitz, IEEE Transactions on Haptics, 2024, 17, 957–963.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript discusses the role of phosphorylated ubiquitin (pUb) by PINK1 kinase in neurodegenerative diseases. It reveals that elevated levels of pUb are observed in aged human brains and those affected by Parkinson's disease (PD), as well as in Alzheimer's disease (AD), aging, and ischemic injury. The study shows that increased pUb impairs proteasomal degradation, leading to protein aggregation and neurodegeneration. The authors also demonstrate that PINK1 knockout can mitigate protein aggregation in aging and ischemic mouse brains, as well as in cells treated with a proteasome inhibitor. While this study provided some interesting data, several important points should be addressed before being further considered.

      Strengths:

      (1) Reveals a novel pathological mechanism of neurodegeneration mediated by pUb, providing a new perspective on understanding neurodegenerative diseases.

      (2) The study covers not only a single disease model but also various neurodegenerative diseases such as Alzheimer's disease, aging, and ischemic injury, enhancing the breadth and applicability of the research findings.

      Weaknesses:

      (1) PINK1 has been reported as a kinase capable of phosphorylating Ubiquitin, hence the expected outcome of increased p-Ub levels upon PINK1 overexpression. Figures 5E-F do not demonstrate a significant increase in Ub levels upon overexpression of PINK1 alone, whereas the evident increase in Ub expression upon overexpression of S65A is apparent. Therefore, the notion that increased Ub phosphorylation leads to protein aggregation in mouse hippocampal neurons is not yet convincingly supported.

      Indeed, overexpression of sPINK1* alone caused little change in Ub levels in the soluble fraction (Figure 5E), which is expected. Ub in the soluble fraction is in a relatively stable, buffered state. However, overexpression of sPINK1* resulted in an increase in Ub levels in the insoluble fraction, indicating protein aggregation. The molecular weight of Ub in the insoluble fraction was predominantly below 70 kDa, implying that phosphorylation inhibits Ub chain elongation.

      To further examine this, we used the Ub/S65A mutant to antagonize Ub phosphorylation, and found that the aggregation at low molecular weight was significantly reduced, indicating a partial restoration of proteasomal activity. The increase in Ub levels in both the soluble and insoluble fractions likely results from the high rate of ubiquitination driven by the elevated levels of Ub. Notably, the overexpressed Ub/S65A was detected in the Western blot using the wild-type Ub antibody, which accounts for the apparently increased Ub level.

      When overexpressing Ub/S65E, we again saw an increase in Ub levels in the insoluble fraction (but no increase in the soluble fraction), with low molecular weight bands even more prominent than those observed with sPINK1* transfection. These findings collectively support the conclusion that sPINK1* promotes protein aggregation through Ub phosphorylation.

      (2) The specificity of PINK1 and p-Ub antibodies requires further validation, as a series of literature indicate that the expression of the PINK1 protein is relatively low and difficult to detect under physiological conditions.

      We acknowledge the challenges in achieving optimal specificity for commercially available and custom-generated antibodies targeting PINK1 and pUb, particularly given the low endogenous levels of these proteins under physiological conditions. Despite these limitations, we observed robust immunofluorescent staining for PINK1 (Figures 1A, 1C, and 1G) and pUb (Figures 1B, 1D, and 1G) in human brain samples from Alzheimer's disease (AD) patients, as well as in mouse brains from models of AD and cerebral ischemia. The significant elevation of PINK1 and pUb under these pathological conditions likely accounts for the clear visualization. To validate antibody specificity, we have included images from pink1-/- mice as negative controls in the revised manuscript (Figure 1C and 1D, third panel).

      In addition, we detected a significant increase in pUb levels in aged mouse brains compared to young ones (Figures 1E and 1F). Notably, in pink1-/- mice, pUb levels remained unchanged between young and aged groups, despite some background signal, further supporting the conclusion that pUb accumulation during aging is PINK1-dependent.

      In HEK293 cells, pink1-/- cells served as a negative control for PINK1 (Figure 2B and 2C) and for pUb (Figure 2D and 2E). While the Western blot using the pUb antibody displayed some nonspecific background, pUb levels in pink1-/- cells remained unchanged across all MG132 treatment conditions (Figures 2D and 2E), further attesting to the reliability of our findings.

      (3) In Figure 6, relying solely on Western blot staining and Golgi staining under high magnification is insufficient to prove the impact of PINK1 overexpression on neuronal integrity and cognitive function. The authors should supplement their findings with immunostaining results for MAP2 or NeuN to demonstrate whether neuronal cells are affected.

      Thank you for raising this important point. We included NeuN immunofluorescent staining in Figure 5—figure supplement 2 of the original manuscript. The results demonstrate a significant loss of NeuN-positive cells in the hippocampus following Ub/S65E overexpression, while no apparent change in NeuN-positive cells was observed with sPINK1* transfection alone. These findings provide evidence of neuronal loss in response to Ub/S65E, further supporting the impact of pUb elevation on neuronal integrity.

      While we did not perform MAP2 immunostaining, we included complementary analyses to assess neuronal integrity. Specifically, we performed Western blotting to determine MAP2 protein levels and used Golgi staining to study neuronal morphology and synaptic structure in greater detail. These analyses revealed that overexpression of sPINK1* or Ub/S65E decreased MAP2 levels and caused damage to synaptic structures (Figures 6F and 6H). Importantly, the deleterious effects of sPINK1* overexpression could be rescued by co-expression of Ub/S65A, further underscoring the role of pUb in mediating these changes.

      Together, our NeuN immunostaining, MAP2 analysis, and Golgi staining provide strong evidence for the impact of PINK1 overexpression and pUb elevation on neuronal integrity and synaptic health. We believe these complementary approaches sufficiently address the reviewer’s concern and highlight the pathological consequences of elevated pUb levels.

      (4) The authors should provide more detailed figure captions to facilitate the understanding of the results depicted in the figures.

      Figure captions will be updated with more details in the revised manuscript.

      (5) While the study proposes that pUb promotes neurodegeneration by affecting proteasomal function, the specific molecular mechanisms and signaling pathways remain to be elucidated.

      The specific molecular mechanisms and signaling pathways through which pUb promotes neurodegeneration are likely multifaceted and interconnected. Mitochondrial dysfunction appears to be a central contributor to neurodegeneration following sPINK1* overexpression. This is supported by (1) an observed increase in full-length PINK1, indicative of impaired mitochondrial quality control, and (2) proteomic data revealing enhanced mitophagy at 30 days post-transfection and substantial mitochondrial injury by 70 days post-transfection. The progressive damage to mitochondria caused by protein aggregates can cause further neuronal injury and degeneration.

      In addition, reduced proteasomal activity may result in the accumulation of inhibitory proteins that are normally degraded by the ubiquitin-proteasome system. Our proteomics analysis identified a >54-fold increase in CamK2n1 (UniProt ID: Q6QWF9), an endogenous inhibitor of CaMKII activation, following sPINK1* overexpression. This is particularly significant because the accumulation of CamK2n1 could suppress CaMKII activation and, subsequently, inhibit the CREB signaling pathway (illustrated below). As CREB is essential for synaptic plasticity and neuronal survival, its inhibition may further amplify neurodegenerative processes.

      While our study identifies proteasomal dysfunction and mitochondrial damage as key initial triggers, downstream effects—such as disruptions in signaling pathways like CaMKII-CREB—likely contribute to a broader cascade of pathological events. These findings highlight the complexity of pUb-mediated neurodegeneration and suggest that further exploration of downstream mechanisms is necessary to fully elucidate the pathways involved.

      In the revised manuscript, we plan to include proteomics data from mouse brain tissues at 30 and 70 days post-transfection to further highlight this downstream effect of proteasomal dysfunction.

      Author response image 1.

      Reviewer #2 (Public review):

      Summary:

      The manuscript makes the claim that pUb is elevated in a number of degenerative conditions including Alzheimer's Disease and cerebral ischemia. Some of this is based on antibody staining which is poorly controlled and difficult to accept at this point. They confirm previous results that a cytosolic form of PINK1 accumulates following proteasome inhibition and that this can be active. Accumulation of pUb is proposed to interfere with proteostasis through inhibition of the proteasome. Much of the data relies on over-expression and there is little support for this reflecting physiological mechanisms.

      Weaknesses:

      The manuscript is poorly written. I appreciate this may be difficult in a non-native tongue, but felt that many of the problems are organisational. Less data of higher quality, better controls and incision would be preferable. Overall the referencing of past work is lamentable.

      Methods are also very poor and difficult to follow. Until technical issues are addressed I think this would represent an unreliable contribution to the field.

      (1) Antibody specificity and detection under pathological conditions

      We acknowledge the limitations of commercially available antibodies for detecting PINK1 and pUb. Despite these challenges, our findings demonstrate a significant increase in PINK1 and pUb levels under pathological conditions, such as Alzheimer's disease (AD) and ischemia. Additionally, we observed an increase in pUb level during brain aging, further highlighting its relevance in this particular physiological process. To ensure reliable quantification of PINK1 and pUb levels, we used pink1-/- mice and HEK293 cells as negative controls. For example, PINK1 levels were extremely low in control cells but increased dramatically after 2 hours of oxygen-glucose deprivation (OGD) and 6 hours of reperfusion (Figure 1H). Together, these controls validate that the observed elevations in PINK1 and pUb are specific and linked to pathological or certain physiological conditions.

      (2) Overexpression as a model for pathological conditions

      To investigate whether the inhibitory effects of sPINK1* on the ubiquitin-proteasome system (UPS) are dependent on its kinase activity, we utilized a kinase-dead version of sPINK1* as a negative control. Since PINK1 has multiple substrates, we further explored whether its effects on UPS inhibition were mediated specifically by ubiquitin phosphorylation. For this, we used Ub/S65A (a phospho-null mutant) to antagonize Ub phosphorylation by sPINK1*, and Ub/S65E (a phospho-mimetic mutant) to mimic phosphorylated Ub. These well-defined controls ensured the robustness of our conclusions.

      While overexpression does not perfectly replicate physiological conditions, it serves as a valuable model for studying pathological scenarios such as neurodegeneration and brain aging, where pUb levels are known to increase. For example, we observed a 30.4% increase in pUb levels in aged mouse brains compared to young brains (Figure 1F). Similarly, in our sPINK1* overexpression model, pUb levels increased by 43.8% and 59.9% at 30 and 70 days post-transfection, respectively, compared to controls (Figures 5A and 5C). Notably, co-expression of sPINK1* with Ub/S65A almost entirely prevented sPINK1* accumulation (Figure 5B), indicating that an active UPS can efficiently degrade sPINK1*. Collectively, these findings show that sPINK1* accumulation inhibits UPS activity, a defect that can be rescued by the phospho-null Ub mutant. Thus, this overexpression model closely mimics pathological conditions and offers valuable insights into pUb-mediated proteasomal dysfunction.

      (3) Organization of the manuscript

      We believe the structure of the manuscript is justified and systematically addresses the key aspects of the study in a logical flow:

      (a) Evidence for the increase of PINK1 and pUb in multiple pathological and physiological conditions.

      (b) Identification of the sources and consequences of sPINK1 and pUb elevation.

      (c) Mechanistic insights into how pUb inhibits UPS-mediated degradation.

      (d) Validation of these findings using pink1-/- mice and cells.

      (e) Evidence of the reciprocal relationship between proteasomal inhibition and pUb elevation, culminating in neurodegeneration.

      (f) Demonstration of elevated pUb levels and protein aggregation in the hippocampus following sPINK1* overexpression, supported by proteomic analyses, behavioral tests, Western blotting, and Golgi staining.

      Thus, this organization provides a clear and cohesive narrative, culminating in the demonstration that sPINK1* overexpression induces hippocampal neuron degeneration.

      (4) Revisions to writing, referencing, and methodology

      We will improve the clarity and flow of the manuscript, add more references to properly acknowledge prior work, and incorporate additional details into the Methods section to enhance readability and reproducibility. These improvements should address the organizational and technical concerns raised, while strengthening the overall quality of the manuscript.

      Reviewer #3 (Public review):

      Summary:

      This study aims to explore the role of phosphorylated ubiquitin (pUb) in proteostasis and its impact on neurodegeneration. By employing a combination of molecular, cellular, and in vivo approaches, the authors demonstrate that elevated pUb levels contribute to both protective and neurotoxic effects, depending on the context. The research integrates proteasomal inhibition, mitochondrial dysfunction, and protein aggregation, providing new insights into the pathology of neurodegenerative diseases.

      Strengths:

      - The integration of proteomics, molecular biology, and animal models provides comprehensive insights.

      - The use of phospho-null and phospho-mimetic ubiquitin mutants elegantly demonstrates the dual effects of pUb.

      - Data on behavioral changes and cognitive impairments establish a clear link between cellular mechanisms and functional outcomes.

      Weaknesses:

      - While the study discusses the reciprocal relationship between proteasomal inhibition and pUb elevation, causality remains partially inferred.

      The reciprocal cycle between proteasomal inhibition and pUb elevation can be initiated by various factors that impair proteasomal activity. These factors include Aβ accumulation, ATP depletion, reduced expression of proteasome components, and covalent modifications of proteasomal subunits—all well-established contributors to the progressive decline in proteasome function. Once initiated, this cycle would become self-perpetuating, with the accumulation of sPINK1 and pUb driving a feedback loop of deteriorating proteasomal activity.

      In the current study, this reciprocal relationship between sPINK1/pUb elevation and proteasomal dysfunction is depicted in Figure 4A. Our results demonstrate that increased sPINK1 or PINK1 levels, such as through overexpression, can initiate this cycle. Crucially, co-expression of Ub/S65A effectively rescues the cells from this cycle, highlighting the pivotal role of pUb in driving proteasomal inhibition and establishing causality in this relationship. At the animal level, pink1 knockout could prevent protein aggregation upon aging and cerebral ischemia (Figures 1E and 1G).

      Mitochondrial injury is a likely source of elevated PINK1 and pUb levels. A recent study showed that efficient mitophagy is necessary to prevent pUb accumulation (bioRxiv 2023.02.14.528378), suggesting that mitochondrial damage can trigger this cycle. In another study (bioRxiv 2024.07.03.601901), the authors found that mitochondrial damage could enhance PINK1 transcription, further increasing cytoplasmic PINK1 levels and exacerbating the cycle.

      - The role of alternative pathways, such as autophagy, in compensating for proteasomal dysfunction is underexplored.

      Elevated sPINK1 has been reported to enhance autophagy (Autophagy 2016, 12: 632-647), potentially compensating for the impaired UPS. One mechanism involves the phosphorylation of p62 by sPINK1, which enhances autophagy activity. In our study, we did observe increased autophagic activity upon sPINK1* overexpression, as shown in Figure 2I (middle panel, without BALA). This increased autophagy may help degrade ubiquitinated proteins induced by puromycin, partially compensating for the proteasomal dysfunction.

      This compensation might explain why protein aggregation increased only slightly, though significantly, at 70 days after sPINK1* transfection (Figure 5F). Additionally, we observed a slight, though not statistically significant, increase in LC3II levels in the hippocampus of mouse brains at 70 days after sPINK1* transfection (Figure 5—figure supplement 6), further supporting the notion of autophagy activation.

      However, while autophagy may provide some compensation, its effect is likely limited. Autophagy and UPS differ significantly in their roles and mechanisms of degradation. Autophagy is a bulk degradation pathway that is generally non-selective, targeting long-lived proteins, damaged organelles, and intracellular pathogens. In contrast, the UPS is highly selective, primarily degrading short-lived regulatory proteins, misfolded proteins, and proteins tagged for degradation.

      Together, we found that sPINK1* overexpression enhanced autophagy-mediated protein degradation while simultaneously impairing UPS-mediated degradation. This suggests that while autophagy may provide partial compensation for proteasomal dysfunction, it is not sufficient to fully counterbalance the selective degradation functions of the UPS.

      - The immunofluorescence images in Figure 1A-D lack clarity and transparency. It is not clear whether the images represent human brain tissue, mouse brain tissue, or cultured cells. Additionally, the DAPI staining is not well-defined, making it difficult to discern cell nuclei or staging. To address these issues, lower-magnification images that clearly show the brain region should be provided, along with improved DAPI staining for better visualization. Furthermore, the Results section and Figure legends should explicitly indicate which brain region is being presented. These concerns raise questions about the reliability of the reported pUb levels in AD, which is a critical aspect of the study's findings.

      We will include low-magnification images in the supplementary figures of the revised manuscript to provide a broader context for the immunofluorescence data presented in Figure 1. DAPI staining at higher magnifications will also be provided to improve visualization of cell nuclei and overall tissue structure. Additionally, we will indicate the brain regions examined in the corresponding figure legends, and incorporate more details in the Results section to provide clearer descriptions of the samples and brain regions analyzed.

      The human brain samples presented in Figure 1 are from the cingulate gyrus region of Alzheimer's disease (AD) patients. Our analysis revealed that PINK1 is primarily localized within cell bodies, while pUb is more abundant around Aβ plaques, likely in nerve terminals. These additional clarifications and supplementary figures should provide greater transparency and improve the reliability of our findings.

      - Figure 4B should also indicate which brain region is being presented.

      The images were taken from layers III-IV of the mouse neocortex; this information will be added to the figure legend in the revised manuscript.

    1. Author Response:

      This work presents valuable information about the specificity and promiscuity of toxic effector and immunity protein pairs. The evidence supporting the claims of the authors is currently incomplete, as there is concern about the methodology used to analyze protein interactions, which did not take potential differences in expression levels, protein folding, and/or transient interaction into account. Other methods to measure the strength of interactions and structural predictions would improve the study. The work will be of interest to microbiologists and biochemists working with toxin-antitoxin and effector-immunity proteins.

      We thank the reviewers for considering this manuscript. We agree that this manuscript provides a valuable and cross-discipline introduction to new EI pair protein families where we focus on the EI pair’s flexibility and impacts on community structure. As such, we believe we have provided a solid foundation for future studies to examine non-cognate interactions and their possible effects on microbial communities. This, by definition, leaves some areas “incomplete” and, therefore, open for further investigations. While the methods we show do take into account potential differences in binding assays, we will more explicitly address how “expression, protein folding, and/or transient binding” may play into this expanded EI pair model upon revision and temper the discussion of the proposed model. We have responded to the reviewers’ public comments (italicized below).

      Public Reviews:

      Note: Reviewer 1, who appeared to focus on a subset of the manuscript rather than the whole, based their comments on several inaccuracies, which we discuss below. We found the tone in this reviewer's comments to be, at times, inappropriate, e.g., using "harsh" and "simply too drastic" to imply that common structure-function analyses were outside of the field-standard methods. We also note that the reviewer took a somewhat atypical step in reviewing this manuscript by running and analyzing the potential protein-complex data in AlphaFold2 but did not discuss areas of low confidence within that model that may contradict their conclusions. We are concerned their approach muddled valid scientific criticisms with problematic conclusions.

      Reviewer #1 (Public Review):

      In this manuscript, Knecht, Sirias et al describe toxin-immunity pair from Proteus mirabilis. Their observations suggest that the immunity protein could protect against non-cognate effectors from the same family. They analyze these proteins by dissecting them into domains and constructing chimeras which leads them to the conclusion that the immunity can be promiscuous and that the binding of immunity is insufficient for protective activity.

      Strengths:

      The manuscript is well written and the data are very well presented and could be potentially interesting. The phylogenetic analysis is well done, and provides some general insights.

      Weaknesses:

      1) Conclusions are mostly supported by harsh deletions and double hybrid assays. The latter assays might show binding, but this method is not resolutive enough to report the binding strength. Proteins could still bind, but the binding might be weaker, transient, and out-competed by the target binding.

      The phrasing of structure-function analyses as “harsh” is a bit unusual, as other research groups regularly use deletions and hybrid studies. Given the known caveats to deletion and domain substitutions, we included point-mutation analyses for both the effector and immunity proteins, as found on lines 105 - 113 and 255 - 261 in the current manuscript. These caveats are also why we coupled the in vitro binding analyses with in vivo protection experiments in two distinct experimental systems (E. coli and P. mirabilis). Based on this manuscript’s introductory analysis (where we define and characterize the genes, proteins, interactions, phylogenetics, and incidences in human microbiomes), the next apparent questions are beyond the scope of this study. Future approaches would include analyzing purified proteins from these effector (E) and immunity (I) protein families using methods such as X-ray crystallography and circular dichroism spectroscopy, among others.

      (Interestingly, most papers in the EI field do not measure EI protein affinity (Jana et al., 2019, Yadav et al., 2021). Notable exceptions are earlier colicin research (Wallis et al., 1995) and a new T6SS EI paper (Bosch et al., 2023) published as we submitted this manuscript.)

      2) While the authors have modeled the structure of toxin and immunity, the toxin-immunity complex model is missing. Such a model allows alternative, more realistic interpretation of the presented data. Firstly, the immunity protein is predicted to bind contributing to the surface all over the sequence, except the last two alpha helices (very high confidence model, iPTM>0.8). The N terminus described by the authors contributes one of the toxin-binding surfaces, but this is not the sole binding site. Most importantly, other parts of the immunity protein are predicted to interact closer to the active site (D-E-K residues). Thus, based on the AlphaFold model, the predicted mechanism of immunization remains physically blocking the active site. However, removing the N terminal part, which contributes large interaction surface will directly impact the binding strength. Hence, the toxin-immunity co-folding model suggests that proper binding of immunity, contributed by different parts of the protein, is required to stabilize the toxin-immunity complex and to achieve complete neutralization. Alternative mechanisms of neutralization might not be necessary in this case and are difficult to imagine for a DNAse.

      In response to the reviewer’s comment, we again reviewed the RdnE-RdnI AlphaFold2 complex predictions with the most updated version of ColabFold (1.5.2-patch with PDB100 and MMseq2) and have included them at the end of the responses [1].

      However, the literature reports that computational predictions of E-I complexes often do not match experimental structural results (Hespanhol et al., 2022, Bosch et al., 2023). As such, we chose not to include the predicted cognate and non-cognate RdnE-I complexes from ColabFold (which uses AlphaFold2) and will not include this data in revised manuscripts. (It is notable that reviewer 1 found the proposed expanded model and research so interesting as to directly input and examine the AI-predicted RdnE-RdnI protein interactions in AlphaFold2.)

      Discussion of the prevailing toxin-immunity complex model is in the introduction (lines 45-48) and Figure 5E. Further, there are various known mechanisms for neutralizing nucleases and other T6SS effectors, which we briefly state in the discussion (lines 359 - 361). More in-depth, these molecular mechanisms include active-site blocking (Benz et al., 2012), allosteric-site binding (Kleanthous et al., 1999 and Lu et al., 2014), enzymatic neutralization of the target (Ting et al., 2021), and structural disruption of both the active and binding sites (Bosch et al., 2023). Given this diversity of mechanisms, we did not presume to speculate on the as-of-yet unknown mechanism of RdnI protection.

      3) Dissection of a toxin into two domains is also not justified from a structural point of view, it is probably based on initial sequence analyses. The N terminus (actually previously reported as Pone domain in ref 21) is actually not a separate domain, but an integral part of the protein that is encased from both sides by the C terminal part. These parts might indeed evolve faster since they are located further from the active site and the central core of the protein. I am happy to see that the chimeric toxins are active, but regarding the conservation and neutralization, I am not surprised, that the central core of the protein fold is highly conserved. However, "deletion 2" is quite irrelevant - it deletes the central core of the protein, which is simply too drastic to draw any conclusions from such a construct - it will not fold into anything similar to an original protein, if it will fold properly at all.

      The reviewer’s comment highlights why we turned to the chimera proteins to dissect the regions of RdnE (formerly IdrD-CT), as the deletions could result in misfolded proteins. (We initially examined RdnE in the years before the launch of AlphaFold2.) However, the reviewer is incorrect regarding the N-terminus of RdnE. The PoNe domain, while also a subfamily of the PD-(D/E)XK superfamily, forms a distinct clade of effectors from the PD-(D/E)XK domain in RdnE (formerly IdrD-CT) as seen in Hespanhol et al., 2022; this is true for other DNAse effectors as well. Many studies analyzing effectors within the PD-(D/E)XK superfamily only focus on the PD-(D/E)XK domain, removing just this domain from the context of the whole protein (Hespanhol et al., 2022; Jana et al., 2019). Of note, in RdnE, this region alone (containing the DNA-binding domain) is insufficient for DNAse activity (unlike in PoNe).

      4) Regarding the "promiscuity" there is always a limit to how similar proteins are, hence when cross-neutralization is claimed authors should always provide sequence similarities. This similarity could also be further compared in terms of the predicted interaction surface between toxin and immunity.

      Reviewer 1 points out a fundamental property of protein-protein interactions that has been isolated away from the impacts of such interactions on bacterial community structure. We have provided the whole protein alignments in supplemental figure 3, the summary images in Figure 3D, and the protein phylogenetic trees in Figure 3C. We encourage others to consider the protein alignments, as percent amino acid sequence similarity is not necessarily a good gauge for protein function and interactions. RuBisCo is one example of how protein sequence similarity can be small while functions remain highly conserved. These data are publicly available on the OSF website associated with this manuscript https://osf.io/scb7z/, and we hope the community explores the data there.

      In consideration of the enthusiasm to deeply dive into the primary research data, we have included the pairwise sequence identities across the entire proteins here: Proteus RdnI vs. Rothia RdnI: 23.6%; Proteus RdnI vs. Prevotella RdnI: 16.3%; Proteus RdnI vs. Pseudomonas RdnI: 14.6%; Rothia RdnI vs. Prevotella RdnI: 22.4%; Rothia RdnI vs. Pseudomonas RdnI: 17.6%; Prevotella RdnI vs. Pseudomonas RdnI: 19.5%. (As stated in response to reviewer 1 comment 2, we do not find it appropriate to make inferences based on AlphaFold2-predicted protein complexes.)
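
      For readers who want to recompute such values from the alignments on OSF, the following is a minimal sketch of one common way to derive pairwise percent identity from two rows of an alignment. It is an illustration only: the response does not state which identity definition or tool was used, so the choice of denominator (columns where neither sequence has a gap) and the function name `percent_identity` are assumptions, and the toy sequences are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of the same alignment.

    Assumption: identity = matching columns / columns where neither row
    has a gap. Other definitions (e.g. dividing by alignment length or by
    the shorter sequence) give different numbers.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns with a residue in both rows
    matches = 0   # compared columns where the residues are identical
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy example with invented sequences (not RdnI data).
toy_identity = percent_identity("MKT-LV", "MRTALV")
print(f"{toy_identity:.1f}%")
```

      Because the result depends on the chosen denominator, values computed this way may differ slightly from those quoted above.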

      Overall, it looks more like a regular toxin-immunity couple, where some cross-reactions with homologues are possible, depending on how far the sequences have deviated. Nevertheless, taking all of the above into account, these results do not challenge toxin-immunity specificity dogma.

      In this manuscript, we did not intend to dismiss the E-I specificity model but rather point out its limitations and propose an important expansion of that model that accounts for cross-protection and survival against attacks from other genera. We agree that it is commonly considered that deviations in amino acid sequence over time could result in cross-binding and protection (see lines 364-368). However, the impacts of such cross-binding on community structure, bacterial survival, and strain evolution have rarely been considered or addressed in prior literature, with exceptions such as in Zhang et al., 2013 and Bosch et al., 2023. One key insight we propose and show in this manuscript is that cross-binding can be a fitness benefit in mixed communities; therefore, it could be selected for evolutionarily (lines 378-380), even potentially in host microbiomes.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Knecht et al entitled "Non-cognate immunity proteins provide broader defenses against interbacterial effectors in microbial communities" aims at characterizing a new type VI secretion system (T6SS) effector immunity pair using genetic and biochemical studies primarily focused on Proteus mirabilis and metagenomic analysis of human-derived data focused on Rothia and Prevotella sequences. The authors provide evidence that RdnE and RdnI of Proteus constitute an E-I pair and that the effector likely degrades nucleic acids. Further, they provide evidence that expression of non-cognate immunity derived from diverse species can provide protection against RdnE intoxication. Overall, this general line of investigation is underdeveloped in the T6SS field and conceptually appropriate for a broad audience journal. The paper is well-written and, aside from a few cases, well-cited. As detailed below however, there are several aspects of this paper where the evidence provided is somewhat insufficient to support the claims. Further, there are now at least two examples in the literature of non-cognate immunity providing protection against intoxication, one of which is not cited here (Bosch et al PMID 37345922 - the other being Ting et al 2018). In general therefore I think that the motivating concept here in this paper of overturning the predominant model of interbacterial effector-immunity cognate interactions is oversold and should be dialed back.

      We agree that analyses focusing on flexible non-cognate interactions and protection are underdeveloped within the T6SS field and are not fully explored within a community structure. These ideas are rapidly growing in the field, as evidenced by the references provided by the reviewer. As stated earlier, we did not intend to overturn the prevailing model but rather propose an expanded model that accounts for protection against attacks from foreign genera.

      Strengths:

      One of the major strengths of this paper is the combination of diverse techniques including competition assays, biochemistry, and metagenomics surveys. The metagenomic analysis in particular has great potential for understanding T6SS biology in natural communities. Finally, it is clear that much new biology remains to be discovered in the realm of T6SS effectors and immunity.

      Weaknesses:

      The authors have not formally shown that RdnE is delivered by the T6SS. Is it the case that there are not available genetics tools for gene deletion for the BB2000 strain? If there are genetic tools available, standard assays to demonstrate T6SS-dependency would be to interrogate function via inactivation of the T6SS (e.g. by deleting tssC).

      Our research group showed that the T6SS secretes RdnE (previously IdrD) in Wenren et al., 2013 (cited in lines 71-73). We later confirmed T6SS-dependent secretion by LC-MS/MS (Saak et al., 2017).

      For swarm cross-phyla competition assays (Figure 4), at what level compared to cognate immunity are the non-cognate immunity proteins being expressed? This is unclear from the methods and Figure 4 legend and should be elaborated upon. Presumably these non-cognate immunity proteins are being overexpressed. Expression level and effector-to-immunity protein stoichiometry likely matters for interpretation of function, both in vitro as well as in relevant settings in nature. It is important to assess if native expression levels of non-cognate cross-phyla immunity (e.g. Rothia and Prevotella) protect similarly as the endogenously produced cognate immunity. This experiment could be performed in several ways, for example by deleting the RdnE-I pair and complementing back the Rothia or Prevotella RdnI at the same chromosomal locus, then performing the swarm assay. Alternatively, if there are inducible expression systems available for Proteus, examination of protection under varying levels of immunity induction could be an alternate way to address this question. Western blot analysis comparing cognate to non-cognate immunity protein levels expressed in Proteus could also be important. If the authors were interested in deriving physical binding constants between E and various cognate and non-cognate I (e.g. through isothermal titration calorimetry) that would be a strong set of data to support the claims made. The co-IP data presented in supplemental Figure 6 are nice but are from E. coli cells overexpressing each protein and do not fully address the question of in vivo (in Proteus) native expression.

      P. mirabilis strain ATCC29906 does not encode the rdnE and rdnI genes on the chromosome (NCBI BioSample: SAMN00001486) (line 151). Production of the RdnI proteins, including the cognate Proteus RdnI, comes from equivalent transgenic expression vectors. Specifically, the rdnI genes were expressed under the flaA promoter in P. mirabilis strain ATCC29906 (Table 1) for the swarm competition assays found in Figure 2C and Figure 4. This promoter results in constitutive expression in swarming cells (Belas et al., 1991; Jansen et al., 2003).

      Lines 321-324, the authors infer differences between E and I in terms of read recruitment (greater abundance of I) to indicate the presence of orphan immunity genes in metagenomic samples (Figure 5A-D). It seems equally or perhaps more likely that there is substantial sequence divergence in E compared to the reference sequence. In fact, metagenomes analyzed were required only to have "half of the bases on reference E-I sequence receiving coverage". Variation in coverage again could reflect divergent sequence dipping below 90% identity cutoff. I recommend performing metagenomic assemblies on these samples to assess and curate the E-I sequences present in each sample and then recalculating coverage based on the exact inferred sequences from each sample.

      This comment raises the challenges with metagenomic analyses. It was difficult to balance specificity to a particular species’ DNA sequence with the prevalence of any homologous sequence in the sample. Given the distinction in binding interactions among the examined four species, we opted to prioritize specificity, accepting that we were losing access to some rdnE and rdnI sequences in that decision. We chose a 90% identity cutoff, which, through several in silico controls, ensured that each sequence we identified was the rdnE or rdnI gene from that specific species. For the Version of Record, we will revisit this decision and consider trying to account for sequence divergence by lowering the identity cutoffs as suggested.
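
      The trade-off described above can be made concrete with a small sketch of a read-recruitment filter. This is not the authors' pipeline; it only assumes that each recruited read is summarized as a (start, end, percent identity) interval on the reference E-I sequence, and it applies the two thresholds mentioned in this exchange (a 90% identity cutoff and at least half of the reference bases covered). The function names and the example reads are hypothetical.

```python
from typing import Iterable, Tuple

# (start, end_exclusive, percent_identity) of one read aligned to the reference
Alignment = Tuple[int, int, float]


def breadth_of_coverage(
    alignments: Iterable[Alignment],
    ref_length: int,
    min_identity: float = 90.0,
) -> float:
    """Fraction of reference bases covered by reads passing the identity cutoff."""
    covered = [False] * ref_length
    for start, end, identity in alignments:
        if identity < min_identity:
            continue  # read is too divergent to count as recruited
        for pos in range(max(start, 0), min(end, ref_length)):
            covered[pos] = True
    return sum(covered) / ref_length


def sample_passes(
    alignments: Iterable[Alignment],
    ref_length: int,
    min_identity: float = 90.0,
    min_breadth: float = 0.5,
) -> bool:
    """True if at least `min_breadth` of the reference receives coverage."""
    return breadth_of_coverage(alignments, ref_length, min_identity) >= min_breadth


# Invented reads on a 100-base reference.
reads = [(0, 40, 95.0), (30, 60, 92.0), (60, 100, 85.0)]
print(breadth_of_coverage(reads, 100))                     # 90% cutoff
print(breadth_of_coverage(reads, 100, min_identity=80.0))  # relaxed cutoff
```

      In this toy case the read at 85% identity is excluded under the 90% cutoff and included under the relaxed one, which is the kind of divergent sequence the reviewer suggests may be depressing effector coverage.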

      A description of gene-level read recruitment in the methods section relating to metagenomic analysis is lacking and should be provided.

      Noted. We will also include the raw code and sequences on the OSF website associated with this manuscript https://osf.io/scb7z/.

      Reviewer #3 (Public Review):

      [...] Strengths:

      The authors presented a strong rationale in the manuscript and characterized the molecular mechanism of the RdnE effector both in vitro and in the heterologous expression model. The utilization of the bacterial two-hybrid system, along with the competition assays, to study the protective action of RdnI immunity is informative. Furthermore, the authors conducted bioinformatic analyses throughout the manuscript, examining the primary sequence, predicted structural, and metagenomic levels, which significantly underscore the significance and importance of the EI pair.

      Weaknesses:

      1. The interaction between RdnI and RdnE appears to be complex and requires further investigation. The manuscript's data does not conclusively explain how RdnI provides a "promiscuous" immunity function, particularly concerning the RdnI mutant/chimera derivatives. The lack of protection observed in these cases might be attributed to other factors, such as a decrease in protein expression levels or misfolding of the proteins. Additionally, the transient nature of the binding interaction could be insufficient to offer effective defenses.

      Yes, we agree with the reviewer and hope that grant reviewers share this colleague’s enthusiasm for understanding the detailed molecular mechanisms of RdnE-RdnI binding across genera. We will continue to emphasize such caveats, as the next frontier is clearly understanding the molecular mechanisms for RdnI cognate or non-cognate protection. We address the concerns regarding expression levels in the response to reviewer 2, comment 2.

      2. The results from the mixed population competition lack quantitative analysis. The swarm competition assays only yield binary outcomes (Yes or No), limiting the ability to obtain more detailed insights from the data.

      The mixed swarm assay is needed when studying T6SS effectors that are primarily secreted during Proteus’ swarming activity (Saak et al. 2017, Zepeda-Rivera et al. 2018). This limitation is one reason we utilize in vitro, in vivo, and bioinformatic analyses. Though the swarm competition assay yields a binary outcome, we are confident that the observed RdnI protection is due to interaction with a trans-cell RdnE via an active T6SS. By contrast, many manuscripts report co-expression of the EI pair (Yadev et al., 2021, Hespanhol et al., 2022) rather than secreted effectors, as we have achieved in this manuscript.

      3. The discovery of cross-species protection is solely evident in the heterologous expression-competition model. It remains uncertain whether this is an isolated occurrence or a common characteristic of RdnI immunity proteins across various scenarios. Further investigations are necessary to determine the generality of this behavior.

      We agree, which is why we submitted this paper as a launching point for further investigations into the generality of non-cognate interactions and their potential impact on community structure.

      Comments from Reviewing Editor:

      • In addition to the references provided by reviewer #2, the first manuscript to show non-cognate binding of immunity proteins was Russell et al 2012 (PMID: 22607806).
      • IdrD was shown to form a subfamily of effectors in this manuscript by Hespanhol et al 2022 PMID: 36226828 that analyzed several T6SS effectors belonging to PDDExK, and it should be cited.

      We appreciate that the reviewer and eLife staff pointed out missed citations. A revised manuscript will incorporate those studies and cite them appropriately.

      [1] The Proteus RdnE in complex with either the Prevotella or Pseudomonas RdnI showed low confidence at the interface (pLDDT ~50-70%); this AI-predicted complex might support the lack of binding seen in the bacterial two-hybrid assay. On the other hand, the Proteus and Rothia RdnI N-terminal regions show higher confidence at the interface with RdnE. Despite this, the C-terminus of the Proteus RdnI shows especially low confidence (pLDDT ~50%) where it might interact near RdnE’s active site (as suggested by reviewer 1). Given this low confidence and the already stated inaccuracies of AI-generated complexes, we would rather wait for crystallization data to inform potential protection mechanisms of RdnI.

      Author response image 1.

    1. Author response:

      Description of the planned revisions

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary

      The authors focused on medaka retinal organoids to investigate the mechanism underlying eye cup morphogenesis. The authors succeeded in inducing lens formation in fish retinal organoids using 3D suspension culture in minimal growth factor-containing media supplemented with HEPES. At day 1, Rx3:H2B-GFP+ cells appear in the surface region of organoids. At day 1.5, Prox1+ cells appear in the interface area between the organoid surface and the core of the central cell mass, which later develops into a spherical lens. Thus, Prox1+ cells cover the surface of the internal lens cell core. At day 2, foxe3:GFP+ cells appear in the Prox1+ area, where the early lens fiber marker LFC starts to be expressed. In addition, foxe3:GFP+ cells incorporate EdU, indicating that foxe3:GFP+ cells have lens epithelial cell character. At day 4, cry:EGFP+ cells differentiate inside the spherical lens core, whose surface area consists of LFC+ and Prox1+ cells. Furthermore, at day 4, the lens core moves towards the surface of retinal organoids to form an eye-cup-like structure, although this "inside-out" morphogenetic mechanism differs from the cellular "outside-in" mechanism of eye cup formation in vivo. From these data, the authors conclude that optic cup formation, especially the positioning of the lens, is established in retinal organoids through a mechanism different from that of in vivo morphogenesis.

      Overall, the manuscript is nicely presented. However, some points regarding the underlying mechanism remain unclear. My comments are given below.

      Major comments

      (1) At the initial stage of retinal organoid morphogenesis, a spherical lens is centrally positioned inside the retinal organoid, with the central lens core covered by an outer sheet of retinal precursor cells. I wonder if the formation of this structure could be explained by differential cell adhesive activity or mechanical tension between lens core cells and the retinal cell sheet, as in the previous study from the Heisenberg lab on the spatial patterning of endoderm, mesoderm and ectoderm (Nat. Cell Biol. 10, 429 - 436 (2008)). Lens core cells may be integrated inside the retinal cell mass by cell sorting through direct interaction between retinal cells and lens cells, or between lens cells and the culture media. After day 1, the movement of the lens core towards the surface of the retinal organoid could also be explained if the adhesive/tensile state of lens core cells changes through secretion of extracellular matrix. I wonder if the authors have measured the physical properties (adhesive activity and solidness) of retinal precursor cells and lens core cells. If retinal organoids at day 1 are dissociated and cultured again, do they show the same pattern of an internal lens core covered by an outer retinal cell sheet?

      The question of whether different adhesive activities are involved in cell sorting and lens formation is indeed very intriguing. To address this point, we will include an additional experiment (see Revision Plan, experiment 1). This experiment will be based on the dissociation and re-aggregation of lens-forming organoids as suggested by the reviewer. To monitor cell type specific sorting, we will employ a lens progenitor reporter line Foxe3::GFP and the retina-specific Rx2::H2B-RFP. If different adhesive activities of lens and retinal progenitor cells are involved and drive the process of cell sorting, dissociation and re-aggregation will result in cell sorting based on their identity.

      (2) The optic cup is evaginated from the lateral wall of the neuroepithelium of the diencephalon. In zebrafish, cell movement occurs from the pigment epithelium to the neural retina during eye morphogenesis in an FGF-dependent manner. How is medaka optic cup morphogenesis coordinated? I also wonder whether the authors could track cell migration during optic cup morphogenesis to reveal how cell migration and cell division are regulated in the lens of the medaka retinal organoids. It would also be interesting to examine how retinal cell movement is coordinated in medaka retinal organoids.

      Looking in detail at how the optic cup-like tissue arrangement of ocular organoids is achieved at the cellular level is of course interesting. Our previous study showed that optic vesicles of medaka retinal organoids do not form optic cups (for details please see Zilova et al., 2021, eLife). We assume that the formation of the cup-like structure of the ocular organoids is mediated by the following processes: establishment of retina and lens domains at specific regions of the organoid, with retina on the surface and lens in the center (see Figure S2 d and Figure 3e, and Figure 4). Subsequent displacement of the centrally formed lens towards the organoid periphery through the retina layer places the lens at the periphery while retinal cells stay static. We assume that the “cup-like” shape is acquired by extrusion of the lens from the center of the organoid. To clarify this process with respect to tissue rearrangements and cell movements, we will include additional experiments (see Revision Plan, experiment 2) and follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion to dissect the individual contributions of retinal and lens cells to this process (cross-reference with Reviewer #2).

      (3) The authors showed that blockade of FGF signaling affects lens fiber differentiation in day 1-2, whereas lens formation seems to be intact in the presence of FGF receptor inhibitor in day 0-1. I suggest that the authors examine which tissue is a target of FGF signaling in retinal organoids, using markers such as pea3, which is a downstream target of the ERK branch of FGF signaling. Since FGF signaling promotes cell proliferation, is the lens core size normal in SU5402-treated organoids from day 0 to day 1?

      Assessing the activity of FGF signaling (cross-reference to Reviewer #3) in the organoids is indeed an important point. To address which tissue is the target of FGF signaling we will include additional experiments and assess the phosphorylation status of ERK (pERK) and expression of the ERK downstream target pea3, as suggested by the reviewer (see Revision Plan, experiment 3). That will allow us to identify the tissue within the organoid responding to FGF signaling.

      Lens core size of organoids treated with SU5402 from day 0 to day 1 is fully comparable to the control (please see Figure 6b).

      (4) Fig. 3f and 3g indicate that there is a cell population located between foxe3:GFP+ cells and rx2:H2B-RFP+ cells. What cell type occupies the interface area between foxe3:GFP+ cells and rx2:H2B-RFP+ cells?

      That is for sure an interesting question. We are aware of this population of cells. We currently do not have data that would with certainty clarify the fate of those cells. We are currently following up on that question with the use of scRNA sequencing, however we will not be able to address this question in the current manuscript.

      (5) Fig. 5e indicates the depth of Rx3 expression at day 1. Is the depth the thickness of the Rx3-expressing cell sheet that covers the central lens core in the organoids? If so, I wonder whether the total number of cells in the Rx3-expressing sheet differs with the number of seeded cells: the thickness is the same across seeded-cell numbers, but the surface area may differ depending on the size of the underlying lens core. Please clarify this point.

      Yes. Figure 5e indicates the thickness of the cell sheet expressing Rx3 that lies on the surface of the organoid. Indeed, the number of Rx3-expressing cells (and lens cells) scales with the size of the organoid as stated in the submitted manuscript.
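      The geometric point in this exchange (constant sheet thickness, yet more Rx3-expressing cells in larger organoids) can be made concrete with a rough sketch. It rests on assumptions that are not in the manuscript: the organoid is idealized as a sphere, the Rx3-expressing sheet as a uniform outer shell, and cell density as constant, so that cell number is proportional to shell volume. The function name and the example radii are hypothetical.

```python
import math


def shell_volume(outer_radius: float, thickness: float) -> float:
    """Volume of a spherical shell of given thickness at the surface of a sphere.

    Idealization (assumption): organoid = sphere, Rx3 sheet = uniform outer shell.
    Units are arbitrary but must be the same for both arguments.
    """
    if thickness < 0 or thickness > outer_radius:
        raise ValueError("thickness must lie between 0 and outer_radius")
    inner_radius = outer_radius - thickness
    return 4.0 / 3.0 * math.pi * (outer_radius**3 - inner_radius**3)


if __name__ == "__main__":
    # Same shell thickness, doubled organoid radius (made-up numbers).
    small = shell_volume(outer_radius=100.0, thickness=20.0)
    large = shell_volume(outer_radius=200.0, thickness=20.0)
    print(large / small)
```

      Under these assumptions the shell volume, and hence the expected number of cells in the sheet, grows with organoid radius even when the measured thickness does not change.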

      (6) Noggin application inhibits lens formation at day 0-1. BMP signaling regulates formation of the lens placode and olfactory placode at the early stage of development. It would be interesting to examine whether Noggin-treated organoids show an expanded olfactory placode area. Please check forebrain territory markers.

      What tissue differentiates at the expense of the lens in BMP inhibitor-treated organoids is of course an intriguing question. To address the identity of cells differentiated under this condition we will include an additional experiment (see Revision Plan, experiment 4 as suggested by the reviewer). We will check for the expression of Lhx2, Otx2 and HuC/D to address this point.

      I have no minor comments

      Referees cross-commenting

      I agree that all reviewers have similar suggestions, which are reasonable and provided the same estimated time for revision.

      Reviewer #1 (Significance):

      Strength:

      This study is unique. The authors examined eye cup morphogenesis using fish retinal organoids. The eye cup normally consists of the lens, the neural retina, pigment epithelium and optic stalk. However, retinal organoids seem to be simple and consist of two cell types, lens and retina. Interestingly, a similar optic cup-like structure is achieved in both cases; however, the underlying mechanism is different. It is interesting to investigate how eye morphogenesis is regulated in retinal organoids, under the unconstrained embryo-free environment.

      Limitation:

      The description is adequate, but the analysis lacks depth. A bit more molecular- and cellular-level analysis is needed, such as tracking of cell movement and visualization of FGF signaling in organoid tissues.

      Advancement:

      The current study is descriptive. It needs some conceptual advance that impacts the cell biology field or medical science.

      Audience:

      The target audience of the current study is still within the ophthalmology and neuroscience communities, maybe translational/clinical rather than basic biology. To reach beyond these specific fields, the authors need to formulate a general principle for cell and developmental biology.

      Reviewer #2 (Evidence, reproducibility and clarity):

      In this study from Stahl et al., the authors demonstrate that medaka pluripotent embryonic cells can self-organise into eye organoids containing both retina and lens tissues. While these organoids can self-organize into an eye structure that resembles the vertebrate eye, they are built from a fundamentally different morphogenetic process – an “inside-out” mechanism where the lens forms centrally and moves outward, rather than the normal “outside-in” embryonic process. This is a very interesting discovery, both for our understanding of developmental biology and the potential for tissue engineering applications. The study would benefit from some additional experiments and a few clarifications.

      The authors suggest that the lens cells are the ones that move from the central to a more superficial position. Is this an active movement of lens cells or just the passive consequence of the retina cells acquiring a cup shape? Are the retina cells migrating behind the lens or the lens cells pushing outwards? High-resolution imaging of organoid cup formation, tracking retina cells in combination with membrane labeling of all cells would help elucidate the morphogenetic processes occurring in the organoids. Membrane labeling would also be useful as Prox1 positive lens cells appear elongated in embryos while in the organoids, cell shapes seem less organised, less compact and not elongated (for example as shown in Fig 3f,g).

      Looking in detail at how the optic cup-like tissue arrangement of ocular organoids is achieved at the cellular level is of course interesting. We assume that the formation of cup-like structures of the ocular organoids is mediated by the following processes: establishment of retina and lens domains at specific regions of the organoid, with retina on the surface and lens in the center (see Figure S2 d and Figure 3e, and Figure 4). Subsequent displacement of the centrally formed lens towards the organoid periphery through the retina layer places the lens at the periphery while retinal cells stay static. We assume that the “cup-like” shape is acquired by extrusion of the lens. To clarify this process with respect to tissue rearrangements and cell movements, we will include additional experiments (see Revision Plan, experiment 2). We will follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion to dissect the individual contributions of retinal and lens cells to this process (cross-reference with Reviewer #1).

      The organoids could be a useful tool to address how cell fate is linked to cell shape acquisition. In the forming organoids, retinal tissue initially forms on the outside, while non-retinal tissue is located in the centre; this central tissue later expresses lens markers. Do the authors have any insights into why fate acquisition occurs in this pattern? Is there a difference in proliferation rates between the centrally located cells and the external ones? Could it be that highly proliferative cells give rise to neural retina (NR), while lower proliferating cells become lens?

      The question of how the retinal and lens domains are established in this specific manner is indeed intriguing and very interesting. We dedicated a part of the discussion to this topic. We discuss the role of the diffusion limit and the potential contribution of BMP and FGF signaling to this arrangement. Additional experiments (see Revision Plan, experiment 3) addressing the source and target tissues of FGF and BMP signaling in the organoid will ultimately bring more clarity to our understanding of the tissue arrangements in the organoid.

      Although analysis of the proliferation rate of the cells at the surface and in the central region of the organoid might possibly show some differences in the proliferation rates between lens and retinal cells, we do not have any indications, that the proliferation rate itself would be instructive or superior to the cell fate decisions.

      What happens in organoids that do not form lenses? Do these organoids still generate foxe3 positive cells that fail to develop into a proper lens structure? And in the absence of lens formation, does the retina still acquire a cup shape?

      Lens formation is primarily dependent on acquisition/specification of Foxe3-expressing lens placode progenitors. If those are not present, a lens does not develop. Once Foxe3-expressing progenitors are established, a lens is formed in unperturbed conditions (measured by the presence of expression of crystallin proteins). In such conditions, organoids that do not have a lens, do not carry Foxe3-expressing cells.

      In the absence of the lens, the organoid is composed of retinal neuroepithelium that does not form an optic cup (for details of such phenotypes please see Zilova et al., 2021, eLife).

      The authors suggest that lens formation occurs even in the absence of Matrigel. Is the process slower in these conditions? Are the resulting organoids smaller? While there are indeed some LFC-expressing cells by day 2, these cells are not very well organised and the pattern of expression seems dotty. Moreover, LFC staining seems to localise posterior to the LFC-negative, lens-like structure (e.g. Fig.S1 3o’clock).

      How do these organoids develop beyond day 4? Do they maintain their structural integrity at later stages?

      The role of HEPES in promoting organoid formation is intriguing. Do the authors have any insights into why it is important in this context? Have the authors tried other culture conditions and does culture condition influence the morphogenetic pathways occurring within the organoids?

      We thank the reviewer for pointing this out. We were not clear in the wording and description of our observation. Indeed, Matrigel is not required for acquisition of lens fate, which can be demonstrated by the expression of lens-specific markers. However, the presence of Matrigel has a profound impact on the structural aspects of organoid formation. Matrigel is essential for organization of retinal-committed cells into the retinal epithelium (Zilova et al., 2021, eLife). The absence of the structure of the retinal epithelium can indeed negatively impact the cellular organization and the overall lens structure. To clarify the contribution of Matrigel to the speed of organoid lens development and to the overall structure of the organoid lens, we will perform additional experiments (see Revision Plan, experiment 5). Using the Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel (cross-reference with Reviewer #3).

      The role of the HEPES in lens formation is indeed very intriguing and currently under investigation. As HEPES is mainly used to regulate pH of the culture media and pH might have an impact on multiple cellular processes, it will require significant time investment to dissect molecular mechanism underlying the effect of HEPES on the process of lens formation (cross reference with Reviewer #3) and therefore cannot be addressed in the current manuscript.

      Referees cross-commenting

      Pleased to see that all the other reviewers are positive about the study and raise similar concerns and comments

      Reviewer #2 (Significance):

      This is a very interesting paper, and it will be important to determine whether this alternative morphogenetic process is specific to medaka or if similar developmental routes can be recapitulated in organoid cultures from other vertebrate species.

      Reviewer #3 (Evidence, reproducibility and clarity):

      Summary:

      The manuscript by Stahl and colleagues reports an approach to generate ocular organoids composed of retinal and lens structures, derived from Medaka blastula cells. The authors present a comprehensive characterisation of the timeline followed by lens and retinal progenitors, showing these have distinct origins, and that they recapitulate the expression of differentiation markers found in vivo. Despite this molecular recapitulation, morphogenesis is strikingly different, with lens progenitors arising at the centre of the organoid, and subsequently translocating to the outside.

      Comments:

      - The manuscript presents a beautiful set of high quality images showing expression of lens differentiation markers over time in the organoids. The set of experiments is very robust, with high numbers of organoids analysed and reproducible data. The mechanism by which lens specification is promoted in these organoids is, however, poorly analysed, and the reader does not get a clear understanding of what is different in these experiments, as compared to previous attempts, to support lens differentiation. There is a mention of HEPES supplementation, but no further analysis is provided, and the fact that the process is independent of ECM contradicts, as the authors point out, previous reports. The manuscript would benefit from a more detailed analysis of the mechanisms that lead to lens differentiation in this setting.

      The role of the HEPES in lens formation is indeed very intriguing and under current investigation. As HEPES is mainly used to regulate pH of the culture media and pH might have an impact on multiple cellular processes it will require a significant time investment to dissect molecular mechanism underlying the effect of HEPES on the process of lens formation (cross reference with Reviewer #2) and therefore unfortunately cannot be addressed in the current manuscript.

      To clarify the contribution of Matrigel to organoid lens development, we will perform additional experiments (see Revision Plan, experiment 5). Using the Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel (cross-reference with Reviewer #2).

      - The markers analysed to show onset of lens differentiation in the organoids seem to start being expressed, in vivo, when the lens placode starts invaginating. An analysis of earlier stages is not presented. This would be very informative, allowing one to determine whether progenitors differentiate as placode and neuroepithelium first, to subsequently continue differentiating into lens and retina, respectively. Could early placodal and anterior neural plate markers be analysed in the organoids? This would provide a more complete sequence of lens vs retina differentiation in this model.

      Yes. The figures show the expression of lens and retinal markers in the embryo in later developmental stages and the timing of their expression can be documented with higher temporal resolution. In the revised version of the manuscript, we will provide the information about the onset of expression of Rx3::H2B-GFP (retina) and Foxe3::GFP (lens) (see Author response image 1). Rx3 represents one of the earliest markers labeling the presumptive eye field within the region of the anterior neural plate (S16, late gastrula). Foxe3::GFP expression can be detected within the head surface ectoderm before the lens placode is formed, showing that Foxe3 is a suitable marker of placodal progenitors in medaka.

      We are convinced that the onset of Rx3 and Foxe3-driven reporters is early enough to make the claim about the separate origin of the lens (placodal) and retinal (anterior neuroectoderm) tissues within the ocular organoids.

      Author response image 1.

      - The analysis of BMP and Fgf requirement for lens formation and differentiation is suggestive, but the source of these signals is not resolved or mentioned in the manuscript. Are BMP4 and Fgf8 expressed by the organoids? Where are they coming from?

      Indeed, addressing the source of BMP and FGF activation would bring more clarity in understanding the mechanism of retina/lens specification within the ocular organoids (cross reference with Reviewer #1). To address this point, we will include additional experiments (see Revision Plan, experiment 3). We will analyze the expression of respective ligands (Bmp4 and Fgf8) and activation of downstream effectors of BMP and FGF signaling pathways within the ocular organoids as suggested by Reviewer #1 and Reviewer #3.

      - The fact that the lens becomes specified in the centre of the organoid is striking, but it is for me difficult to visualise how it ends up being extruded from the organoid. Did the authors try to follow this process in movies? I understand that this may be technically challenging, but it would certainly help to understand the process that leads to the final organisation of retinal and lens tissues in the organoid. There is no discussion of why the morphogenetic mechanism is so different from the in vivo situation. The manuscript would benefit from explicitly discussing this.

      Following the extruding lens in vivo is indeed a very relevant suggestion. To clarify the process of ocular organoid formation with respect to tissue rearrangements and cell movements, we will include an additional experiment (see Revision Plan, experiment 2). We will follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion (cross-reference with Reviewer #1 and Reviewer #2).

      Referees cross-commenting

      We all seem to have similar comments and concerns. I think overall the suggestions are feasible and realistic for the timeframe provided.

      Reviewer #3 (Significance):

      This study describes a reproducible approach to differentiate ocular organoids composed of lens and retinal tissues. The characterisation of lens differentiation in this model is very detailed, and despite the morphogenetic differences, the molecular mechanisms show many similarities to the in vivo situation. The manuscript however does not highlight, in my opinion, why this model may be relevant. Clearly articulating this relevance, particularly in the discussion, will enhance the study and provide more clarity to the readers regarding the significance of the study for the field of organoid research, ocular research and regenerative studies.

      Revision Plan:

      (1) To address whether differential adhesion properties of retinal and lens progenitors mediate cell sorting to establish retina and lens domains in the organoids (Reviewer #1, comment 1), we will perform dissociation of the organoids on day 1 and subsequent re-aggregation. This experiment will allow us to follow cell type specific adhesion properties of lens and retinal progenitor cells. We will employ the lens progenitor reporter line Foxe3::GFP and the retina-specific Rx2::H2B-RFP to monitor cell type specific sorting with fluorescence microscopy.

      (2) Multiple reviewers (Reviewer #1, Reviewer #2, Reviewer #3) asked for a detailed in vivo imaging experiment showing the individual contributions of retina- and lens-fated cells to the resulting tissue organization within the ocular organoid. We will perform an in vivo live imaging experiment to follow the movements of individual lens (Foxe3::GFP) and retinal (Rx2::H2B-RFP) cells from day 1 to day 2 of organoid development to address this point.

      (3) Reviewer #1 and Reviewer #3 raised questions concerning the role of FGF and BMP signaling and the sources of these signaling pathway activities in ocular organoid tissue arrangement. To address this point and shed more light on the molecular mechanisms regulating lens and retina tissue arrangement in the organoid, we will perform an additional experiment. We will assess the expression of candidate FGF and BMP ligands (Fgf8, Bmp7 and Bmp4), the activation of downstream effectors (p-ERK, p-SMAD) and the direct transcriptional target of FGF signaling (Pea3) in the developing organoids. This will allow the identification of the tissue producing the ligand on the one side and the tissue responding to the signaling on the other side, and will help to narrow down the molecular mechanism controlling tissue arrangements in the organoid.

      (4) We will analyze the expression of forebrain territory markers in organoids treated with the BMP inhibitor to determine the identity of the tissue differentiating at the expense of the lens under BMP inhibition (suggested by Reviewer #1). We will label Noggin-treated organoids with antibodies against Lhx2, Otx2 and HuC/D to address this point.

      (5) We will provide a more comprehensive analysis of the organoids grown without Matrigel and compare them to the organoids grown in the presence of Matrigel (mentioned by Reviewer #2 and Reviewer #3). Using the lens progenitor-specific Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel.

      Description of analyses that authors prefer not to carry out

      Reviewer #1:

      (4) Fig. 3f and 3g indicate that there is a cell population located between foxe3:GFP+ cells and rx2:H2B-RFP+ cells. What cell type occupies the interface area between foxe3:GFP+ cells and rx2:H2B-RFP+ cells?

      That is for sure an interesting question. We are aware of this population of cells. We currently do not have data that would with certainty clarify the fate of those cells. We are currently following up on that question with the use of scRNA sequencing, however we will not be able to address this question in the current manuscript.

      Reviewer #2:

      The role of HEPES in promoting organoid formation is intriguing. Do the authors have any insights into why it is important in this context? Have the authors tried other culture conditions and does culture condition influence the morphogenetic pathways occurring within the organoids?

      The role of HEPES in lens formation is indeed very intriguing and is under current investigation. HEPES is mainly used to regulate the pH of the culture medium, and pH might affect multiple cellular processes. Dissecting the molecular mechanism underlying the effect of HEPES on lens formation will therefore require a significant time investment (cross-reference with Reviewer #3) and cannot be addressed in the current manuscript.

      Is there a difference in proliferation rates between the centrally located cells and the external ones? Could it be that highly proliferative cells give rise to neural retina (NR), while lower proliferating cells become lens?

      Although an analysis of proliferation at the surface and in the central region of the organoid might reveal some differences in proliferation rate between lens and retinal cells, we do not have any indication that the proliferation rate itself would be instructive for, or take precedence over, the cell fate decisions.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper is an elegant, mostly observational work, detailing observations that polysome accumulation appears to drive nucleoid splitting and segregation. Overall I think this is an insightful work with solid observations.

      Thank you for your appreciation and positive comments. In our view, an appealing aspect of this proposed biophysical mechanism for nucleoid segregation is its self-organizing nature and its ability to intrinsically couple nucleoid segregation to biomass growth, regardless of nutrient conditions.

      Strengths:

      The strengths of this paper are the careful and rigorous observational work that leads to their hypothesis. They find the accumulation of polysomes correlates with nucleoid splitting, and that the nucleoid segregation occurring right after splitting correlates with polysome segregation. These correlations are also backed up by other observations:

      (1) Faster polysome accumulation and DNA segregation at faster growth rates.

      (2) Polysome distribution negatively correlating with DNA positioning near asymmetric nucleoids.

      (3) Polysomes form in regions inaccessible to similarly sized particles.

      These above points are observational, I have no comments on these observations leading to their hypothesis.

      Thank you!

      Weaknesses:

      It is hard to state weaknesses in any of the observational findings, and furthermore, their two tests of causality, while not being completely definitive, are likely the best one could do to examine this interesting phenomenon.

      It is indeed difficult to prove causality in a definitive manner when the proposed coupling mechanism between nucleoid segregation and gene expression is self-organizing, i.e., does not involve a dedicated regulatory molecule (e.g., a protein, RNA, metabolite) that we could have depleted through genetic engineering to establish causality. We are grateful to the reviewer for recognizing that our two causality tests are the best that can be done in this context.

      Points to consider / address:

      Notably, demonstrating causality here is very difficult (given the coupling between transcription, growth, and many other processes) but an important part of the paper. They do two experiments toward demonstrating causality that help bolster - but not prove - their hypothesis. These experiments have minor caveats, my first two points.

      (1) First, "Blocking transcription (with rifampicin) should instantly reduce the rate of polysome production to zero, causing an immediate arrest of nucleoid segregation". Here they show that adding rifampicin does indeed lead to polysome loss and an immediate halting of segregation - data that does fit their model. This is not definitive proof of causation, as rifampicin also (a) stops cell growth, and (b) stops the translation of secreted proteins. Neither of these two possibilities is ruled out fully.

      That’s correct; cell growth also stops when gene expression is inhibited, which is consistent with our model in which gene expression within the nucleoid promotes nucleoid segregation and biomass growth (i.e., cell growth), inherently coupling these two processes. This said, we understand the reviewer’s point: the rifampicin experiment doesn’t exclude the possibility that protein secretion and cell growth drive nucleoid segregation. We are assuming that the reviewer is envisioning an alternative model in which sister nucleoids would move apart because they would be attached to the membrane through coupled transcription-translation-protein secretion (transertion) and the membrane would expand between the separating nucleoids, similar to the model proposed by Jacob et al in 1963 (doi:10.1101/SQB.1963.028.01.048). There are several observations arguing against this cell elongation/transertion model.

      (1) For this alternative mechanism to work, membrane growth must be localized at the middle of the splitting nucleoids (i.e., midcell position for slow growth and ¼ and ¾ cell positions for fast growth) to create a directional motion. To our knowledge, there is no evidence of such localized membrane incorporation. Furthermore, even if membrane growth was localized at the right places, the fluidity of the cytoplasmic membrane (PMID: 6996724, 20159151, 24735432, 27705775) would be problematic. To circumvent the membrane fluidity issue, one could potentially evoke an additional connection to the rigid peptidoglycan, but then again, peptidoglycan growth would have to be localized at the middle of the splitting nucleoid. However, peptidoglycan growth is dispersed early in the cell division cycle when the nucleoid splitting happens in fast growing cells and only appears to be zonal after the onset of cell constriction (PMID: 35705811, 36097171, 2656655).

      (2) Even if we ignore the aforementioned caveats, Paul Wiggins’s group ruled out the cell elongation/transertion model by showing that the rate of cell elongation is slower than the rate of chromosome segregation (PMID: 23775792). In the revised manuscript, we will clarify this point and provide confirmatory data showing that the cell elongation rate is indeed slower than the nucleoid segregation rate, indicating that it cannot be the main driver.

      (3) Furthermore, our correlation analysis comparing the rate of nucleoid segregation to the rate of either cell elongation or polysome accumulation argues that polysome accumulation plays a larger role than cell elongation in nucleoid segregation. These data were already shown in Figure 1H and Figure 1 – figure supplement 3 of the original manuscript but were not highlighted in this context. We will revise the text to clarify this point.
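      The comparison described in point (3) can be sketched in a few lines of code. This is only a minimal illustration, assuming that the analysis uses a Pearson coefficient computed from per-cell rates; the statistic actually used in the manuscript may differ (e.g., Spearman), and the function and argument names below are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must be 1D sequences of equal length >= 2")
    return float(np.corrcoef(x, y)[0, 1])


def compare_drivers(segregation_rate, elongation_rate, polysome_rate):
    """Correlate the nucleoid segregation rate with each candidate driver.

    All three inputs are hypothetical per-cell rate measurements
    (one value per cell, same cell order in each sequence).
    """
    return {
        "cell_elongation": pearson_r(segregation_rate, elongation_rate),
        "polysome_accumulation": pearson_r(segregation_rate, polysome_rate),
    }
```

      Under this assumption, a larger coefficient for polysome accumulation than for cell elongation is what the argument in point (3) relies on; the correlation alone does not establish causality.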

      (4) The asymmetries in nucleoid compaction that we described in our paper are predicted by our model. We do not see how they could be explained by cell growth or protein secretion.

      (5) We also show that polysome accumulation at ectopic sites (outside the nucleoid) results in correlated nucleoid dynamics, consistent with our proposed mechanism. These nucleoid dynamics cannot be explained by cell growth or protein secretion (transertion).

      (1a) As rifampicin also stops all translation, it also stops translational insertion of membrane proteins, which in many old models has been put forward as a possible driver of nucleoid segregation, and perhaps independent of growth. This should at least be mentioned in the discussion, or if there are past experiments that rule this out it would be great to note them.

      It is not clear to us how the attachment of the DNA to the cytoplasmic membrane could alone create a directional force to move the sister nucleoids. We agree that old models have proposed a role for cell elongation (providing the force) and transertion (providing the membrane tether). Please see our response above for the evidence (from the literature and our work) against these models. This was mentioned in the Introduction and Results sections, but we agree that it was not well explained. We will add experimental data and revise the text to clarify these points.

      (1b) They address at great length in the discussion the possibility that growth may play a role in nucleoid segregation. However, this is testable - by stopping surface growth with antibiotics. Cells should still accumulate polysomes for some time, it would be easy to see if nucleoids are still segregated, and to what extent, thereby possibly decoupling growth and polysome production. If successful, this or similar experiments would further validate their model.

      We reviewed the literature and could not find a drug that stops cell growth without stopping gene expression. Any drug that affects membrane integrity or membrane potential stops gene expression, which requires ATP. However, our experiment in which we drive polysome accumulation at ectopic sites decouples polysome accumulation from cell growth. In this experiment, by redirecting most chromosomal gene expression to a single plasmid-encoded gene, we reduce the rate of cell growth but still create a large accumulation of polysomes at an ectopic location. This ectopic polysome accumulation is sufficient to affect nucleoid dynamics in a correlated fashion. In the revised manuscript, we will clarify this point and add model simulations to show that our experimental observations are predicted by our model.

      (2) In the second experiment, they express excess TagBFP2 to delocalize polysomes from midcell. Here they again see the anticorrelation of the nucleoid and the polysomes, and in some cells, it appears similar to normal (polysomes separating the nucleoid) whereas in others the nucleoid has not separated. The one concern about this data - and the differences between the "separated" and "non-separated" nuclei - is that the over-expression of TagBFP2 has a huge impact on growth, which may also have an indirect effect on DNA replication and termination in some of these cells. Could the authors demonstrate these cells contain 2 fully replicated DNA molecules that are able to segregate?

      We will perform the requested experiment.

      (3) What is not clearly stated and is needed in this paper is to explain how polysomes do (or could) "exert force" in this system to segregate the nucleoid: what a "compaction force" is by definition, and what mechanisms causes this to arise (what causes the "force") as the "compaction force" arises from new polysomes being added into the gaps between them caused by thermal motions.

      They state, "polysomes exert an effective force", and they note their model requires "steric effects (repulsion) between DNA and polysomes" for the polysomes to segregate, which makes sense. But this makes it unclear to the reader what is giving the force. As written, it is unclear if (a) these repulsions alone are making the force, or (b) is it the accumulation of new polysomes in the center by adding more "repulsive" material, the force causes the nucleoids to move. If polysomes are concentrated more between nucleoids, and the polysome concentration does not increase, the DNA will not be driven apart (as in the first case). However, in the second case (which seems to be their model), the addition of new material (new polysomes) into a sterically crowded space is not exerting force - it is filling in the gaps between the molecules in that region, space that needs to arise somehow (like via Brownian motion). In other words, if the polysome region is crowded with polysomes, space must be made between these polysomes for new polysomes to be inserted, and this space must be made by thermal (or ATP-driven) fluctuations of the molecules. Thus, if polysome accumulation drives the DNA segregation, it is not "exerting force", but rather the addition of new polysomes is iteratively rectifying gaps being made by Brownian motion.

      We apologize for the understandable confusion. In our picture, the polysomes and DNA (conceptually considered as small plectonemic segments) basically behave as dissolved particles. If these particles were noninteracting, they would simply mix. However, both polysomes and DNA segments are large enough to interact sterically. So as density increases, steric avoidance implies a reduced conformational entropy and thus a higher free energy per particle. We argue (based on Miangolarra et al. PNAS 2021 PMID: 34675077 and Xiang et al. Cell 2021 PMID: 34186018) that the demixing of polysomes and DNA segments occurs because DNA segments pack better with each other than they do with polysomes. This raises the free energy cost associated with DNA-polysome interactions compared to DNA-DNA interactions. We model this effect by introducing a term in the free energy, χ_np, which we refer to as a repulsion between DNA and polysomes, though as explained above it arises from entropic effects. At realistic cellular densities of DNA and polysomes, this repulsive interaction is strong enough to cause the DNA and polysomes to phase separate.

      This same density-dependent free energy that causes phase separation can also give rise to forces, just in the way that a higher pressure on one side of a wall can give rise to a net force on the wall. Indeed, the “compaction force” we refer to is fundamentally an osmotic pressure difference. At some stages during nucleoid segregation, the region of the cell between nucleoids has a higher polysome concentration, and therefore a higher osmotic pressure, than the regions near the poles. This results in a net poleward force on the sister nucleoids that drives their migration toward the poles. This migration continues until the osmotic pressure equilibrates. Therefore, both phase separation (due to the steric repulsion described above) and nonequilibrium polysome production and degradation (which creates the initial accumulation of polysomes around midcell) are essential ingredients for nucleoid segregation.
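      The pressure analogy can be written schematically as follows. This is a minimal one-dimensional sketch in illustrative notation that does not come from the manuscript: Π denotes the polysome osmotic pressure as a function of local polysome concentration c, and A is an assumed effective cross-sectional area over which the pressure acts on a nucleoid.

```latex
F_{\text{net}} \;\approx\; A\,\bigl[\,\Pi(c_{\text{mid}}) - \Pi(c_{\text{pole}})\,\bigr],
\qquad
F_{\text{net}} \to 0 \quad \text{when} \quad \Pi(c_{\text{mid}}) = \Pi(c_{\text{pole}}).
```

      Here c_mid and c_pole are the polysome concentrations between the sister nucleoids and near the pole, respectively. The force points toward the pole while c_mid exceeds c_pole and vanishes once the two pressures are equal, which corresponds to the end of migration described above.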

      This will be clarified in the revised text, with the support of additional simulation results.

      The authors use polysome accumulation and phase separation to describe what is driving nucleoid segregation. Both terms are accurate, but it might help the less physically inclined reader to have one term, or have what each of these means explicitly defined at the start. I say this most especially in terms of "phase separation", as the currently huge momentum toward liquid-liquid interactions in biology causes the phrase "phase separation" to often evoke a number of wider (and less defined) phenomena and ideas that may not apply here. Thus, a simple clear definition at the start might help some readers.

      Phase separation means that the DNA-polysome steric repulsion is strong enough to drive their demixing, which creates a compact nucleoid. As mentioned in a previous point, this effect is captured in the free energy by the χ_np term, which is an effective repulsion between DNA and polysomes, though as explained above it arises from entropic effects.
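      For readers less familiar with the term, the demixing criterion can be illustrated with a deliberately simplified calculation. The sketch below assumes a symmetric two-component Flory-Huggins-type (regular solution) free energy in which a single parameter chi plays the role of χ_np; this is an assumption made for illustration and is not the free energy of our model, which contains additional terms. The function names are hypothetical.

```python
import math


def _check_fraction(phi):
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")


def mixing_free_energy(phi, chi):
    """Free energy per site (in units of k_B T) of a symmetric binary mixture.

    phi is the volume fraction of one species; chi is the effective
    repulsion between the two species (stand-in for chi_np).
    """
    _check_fraction(phi)
    entropy_term = phi * math.log(phi) + (1.0 - phi) * math.log(1.0 - phi)
    return entropy_term + chi * phi * (1.0 - phi)


def free_energy_curvature(phi, chi):
    """Second derivative of mixing_free_energy with respect to phi."""
    _check_fraction(phi)
    return 1.0 / phi + 1.0 / (1.0 - phi) - 2.0 * chi


def demixes(phi, chi):
    """True when the uniform mixture is locally unstable (spinodal region)."""
    return free_energy_curvature(phi, chi) < 0.0
```

      In this toy picture, a 50:50 mixture stays mixed for chi below 2 and spontaneously demixes above it. The point is only that a sufficiently strong effective repulsion makes the mixed state unstable, which is the sense in which we use "phase separation".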

      In the revised manuscript, we will illustrate this with our theoretical model by initializing a cell with a diffuse nucleoid and low polysome concentration. For the sake of simplicity, we assume that the cell does not elongate. We observe that the DNA-polysome steric repulsion is sufficient to compact the nucleoid and place it at mid-cell.

      (4) Line 478. "Altogether, these results support the notion that ectopic polysome accumulation drives nucleoid dynamics". Is this right? Should it not read "results support the notion that ectopic polysome accumulation inhibits/redirects nucleoid dynamics"?

      We think that this is correct; the ectopic polysome accumulation drives nucleoid dynamics. In our theoretical model, we can introduce polysome production at fixed sources to mimic the experiments where ectopic polysome production is achieved by high plasmid expression (Fig. 6). The model is able to recapitulate the two main phenotypes observed in experiments. These new simulation results will be added to the revised manuscript.
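      The idea of a fixed polysome source can be illustrated with a stripped-down calculation. The sketch below keeps only three ingredients (localized production, diffusion, and uniform degradation) in one dimension and ignores the DNA and all steric interactions, so it is not our model and cannot reproduce nucleoid dynamics; the function name and all parameter values are arbitrary assumptions.

```python
import numpy as np


def steady_polysome_profile(source, diffusion=1.0, degradation=0.1,
                            dx=1.0, dt=0.1, n_steps=5000):
    """Integrate dp/dt = D d2p/dx2 - k p + s(x) with no-flux boundaries.

    source is the production rate s(x) on a 1D grid. The explicit Euler
    scheme below is run long enough to approach the steady state.
    """
    source = np.asarray(source, dtype=float)
    if diffusion * dt / dx**2 > 0.5:
        raise ValueError("dt too large for the explicit scheme to be stable")
    p = np.zeros_like(source)
    for _ in range(n_steps):
        # Copying the edge values enforces zero flux at both cell poles.
        padded = np.pad(p, 1, mode="edge")
        laplacian = (padded[2:] - 2.0 * p + padded[:-2]) / dx**2
        p = p + dt * (diffusion * laplacian - degradation * p + source)
    return p
```

      With a single source placed away from midcell, the resulting profile peaks at the source and decays on either side, giving a persistent ectopic accumulation. In the full model, it is the interaction of such an accumulation with the DNA that produces the nucleoid phenotypes.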

      (5) It would be helpful to clarify what happens as the RplA-GFP signal decreases at midcell in Figure 1- is the signal then increasing in the less "dense" parts of the cell? That is, (a) are the polysomes at midcell redistributing throughout the cell? (b) is the total concentration of polysomes in the entire cell increasing over time?

      It is a redistribution—the RplA-GFP signal remains constant in concentration from cell birth to division (Figure 1 – Figure Supplement 1E). This will be clarified in the revised text.

      (6) Line 154. "Cell constriction contributed to the apparent depletion of ribosomal signal from the mid-cell region at the end of the cell division cycle (Figure 1B-C and Movie S1)" - It would be helpful if when cell constriction began and ended was indicated in Figures 1B and C.

      Good idea. We will add markers to indicate the start of cell constriction. We will also indicate that cell birth and division correspond to the first and last images/timepoint in Fig. 1B and C, respectively.

      (7) In Figure 7 they demonstrate that radial confinement is needed for longitudinal nucleoid segregation. It should be noted (and cited) that past experiments of Bacillus l-forms in microfluidic channels showed a clear requirement for rod shape (and a given width) in the positioning and the spacing of the nucleoids.

      Wu et al, Nature Communications, 2020. "Geometric principles underlying the proliferation of a model cell system" https://dx.doi.org/10.1038/s41467-020-17988-7

      Good point. We will add this reference. Thank you.

      (8) "The correlated variability in polysome and nucleoid patterning across cells suggests that the size of the polysome-depleted spaces helps determine where the chromosomal DNA is most concentrated along the cell length. This patterning is likely reinforced through the displacement of the polysomes away from the DNA dense region"

      It should be noted this likely functions not just in one direction (polysomes dictating DNA location), but also in the reverse - as the footprint of compacted DNA should also exclude (and thus affect) the location of polysomes

      We agree that the effects could go both ways at this early stage of the story. We will revise the text accordingly.  

      (9) Line 159. "Rifampicin is a transcription inhibitor that causes polysome depletion over time. This indicates that all ribosomal enrichments consist of polysomes and therefore will be referred to as polysome accumulations hereafter". Here and throughout this paper they use the term polysome, but cells also have monosomes (and 2 somes, etc). Rifampicin stops the assembly of all of these, and thus the loss of localization could occur from both. Thus, is it accurate to state that all transcription events occur in polysomes? Or are they grouping all of the n-somes into one group?

      In the discussion, we noted that our term “polysomes” also includes monosomes for simplicity, but we agree that the term should have been defined much earlier. This will be done in the revised manuscript.

      Thank you for the valuable comments and suggestions!

      Reviewer #2 (Public review):

      Summary:

      The authors perform a remarkably comprehensive, rigorous, and extensive investigation into the spatiotemporal dynamics between ribosomal accumulation, nucleoid segregation, and cell division. Using detailed experimental characterization and rigorous physical models, they offer a compelling argument that nucleoid segregation rates are determined at least in part by the accumulation of ribosomes in the center of the cell, exerting a steric force to drive nucleoid segregation prior to cell division. This evolutionarily ingenious mechanism means cells can rely on ribosomal biogenesis as the sole determinant for the growth rate and cell division rate, avoiding the need for two separate 'sensors,' which would require careful coupling.

      Terrific summary! Thank you for your positive assessment.

      Strengths:

      In terms of strengths; the paper is very well written, the data are of extremely high quality, and the work is of fundamental importance to the field of cell growth and division. This is an important and innovative discovery enabled through a combination of rigorous experimental work and innovative conceptual, statistical, and physical modeling.

      Thank you!

      Weaknesses:

      In terms of weaknesses, I have three specific thoughts.

      Firstly, my biggest question (and this may or may not be a bona fide weakness) is how unambiguously the authors can be sure their ribosomal labeling is reporting on polysomes, specifically. My reading of the work is that the loss of spatial density upon rifampicin treatment is used to infer that spatial density corresponds to polysomes, yet this feels like a relatively indirect way to get at this question, given rifampicin targets RNA polymerase and not translation. It would be good if a more direct way to confirm polysome dependence were possible.

      The heterogeneity of ribosome distribution inside E. coli cells has been attributed to polysomes by many labs (PMID: 25056965, 38678067, 22624875, 31150626, 34186018, 10675340).  The attribution is also consistent with single-molecule tracking experiments showing that slow-moving ribosomes (polysomes) are excluded by the nucleoid whereas fast-diffusing ribosomes (free ribosomal subunits) are distributed throughout the cytoplasm (PMID: 25056965, 22624875).

      Furthermore, inhibition of translation initiation with kasugamycin treatment, which decreases the pool of polysomes, results in a homogenization of ribosomes and expansion of the nucleoid (see Author response image 1). This further supports the rifampicin experiments. Given that the attribution of ribosome heterogeneity to polysomes is well accepted in the field, we would prefer to not include these kasugamycin data in the revised manuscript because long-term exposure to this drug leads to nucleoid re-compaction (PMID: 25250841 and PMID: 34186018). This secondary effect may possibly be due to a dysregulated increase in synthesis of naked rRNAs (PMID: 14460744, PMID: 2114400, and PMID: 2448483) or ribosome aggregation, which we are currently investigating.

      Author response image 1.

      Effects of kasugamycin treatment on the intracellular distribution of ribosomes and nucleoids. Representative single cell (CJW7323) growing in M9gluCAAT. Kasugamycin (3 mg/mL) was added at time = 0 min. Shown is the early response (0-30 min) to the drug, characterized by the homogenization of the ribosomal RplA-GFP fluorescence and the expansion of the HupA-mCherry-labeled nucleoids. For each segmented cell, the RplA-GFP and HupA-mCherry signals were normalized by the average fluorescence.

      Second, the authors invoke a phase separation model to explain the data, yet it is unclear whether there is any particular evidence supporting such a model, whether they can exclude simpler models of entanglement/local diffusion (and/or perhaps this is what is meant by phase separation?) and it's not clear if claiming phase separation offers any additional insight/predictive power/utility. I am OK with this being proposed as a hypothesis/idea/working model, and I agree the model is consistent with the data, BUT I also feel other models are consistent with the data. I also very much do not think that this specific aspect of the paper has any bearing on the paper's impact and importance.

      We appreciate the reviewer’s comment, but the output of our reaction-diffusion model is a bona fide phase separation (spinodal decomposition). So, we feel that we need to use the term when reporting the modeling results. Inside the cell, the situation is more complicated. As the reviewer points out, there likely are entanglements (not considered in our model) and other important factors (please see our discussion on the model limitations). This said, we will revise our text to clarify our terms and proposed mechanism.

      Finally, the writing and the figures are of extremely high quality, but the sheer volume of data here is potentially overwhelming. I wonder if there is any way for the authors to consider stripping down the text/figures to streamline things a bit? I also think it would be useful to include visually consistent schematics of the question/hypothesis/idea each of the figures is addressing to help keep readers on the same page as to what is going on in each figure. Again, there was no figure or section I felt was particularly unclear, but the sheer volume of text/data made reading this quite the mental endurance sport! I am completely guilty of this myself, so I don't think I have any super strong suggestions for how to fix this, but just something to consider.

      We agree that there is a lot to digest. We will add schematics and a didactic simulation. We will also try to streamline the text.

      Reviewer #3 (Public review):

      Summary:

      Papagiannakis et al. present a detailed study exploring the relationship between DNA/polysome phase separation and nucleoid segregation in Escherichia coli. Using a combination of experiments and modelling, the authors aim to link physical principles with biological processes to better understand nucleoid organisation and segregation during cell growth.

      Strengths:

      The authors have conducted a large number of experiments under different growth conditions and physiological perturbations (using antibiotics) to analyse the biophysical factors underlying the spatial organisation of nucleoids within growing E. coli cells. A simple model of ribosome-nucleoid segregation has been developed to explain the observations.

      Weaknesses:

      While the study addresses an important topic, several aspects of the modelling, assumptions, and claims warrant further consideration.

      Thank you for your feedback. Please see below for a response to each concern. 

      Major Concerns:

      Oversimplification of Modelling Assumptions:

      The model simplifies nucleoid organisation by focusing on the axial (long-axis) dimension of the cell while neglecting the radial dimension (cell width). While this approach simplifies the model, it fails to explain key experimental observations, such as:

      (1) Inconsistencies with Experimental Evidence:

      The simplified model presented in this study predicts that translation-inhibiting drugs like chloramphenicol would maintain separated nucleoids due to increased polysome fractions. However, experimental evidence shows the opposite-separated nucleoids condense into a single lobe post-treatment (Bakshi et al 2014), indicating limitations in the model's assumptions/predictions. For the nucleoids to coalesce into a single lobe, polysomes must cross the nucleoid zones via the radial shells around the nucleoid lobes.

      We do not think that the results from chloramphenicol-treated cells are inconsistent with our model. Our proposed mechanism predicts that nucleoids will condense in the presence of chloramphenicol, consistent with experiments. It also predicts that nucleoids that were still relatively close at the time of chloramphenicol treatment could fuse if they eventually touched through diffusion (thermal fluctuation) to reduce their interaction with the polysomes and minimize their conformational energy. Fusion is, however, not expected for well-separated nucleoids since their diffusion is slow in the crowded cytoplasm. This is consistent with our experimental observations: In the presence of a growth-inhibitory concentration of chloramphenicol (70 μg/mL), nucleoids in relatively close proximity can fuse, but well-separated nucleoids condense and do not fuse. Since the growth rate inhibition is not immediate upon chloramphenicol treatment, many cells with well-separated condensed nucleoids divide during the first hour. As a result, the non-fusion phenotype is more obvious in non-dividing cells, achieved by pre-treating cells with the cell division inhibitor cephalexin (50 μg/mL). In these polyploid elongated cells, well-separated nucleoids condensed but did not fuse, not even after an hour in the presence of chloramphenicol (as illustrated in Author response image 2).

      In Bakshi et al, 2014, nucleoid fusion was shown for a single cell in which the sister nucleoids were relatively close to each other at the time of chloramphenicol treatment. Population statistics were provided for the relative length and width of the nucleoids, but not for the fusion events. So, it is unclear whether the illustrated fusion was universal or not. Also, we note that Bakshi et al (2014) used a chloramphenicol concentration of 300 μg/mL, which is 20-fold higher than the minimal inhibitory concentration for growth, compared to 70 μg/mL in our experiments.

      Author response image 2.

      Effects of chloramphenicol treatment on the intracellular distribution of ribosomes and nucleoids in non-dividing cells. Exponentially growing cells (M9glyCAAT at 30°C) were pre-treated with cephalexin for one hour before being spotted on a 1% agarose pad for time-lapse imaging. The agarose pad contained M9glyCAAT, cephalexin, and chloramphenicol. (A) Phase contrast, RplA-GFP fluorescence and HupA-mCherry fluorescence images of a representative single cell. Three timepoints are shown: the first image after spotting on the agarose pad (0 min), 30 minutes, and one hour of chloramphenicol treatment. (B) One-dimensional profiles of the ribosomal (RplA-GFP) and nucleoid (HupA-mCherry) fluorescence from the cell shown in panel A. These intensity profiles correspond to the average fluorescence along the medial axis of the cell, considering a 6-pixel region (0.4 μm) centered on the central line of the cell. The fluorescence intensity is plotted along the relative cell length, scaled from 0 to 100% between the two poles, illustrating the relative nucleoid length (L_DNA/L_cell) that was plotted by Bakshi et al in 2014 (PMID: 25250841).
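      The profile calculation described for panel B can be sketched as follows. This is a minimal illustration, not the analysis code used for the figure. It assumes that the cell image has already been straightened so that rows run across the cell width and columns run along the medial axis, and it normalizes by the mean of the profile (an assumption borrowed from the caption of Author response image 1). The function name is hypothetical.

```python
import numpy as np


def medial_axis_profile(straightened, band_px=6):
    """Average fluorescence in a band centered on the cell's central line.

    straightened: 2D array, rows = position across the cell width,
                  columns = position along the cell length (pole to pole).
    band_px:      band width in pixels (6 px is about 0.4 um in the caption).
    Returns (relative_length_percent, normalized_profile).
    """
    img = np.asarray(straightened, dtype=float)
    if img.ndim != 2:
        raise ValueError("expected a 2D image")
    n_rows, n_cols = img.shape
    if not 1 <= band_px <= n_rows:
        raise ValueError("band_px must be between 1 and the image height")
    start = (n_rows - band_px) // 2          # center the band on the midline
    band = img[start:start + band_px, :]
    profile = band.mean(axis=0)              # average across the band
    profile = profile / profile.mean()       # normalize by mean fluorescence
    relative_length = np.linspace(0.0, 100.0, n_cols)  # 0% and 100% = poles
    return relative_length, profile
```

      Calling the function once per channel (RplA-GFP and HupA-mCherry) on the same straightened cell gives two profiles on a common 0 to 100% axis, which is the form in which the curves in panel B are plotted.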

      (2) The peripheral localisation of nucleoids observed after A22 treatment in this study and others (e.g., Japaridze et al., 2020; Wu et al., 2019), which conflicts with the model's assumptions and predictions. The assumption of radial confinement would predict nucleoids to fill up the volume or ribosomes to go near the cell wall, not the nucleoid, as seen in the data.

      The reviewer makes a good point that DNA attachment to the membrane through transertion likely contributes to the nucleoid being peripherally localized in A22 cells. We will revise the text to add this point. However, we do not think that this contradicts the proposed nucleoid segregation mechanism based on phase separation and out-of-equilibrium dynamics described in our model. On the contrary, by attaching the nucleoid to the cytoplasmic membrane along the cell width, transertion might help reduce the diffusion and thus exchange of polysomes across nucleoids. We will revise the text to discuss transertion over radial confinement.

      (3) The radial compaction of the nucleoid upon rifampicin or chloramphenicol treatment, as reported by Bakshi et al. (2014) and Spahn et al. (2023), also contradicts the model's predictions. This is not expected if the nucleoid is already radially confined.

      We originally evoked radial confinement to explain the observation that polysome accumulations do not equilibrate between DNA-free regions. We agree that transertion is an alternative explanation. Thank you for bringing it to our attention. However, please note that this does not contradict the model. In our view, it actually supports the 1D model by providing a reasonable explanation for the slow exchange of polysomes across DNA-free regions. The attachment of the nucleoid to the membrane along the cell width may act as a diffusion barrier. We will revise the text and the title of the manuscript accordingly.

      (4) Radial Distribution of Nucleoid and Ribosomal Shell:

      The study does not account for well-documented features such as the membrane attachment of chromosomes and the ribosomal shell surrounding the nucleoid, observed in super-resolution studies (Bakshi et al., 2012; Sanamrad et al., 2014). These features are critical for understanding nucleoid dynamics, particularly under conditions of transcription-translation coupling or drug-induced detachment. Work by Yongren et al. (2014) has also shown that the radial organisation of the nucleoid is highly sensitive to growth and the multifork nature of DNA replication in bacteria.

      We will discuss the membrane attachment. Please see the previous response.

      The omission of organisation in the radial dimension and the entropic effects it entails, such as ribosome localisation near the membrane and nucleoid centralisation in expanded cells, undermines the model's explanatory power and predictive ability. Some observations have been previously explained by the membrane attachment of nucleoids (a hypothesis proposed by Rabinovitch et al., 2003, and supported by experiments from Bakshi et al., 2014, and recent super-resolution measurements by Spahn et al.).

      We agree—we will add a discussion about membrane attachment in the radial dimension. See previous responses.

      Ignoring the radial dimension and membrane attachment of nucleoid (which might coordinate cell growth with nucleoid expansion and segregation) presents a simplistic but potentially misleading picture of the underlying factors.

      As mentioned above, we will discuss membrane attachment in the revised manuscript.

      This reviewer suggests that the authors consider an alternative mechanism, supported by strong experimental evidence, as a potential explanation for the observed phenomena:

      Nucleoids may transiently attach to the cell membrane, possibly through transertion, allowing for coordinated increases in nucleoid volume and length alongside cell growth and DNA replication. Polysomes likely occupy cellular spaces devoid of the nucleoid, contributing to nucleoid compaction due to mutual exclusion effects. After the nucleoids separate following ter separation, axial expansion of the cell membrane could lead to their spatial separation.

      This “membrane attachment/cell elongation” model is reminiscent of the hypothesis proposed by Jacob et al. in 1963 (doi:10.1101/SQB.1963.028.01.048). There are several lines of evidence arguing against it as the major driver of nucleoid segregation:

      (Below is a slightly modified version of our response to a comment from Reviewer 1—see page 3)

      (1) For this alternative model to work, axial membrane expansion (i.e., cell elongation) would have to be localized at the middle of the splitting nucleoids (i.e., midcell position for slow growth and ¼ and ¾ cell positions for fast growth) to create a directional motion. To our knowledge, there is no evidence of such localized membrane incorporation. Furthermore, even if membrane growth were localized at the right places, the fluidity of the cytoplasmic membrane (PMID: 6996724, 20159151, 24735432, 27705775) would be problematic. To get around this fluidity issue, one could invoke a connection to the rigid peptidoglycan, but then again, peptidoglycan growth would have to be localized at the middle of the splitting nucleoid to “push” the sister nucleoids apart from each other. However, peptidoglycan growth is dispersed prior to cell constriction (PMID: 35705811, 36097171, 2656655).

      (2) Even if we ignore the aforementioned caveats, Paul Wiggins’s group ruled out the cell elongation/transertion model by showing that the rate of cell elongation is slower than the rate of chromosome segregation (PMID: 23775792). In the revised manuscript, we will provide additional data showing that the cell elongation rate is indeed slower than the nucleoid segregation rate.

      (3) Furthermore, our correlation analysis comparing the rate of nucleoid segregation to the rate of either cell elongation or polysome accumulation argues that polysome accumulation plays a larger role than cell elongation in nucleoid segregation. These data were already shown in the original manuscript (Figure 1I and Figure 1 – figure supplement 3) but were not highlighted in this context. We will revise the text to clarify this point.

      (4) The membrane attachment/cell elongation model does not explain the nucleoid asymmetries described in our paper (Figure 3), whereas they can be recapitulated by our model.

      (5) The cell elongation/transertion model cannot predict the aberrant nucleoid dynamics observed when chromosomal expression is largely redirected to plasmid expression. In the revised manuscript, we will add simulation results showing that these nucleoid dynamics are predicted by our model.

      In light of these arguments, we do not believe that a mechanism based on membrane attachment and cell elongation is the major driver of nucleoid segregation. However, we do believe that it may play a complementary role (see “Nucleoid segregation likely involves multiple factors” in the Discussion). We will revise this section to clarify our thoughts and mention the potential role of transertion.

      Incorporating this perspective into the discussion or future iterations of the model may provide a more comprehensive framework that aligns with the experimental observations in this study and previous work.

      As noted above, we will revise the text to mention transertion.

      Simplification of Ribosome States:

      Combining monomeric and translating ribosomes into a single 'polysome' category may overlook spatial variations in these states, particularly during ribosome accumulation at the mid-cell. Without validating uniform mRNA distribution or conducting experimental controls such as FRAP or single-molecule measurements to estimate the proportions of ribosome states based on diffusion, this assumption remains speculative.

      Indeed, for simplicity, we adopt an average description of all polysomes with an average diffusion coefficient and interaction parameters, which is sufficient for capturing the fundamental mechanism underlying nucleoid segregation. To illustrate that considering multiple polysome species does not change the physical picture, we consider an extension of our model, which contains three polysome species, each with a different diffusion coefficient (D<sub>P</sub> = 0.018, 0.023, or 0.028 μm<sup>2</sup>/s), reflecting that polysomes with more ribosomes will have a lower diffusion coefficient. Simulation of this model reveals that the different polysome species have essentially the same concentration distribution, suggesting that the average description in our minimal model is sufficient for our purposes. We will present these new simulation results in the revised manuscript.
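As a rough order-of-magnitude check, the three diffusion coefficients quoted above can be converted into characteristic free-diffusion times using the one-dimensional relation <x²> = 2Dt. This is only a sketch: the 1 μm length scale is a hypothetical choice, and free diffusion ignores the polysome-DNA interactions that the model itself includes.

```python
def diffusion_time_1d(length_um, d_um2_per_s):
    """Characteristic time (s) to diffuse `length_um` in 1D, from <x^2> = 2 D t."""
    return length_um ** 2 / (2.0 * d_um2_per_s)


# Diffusion coefficients of the three polysome species quoted in the response.
D_VALUES = (0.018, 0.023, 0.028)  # um^2/s
# Hypothetical length scale, chosen only for illustration.
LENGTH_UM = 1.0

times = {d: diffusion_time_1d(LENGTH_UM, d) for d in D_VALUES}
spread = max(times.values()) / min(times.values())
```

Under these assumptions the slowest and fastest species differ by a factor of about 1.6 in their characteristic times, which is consistent with, though not proof of, the simulated species having similar concentration distributions.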

    1. Author response:

      eLife assessment

      This study provides valuable information on the mechanism of PepT2 through enhanced-sampling molecular dynamics, backed by cell-based assays, highlighting the importance of protonation of selected residues for the function of a proton-coupled oligopeptide transporter (hsPepT2). The molecular dynamics approaches are convincing, but with limitations that could be addressed in the manuscript, including lack of incorporation of a protonation coordinate in the free energy landscape, possibility of protonation of the substrate, errors with the chosen constant pH MD method for membrane proteins, dismissal of hysteresis emerging from the MEMENTO method, and the likelihood of other residues being affected by peptide binding. Some changes to the presentation could be considered, including a better description of pKa calculations and the inclusion of error bars in all PMFs. Overall, the findings will appeal to structural biologists, biochemists, and biophysicists studying membrane transporters.

      We would like to express our gratitude to the reviewers for providing their feedback on our manuscript, and also for recognising the variety of computational methods employed, the amount of sampling collected and the experimental validation undertaken. Following the individual reviewer comments, as addressed point-by-point below, we will shortly prepare a revised version of this paper. Intended changes to the revised manuscript are marked up in bold font in the detailed responses below, but before that we address some of the comments made above in the general assessment:

      • “lack of incorporation of a protonation coordinate in the free energy landscape”. We acknowledge that of course it would be highly desirable to treat protonation state changes explicitly and fully coupled to conformational changes. However, at this point in time, evaluating such a free energy landscape is not computationally feasible (especially considering that the non-reactive approach taken here already amounts to almost 1 ms of total sampling time). Previous reports in the literature tend to focus on either simpler systems or a reduced subset of a larger problem. As we were trying to obtain information on the whole transport cycle, we decided to focus here on non-reactive methods.

      • “possibility of protonation of the substrate”. The reviewers are correct in pointing out this possibility, which we had not discussed explicitly in our manuscript. Briefly, while we describe a mechanism in which protonation of only protein residues (with an unprotonated ligand) can account for driving all the necessary conformational changes of the transport cycle, there is some evidence for a further intermediate protonation site in our data (as we commented on in the first version of the manuscript as well), which may or may not be the substrate itself. A future explicit treatment of the proton movements through the transporter, when it will become computationally tractable to do so, will have to include the substrate as a possible protonation site; for the present moment, we will amend our discussion to alert the reader to the possibility that the substrate could be an intermediate to proton transport. This has repercussions for our study of the E56 pKa value, where – if protons reside with a significant population at the substrate C-terminus – our calculated shift in pKa upon substrate binding could be an overestimate, although we would qualitatively expect the direction of shift to be unaffected. However, we also anticipate that treating this potential coupling explicitly would make convergence of any CpHMD calculation impractical to achieve and thus it may be the case that for now only a semi-quantitative conclusion is all that can be obtained.

      • “errors with the chosen constant pH MD method for membrane proteins”. We acknowledge that – as reviewer #1 has reminded us – the AMBER implementation of hybrid-solvent CpHMD is not rigorous for membrane proteins, and as such we will add a cautionary note to our paper. We will also explain how the use of the ABFE thermodynamic cycle calculations helps to validate the CpHMD results in a completely orthogonal manner (we will promote this validation which was in the supplementary figures into the main text in the revised version). We therefore remain reasonably confident in the results presented with regards to the reported pKa shift of E56 upon substrate binding, and suggest that if the impact of neglecting the membrane in the implicit-solvent stage of CpHMD is significant, then there is likely an error cancellation when considering shifts induced by the incoming substrate.

      • “dismissal of hysteresis emerging from the MEMENTO method”. We have shown in our method design paper how the use of the MEMENTO method drastically reduces hysteresis compared to steered MD and metadynamics for path generation, and find this improvement again for PepT2 in this study. We will address reviewer #3’s concern about our presentation on this point by revising our introduction of the MEMENTO method, as detailed in the response below.

      • “the likelihood of other residues being affected by peptide binding”. In this study, we have investigated in detail the involvement of several residues in proton-coupled di-peptide transport by PepT2. Short of the potential intermediate protonation site mentioned above, the set of residues we investigate form a minimal set of sorts within which the important driving forces of alternating access can be rationalised. We have not investigated in substantial detail here the residues involved in holding the peptide in the binding site, as they are well studied in the literature and ligand promiscuity is not the problem of interest here. It remains entirely possible that further processes contribute to the mechanism of driving conformational changes by involving other residues not considered in this paper. We will make our speculation that an ensemble of different processes may be contributing simultaneously more explicit in our revision, but do not believe any of our conclusions would be affected by this.

      As for the additional suggested changes in presentation, we will provide the requested details on the CpHMD analysis. Furthermore, we will use the convergence data presented separately in figures S12 and S16 to include error bars on our 1D-reprojections of the 2D-PMFs in figures 3, 4 and 5. (Note that we will opt not to do so in figures S10 and S15, which collate all 1D PMF reprojections for the OCC ↔ OF and OCC ↔ IF transitions in single reference plots, respectively, to avoid overcrowding those necessarily busy figures.) We are also changing the colour schemes of these plots in our revision to improve accessibility.

      Reviewer #1 (Public Review):

      The authors have performed all-atom MD simulations to study the working mechanism of hsPepT2. It is widely accepted that conformational transitions of proton-coupled oligopeptide transporters (POTs) are linked with gating hydrogen bonds and salt bridges involving protonatable residues, whose protonation triggers gate openings. Through unbiased MD simulations, the authors identified extra-cellular (H87 and D342) and intra-cellular (E53 and E622) triggers. The authors then validated these triggers using free energy calculations (FECs) and assessed the engagement of the substrate (Ala-Phe dipeptide). The linkage of substrate release with the protonation of the ExxER motif (E53 and E56) was confirmed using constant-pH molecular dynamics (CpHMD) simulations and cell-based transport assays. An alternating-access mechanism was proposed. The study was largely conducted properly, and the paper was well-organized. However, I have a couple of concerns for the authors to consider addressing.

      We would like to note here that it may be slightly misleading to the reader to state that “The linkage of substrate release with the protonation of the ExxER motif (E53 and E56) was confirmed using constant-pH molecular dynamics (CpHMD) simulations and cell-based transport assays.” The cell-based transport assays confirmed the importance of the extracellular gating trigger residues H87, S321 and D342 (as mentioned in the preceding sentence), not of the substrate-protonation link as this line might be understood to suggest.

      (1) As a proton-coupled membrane protein, the conformational dynamics of hsPepT2 are closely coupled to protonation events of gating residues. Instead of using semi-reactive methods like CpHMD or reactive methods such as reactive MD, where the coupling is accounted for, the authors opted for extensive non-reactive regular MD simulations to explore this coupling. Note that I am not criticizing the choice of methods, and I think those regular MD simulations were well-designed and conducted. But I do have two concerns.

      a) Ideally, proton-coupled conformational transitions should be modelled using a free energy landscape with two or more reaction coordinates (or CVs), with one describing the protonation event and the other describing the conformational transitions. The minimum free energy path then illustrates the reaction progress, such as OCC/H87D342- → OCC/H87HD342H → OF/H87HD342H as displayed in Figure 3.

      We concur with the reviewer that the ideal way of describing the processes studied in our paper would be as higher-dimensional free energy landscapes obtained from a simulation method that can explicitly model proton-transfer processes. Indeed, it would have been particularly interesting and potentially informative with regards to the movement of protons down into the transporter in the OF → OCC → IF sequence of transitions. As we note in our discussion on the H87→E56 proton transfer:

      “This could be investigated using reactive MD or QM/MM simulations (both approaches have been employed for other protonation steps of prokaryotic peptide transporters, see Parker et al. (2017) and Li et al. (2022)). However, the putative path is very long (≈ 1.7 nm between H87 and E56) and may or may not involve a large number of intermediate protonatable residues, in addition to binding site water. While such an investigation is possible in principle, it is beyond the scope of the present study.”

      Where even sampling the proton transfer step itself in an essentially static protein conformation would be pushing the boundaries of what has been achieved in the field, we believe that considering the current state-of-the-art, a fully coupled investigation of large-scale conformational changes and proton-transfer reaction is not yet feasible in a realistic/practical time frame. We also note this limitation already when we say that:

      “The question of whether proton binding happens in OCC or OF warrants further investigation, and indeed the co-existence of several mechanisms may be plausible here”.

      Nonetheless, we are actively exploring approaches to treat uptake and movement of protons explicitly for future work.

      In our revision, we will expand on our discussion of the reasoning behind employing a non-reactive approach and the limitations that imposes on what questions can be answered in this study.

      Without including the protonation as a CV, the authors tried to model the free energy changes from multiple FECs using different charge states of H87 and D342. This is a practical workaround, and the conclusion drawn (the OCC→ OF transition is downhill with protonated H87 and D342) seems valid. However, I don't think the OF states with different charge states (OF/H87D342-, OF/H87HD342-, OF/H87D342H, and OF/H87HD342H) are equally stable, as plotted in Figure 3b. The concern extends to other cases like Figures 4b, S7, S10, S12, S15, and S16. While it may be appropriate to match all four OF states in the free energy plot for comparison purposes, the authors should clarify this to ensure readers are not misled.

      The reviewer is correct in their assessment that the aligning of PMFs in these figures is arbitrary; no relative free energies of the PMFs to each other can be estimated without explicit free energy calculations at least of protonation events at the end state basins. The PMFs in our figures are merely superimposed for illustrating the differences in shape between the obtained profiles in each condition, as discussed in the text, and we will make this clear in the appropriate figure captions in our revision.
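The arbitrariness of the alignment can be made concrete with a small sketch: shifting each profile so that a chosen reference basin sits at zero changes where the curves lie relative to one another, but leaves the shape of each profile (and hence its barrier heights) unchanged. The PMF values below are hypothetical numbers, not data from the paper, and the choice of the last point as the reference basin is an assumption made for illustration.

```python
def align_pmf(pmf, ref_index):
    """Shift a 1D PMF so that the value at `ref_index` becomes zero."""
    ref = pmf[ref_index]
    return [value - ref for value in pmf]


def barrier_height(pmf, basin_index):
    """Height of the highest point of the profile above a given basin."""
    return max(pmf) - pmf[basin_index]


# Hypothetical profiles (kJ/mol) for two protonation conditions; the last
# point stands in for the shared reference end state.
pmf_standard = [3.0, 7.5, 5.0]
pmf_protonated = [10.0, 11.0, 6.0]

aligned = [align_pmf(p, ref_index=-1) for p in (pmf_standard, pmf_protonated)]
```

After alignment both profiles end at zero, yet nothing has been learned about the relative free energies of the two conditions; only differences within each profile carry meaning.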

      b) Regarding the substrate impact, it appears that the authors assumed fixed protonation states. I am afraid this is not necessarily the case. Variations in PepT2 stoichiometry suggest that substrates likely participate in proton transport, like the Phe-Ala (2:1) and Phe-Gln (1:1) dipeptides mentioned in the introduction. And it is not rigorous to assume that the N- and C-termini of a peptide do not protonate/deprotonate when transported. I think the authors should explicitly state that the current work and the proposed mechanism (Figure 8) are based on the assumption that the substrates do not uptake/release proton(s).

      This is indeed an assumption inherent in the current work. While we do “speculate that the proton movement processes may happen as an ensemble of different mechanisms, and potentially occur contemporaneously with the conformational change” we do not in the current version indicate explicitly that this may involve the substrate. We will make clear the assumption and this possibility in the revised version of our paper. Indeed, as we discuss, there is some evidence in our PMFs of an additional protonation site not considered thus far, which may or may not be the substrate. We will make note of this point in the revised manuscript.

      As for what information can be drawn from the given experimental stoichiometries, we note in our paper that “a 2:1 stoichiometry was reported for the neutral di-peptide D-Phe-L-Ala and 3:1 for anionic D-Phe-L-Glu. (Chen et al., 1999) Alternatively, Fei et al. (1999) have found 1:1 stoichiometries for either of D-Phe-L-Gln (neutral), D-Phe-L-Glu (anionic), and D-Phe-L-Lys (cationic).”

      We do not assume that it is our place to arbitrate among the apparent discrepancies in the experimental data here, although we believe that our assumed 2:1 stoichiometry is additionally “motivated also by our computational results that indicate distinct and additive roles played by two protons in the conformational cycle mechanism”.

      (2) I have more serious concerns about the CpHMD employed in the study.

      a) The CpHMD in AMBER is not rigorous for membrane simulations. The underlying generalized Born model fails to consider the membrane environment when updating charge states. In other words, the CpHMD places a membrane protein in a water environment to judge if changes in charge states are energetically favorable. While this might not be a big issue for peripheral residues of membrane proteins, it is likely unphysical for internal residues like the ExxER motif. As I recall, the developers have never used the method to study membrane proteins themselves. The only CpHMD variant suitable for membrane proteins is the membrane-enabled hybrid-solvent CpHMD in CHARMM. While I do not expect the authors to redo their CpHMD simulations, I do hope the authors recognize the limitations of their method.

      We will discuss the limitations of the AMBER CpHMD implementation in the revised version. However, despite that, we believe we have in fact provided sufficient grounds for our conclusion that substrate binding affects ExxER motif protonation in the following way:

      In addition to CpHMD simulations, we establish the same effect via ABFE calculations, where the substrate affinity is different at the E56 deprotonated vs protonated protein. This is currently figure S20, though in the revised version we will move this piece of validation into a new panel of figure 6 in the main text, since it becomes more important with the CpHMD membrane problem in mind. Since the ABFE calculations are conducted with an all-atom representation of the lipids and the thermodynamic cycle closes well, it would appear that if the chosen CpHMD method has a systematic error of significant magnitude for this particular membrane protein system, there may be the benefit of error cancellation. While the calculated absolute pKa values may not be reliable, the difference made by substrate binding appears to be so, as judged by the orthogonal ABFE technique.
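The link between the ABFE results and a pKa shift follows from closing the thermodynamic cycle: if the substrate binds with free energy ΔG(protonated) to the protonated protein and ΔG(deprotonated) to the deprotonated protein, then pKa(holo) − pKa(apo) = −[ΔG(protonated) − ΔG(deprotonated)] / (RT ln 10). The sketch below applies this standard relation; the binding free energies and the temperature are hypothetical placeholders, not values from the paper.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)


def pka_shift_from_binding(dg_bind_prot, dg_bind_deprot, temperature_k=300.0):
    """pKa(holo) - pKa(apo) implied by a closed thermodynamic cycle.

    Both binding free energies are in kcal/mol. Stronger binding to the
    protonated form (more negative dg_bind_prot) gives a positive shift.
    """
    ddg = dg_bind_prot - dg_bind_deprot
    return -ddg / (R_KCAL * temperature_k * math.log(10.0))


# Hypothetical example: substrate binds 1 kcal/mol more tightly when the
# residue of interest is protonated.
shift = pka_shift_from_binding(dg_bind_prot=-8.0, dg_bind_deprot=-7.0)
```

In this illustration a 1 kcal/mol preference for the protonated form corresponds to an upward pKa shift of roughly 0.7 units at 300 K, which is the kind of consistency check an orthogonal ABFE calculation can provide for a CpHMD-derived shift.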

      Although the reviewer does “not expect the authors to redo their CpHMD simulations”, we consider that it may be helpful to the reader to share in this response some results from trials using the continuous, all-atom constant pH implementation that has recently become available in GROMACS (Aho et al 2022, https://pubs.acs.org/doi/10.1021/acs.jctc.2c00516) and can be used rigorously with membrane proteins, given its all-atom lipid representation.

      Unfortunately, when trying to titrate E56 in this CpHMD implementation, we found few protonation-state transitions taking place, and the system often got stuck in protonation state–local conformation coupled minima (which need to interconvert through rearrangements of the salt bridge network involving slow side-chain dihedral rotations in E53, E56 and R57). Author response image 1 shows this for the apo OF state; Author response image 2 shows how noisy attempts at pKa estimation from this data turn out to be, necessitating the use of a hybrid-solvent method.

      Author response image 1.

      All-atom CpHMD simulations of apo-OF PepT2. Red indicates protonated E56, blue is deprotonated.

      Author response image 2.

      Difficulty in calculating the E56 pKa value from the noisy all-atom CpHMD data shown in Author response image 1

      b) It appears that the authors did not make the substrate (Ala-Phe dipeptide) protonatable in holo-simulations. This oversight prevents a complete representation of ligand-induced protonation events, particularly given that the substrate ion pairs with hsPepT2 through its N- & C-termini. I believe it would be valuable for the authors to acknowledge this potential limitation.

      In this study, we implicitly assumed from the outset that the substrate does not get protonated, which, as noted in our response to the comment above, we will acknowledge explicitly in revision. This potential limitation for the available mechanisms for proton transfer also applies to our investigation of the ExxER protonation states. In particular, a semi-grand canonical ensemble that takes into account the possibility of substrate C-terminus protonation may also sample states in which the substrate is protonated and oriented away from R57, thus leaving the ExxER salt bridge network in an apo-like state. The consequence would be that while the direction of shift in E56 pKa value will be the same, our CpHMD may overestimate its magnitude. It would thus be interesting to make the C-terminus protonatable for obtaining better quantitative estimates of the E56 pKa shift (as is indeed true in general for any other protein protonatable residue, though the effects are usually assumed to be negligible). We do note, however, that convergence of the CpHMD simulations would be much harder if the slow degree of freedom of substrate reorientation (which in our experience takes 10s to 100s of ns in this binding pocket) needs to be implicitly equilibrated upon protonation state transitions. We will discuss such considerations in the revision.

      Reviewer #2 (Public Review):

      This is an interesting manuscript that describes a series of molecular dynamics studies on the peptide transporter PepT2 (SLC15A2). They examine, in particular, the effect on the transport cycle of protonation of various charged amino acids within the protein. They then validate their conclusions by mutating two of the residues that they predict to be critical for transport in cell-based transport assays. The study suggests a series of protonation steps that are necessary for transport to occur in PepT2. Comparison with bacterial proteins from the same family shows that while the overall architecture of the proteins and likely mechanism are similar, the residues involved in the mechanism may differ.

      Strengths:

      This is an interesting and rigorous study that uses various state-of-the-art molecular dynamics techniques to dissect the transport cycle of PepT2 with nearly 1ms of sampling. It gives insight into the transport mechanism, investigating how the protonation of selected residues can alter the energetic barriers between various states of the transport cycle. The authors have, in general, been very careful in their interpretation of the data.

      Weaknesses:

      Interestingly, they suggest that there is an additional protonation event that may take place as the protein goes from occluded to inward-facing but they have not identified this residue.

      We have indeed suggested that there may be an additional protonation site involved in the conformational cycle that we have not been able to capture, which – as we discuss in our paper – might be indicated by the shapes of the OCC ↔ IF PMFs given in Figure S15. One possibility is for this to be the substrate itself (see the response to reviewer #1 above) though within the scope of this study the precise pathway by which protons move down the transporter and the exact ordering of conformational change and proton transfer reactions remains a (partially) open question. We acknowledge this and denote it with question marks in the mechanistic overview we give in Figure 8, and also “speculate that the proton movement processes may happen as an ensemble of different mechanisms, and potentially occur contemporaneously with the conformational change”.

      Some things are a little unclear. For instance, where does the state that they have defined as occluded sit on the diagram in Figure 1a? - is it truly the occluded state as shown on the diagram or does it tend to inward- or outward-facing?

      Figure 1a is a simple schematic overview intended to show which structures of PepT2 homologues are available to use in simulations. This was not meant to be a quantitative classification of states. Nonetheless, we can note that the OCC state we derived has extra- and intracellular gate opening distances (as measured by the simple CVs defined in the methods and illustrated in Figure 2a) that indicate full gate closure at both sides. In particular, although it was derived from the IF state via biased sampling, the intracellular gate opening distance in the OCC state used for our conformational change enhanced sampling was comparable to that of the OF state (i.e., full closure of the gate); see Figure S2b and the grey bars therein. Therefore, we would schematically classify the OCC state to lie at the center of the diagram in Figure 1a. Furthermore, it is largely stable over triplicates of 1 μs-long unbiased MD, where in 2/3 replicates the gates remain stable, and in the remaining replicate there is partial opening of the intracellular gate (as shown in Figure 2 b/c under the “apo standard” condition). We comment on this in the main text by saying that “The intracellular gate, by contrast, is more flexible than the extracellular gate even in the apo, standard protonation state”, and link it to the lower barrier for transition to IF than to OF. We did this by saying that “As for the OCC↔OF transitions, these results explain the behaviour we had previously observed in the unbiased MD of Figure 2c.” We acknowledge this was not sufficiently clear and will add details to the latter sentence in revision to help clarify better the nature of the occluded state.

      The pKa calculations and their interpretation are a bit unclear. Firstly, it is unclear whether they are using all the data in the calculations of the histograms, or just selected data and if so on what basis was this selection done. Secondly, they dismiss the pKa calculations of E53 in the outward-facing form as not being affected by peptide binding but say that E56 is when there seems to be a similar change in profile in the histograms.

      In our manuscript, we have provided two distinct analyses of the raw CpHMD data. Firstly, we analysed the data by the replicates in which our simulations were conducted (Figure 6, shown as bar plots with mean from triplicates +/- standard deviation), where we found that only the effect on E56 protonation was distinct as lying beyond the combined error bars. This analysis uses the full amount of sampling conducted for each replicate. However, since we found that the range of pKa values estimated from 10ns/window chunks was larger than the error bars obtained from the replicate analysis (Figures S17 and S18), we sought to verify our conclusion by pooling all chunk estimates and plotting histograms (Figure S19). We recover from those the effect of substrate binding on the E56 protonation state on both the OF and OCC states. However, as the reviewer has pointed out (something we did not discuss in our original manuscript), there is a shift in the pKa of E53 of the OF state only. In fact, the trend is also apparent in the replicate-based analysis of Figure 6, though here the larger error bars overlap. In our revision, we will add more details of these analyses for clarity (including more detailed figure captions regarding the data used in Figure 6) as well as a discussion of the partial effect on the E53 pKa value.

      We do not believe, however, that our key conclusions are negatively affected. If anything, a further effect on the E53 pKa which we had not previously commented on (since we saw the evidence as weaker, pertaining to only one conformational state) would strengthen the case for an involvement of the ExxER motif in ligand coupling.

      Reviewer #3 (Public Review):

      Summary:

      Lichtinger et al. have used an extensive set of molecular dynamics (MD) simulations to study the conformational dynamics and transport cycle of an important member of the proton-coupled oligopeptide transporters (POTs), namely SLC15A2 or PepT2. This protein is one of the most well-studied mammalian POT transporters that provides a good model with enough insight and structural information to be studied computationally using advanced enhanced sampling methods employed in this work. The authors have used microsecond-level MD simulations, constant-pH MD, and alchemical binding free energy calculations along with cell-based transport assay measurements; however, the most important part of this work is the use of enhanced sampling techniques to study the conformational dynamics of PepT2 under different conditions.

      The study attempts to identify links between conformational dynamics and chemical events such as proton binding, ligand-protein interactions, and intramolecular interactions. The ultimate goal is of course to understand the proton-coupled peptide and drug transport by PepT2 and homologous transporters in the solute carrier family.

      Some of the key results include:

      (1) Protonation of H87 and D342 initiates the occluded (Occ) to the outward-facing (OF) state transition.

      (2) In the OF state, through engaging R57, substrate entry increases the pKa value of E56 and thermodynamically facilitates the movement of protons further down.

      (3) E622 is not only essential for peptide recognition but also its protonation facilitates substrate release and contributes to the intracellular gate opening. In addition, cell-based transport assays show that mutation of residues such as H87 and D342 significantly decreases transport activity as expected from simulations.

      Strengths:

      (1) This is an extensive MD-based study of PepT2, which is beyond the typical MD studies both in terms of the sheer volume of simulations as well as the advanced methodology used. The authors have not limited themselves to one approach and have appropriately combined equilibrium MD with alchemical free energy calculations, constant-pH MD, and geometry-based free energy calculations. Each of these 4 methods provides a unique insight regarding the transport mechanism of PepT2.

      (2) The authors have not limited themselves to computational work and have performed experiments as well. The cell-based transport assays clearly establish the importance of the residues that have been identified as significant contributors to the transport mechanism using simulations.

      (3) The conclusions made based on the simulations are mostly convincing and provide useful information regarding the proton pathway and the role of important residues in proton binding, protein-ligand interaction, and conformational changes.

      Weaknesses:

      (1) Some of the statements made in the manuscript are not convincing and do not abide by the standards that are mostly followed in the manuscript. For instance, on page 4, it is stated that "the K64-D317 interaction is formed in only ≈ 70% of MD frames and therefore is unlikely to contribute much to extracellular gate stability." I do not agree that 70% is negligible. Particularly, Figure S3 does not include the time series so it is not clear whether the 30% of the time where the salt bridge is broken is in the beginning or the end of simulations. For instance, it is likely that the salt bridge is not initially present and then it forms very strongly. Of course, this is just one possible scenario but the point is that Figure S3 does not rule out the possibility of a significant role for the K64-D317 salt bridge.

      The reviewer is right to point out that the statement and Figure S3 as they stand do not adequately support our decision to exclude the K64-D317 salt-bridge in our further investigations. The violin plot shown in Figure S3, visualised as pooled data from unbiased 1 μs triplicates, does indeed not rule out a scenario where the salt bridge only formed late in our simulations (or only in some replicates), but then is stable. Therefore, in our revision, we will include the appropriate time-series of the salt bridge distances, showing how K64-D317 is initially stable but then falls apart in replicate 1, and is transiently formed and disengaged across the trajectories in replicates 2 and 3. We will also remake the data for this plot as we discovered a bug in the relevant analysis script that meant the D170-K642 distance was not calculated accurately. The results are however almost identical, and our conclusions remain.
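
      The difference between a pooled occupancy and a time-resolved one can be illustrated with a short sketch. This is not the analysis script referred to above. The 4.0 Å distance cutoff, the window length, and the function names are assumptions chosen only for illustration.

```python
def occupancy(distances, cutoff=4.0):
    """Fraction of frames in which the salt-bridge distance is within the cutoff."""
    formed = sum(1 for d in distances if d <= cutoff)
    return formed / len(distances)


def windowed_occupancy(distances, window, cutoff=4.0):
    """Occupancy in consecutive, non-overlapping windows of `window` frames."""
    return [
        occupancy(distances[start:start + window], cutoff)
        for start in range(0, len(distances), window)
    ]


# Two made-up distance traces (in Angstrom) with the same pooled occupancy:
# one where the contact is formed early and then breaks, one where it forms late.
formed_then_broken = [3.0] * 7 + [8.0] * 3
broken_then_formed = [8.0] * 3 + [3.0] * 7
```

      Both hypothetical traces give a pooled occupancy of 0.7, yet their windowed occupancies run in opposite directions, which is why a time series per replicate is needed to distinguish the scenarios raised by the reviewer.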

      (2) Similarly, on page 4, it is stated that "whether by protonation or mutation - the extracellular gate only opens spontaneously when both the H87 interaction network and D342-R206 are perturbed (Figure S5)." I do not agree with this assessment. The authors need to be aware of the limitations of this approach. Consider "WT H87-prot" and "D342A H87-prot": when D342 residue is mutated, in one out of 3 simulations, we see the opening of the gate within 1 us. When D342 residue is not mutated we do not see the opening in any of the 3 simulations within 1 us. It is quite likely that if rather than 3 we have 10 simulations or rather than 1 us we have 10 us simulations, the 0/3 to 1/3 changes significantly. I do not find this argument and conclusion compelling at all.

      If the conclusions were based on that alone, then we would agree. However, this section of work covers merely the observations of the initial unbiased simulations which we go on to test/explore with enhanced sampling in the rest of the paper, and which then lead us to the eventual conclusions.

      Figure S5 shows the results from triplicate 1 μs-long trajectories as violin-plot histograms of the extracellular gate opening distance, also indicating the first and final frames of the trajectories as connected by an arrow for orientation – a format we chose for intuitively comparing 48 trajectories in one plot. The reviewer reads the plot correctly when they analyse the “WT H87-prot” vs “D342A H87-prot” conditions. In the former case, no spontaneous opening in unbiased MD is taking place, whereas when D342 is mutated to alanine in addition to H87 protonation, we see spontaneous transition in 1 out of 3 replicates. However, the reviewer does not seem to interpret the statement in question in our paper (“the extracellular gate only opens spontaneously when both the H87 interaction network and D342-R206 are perturbed”) in the way we intended it to be understood. We merely want to note here a correlation in the unbiased dataset we collected at this stage, and indeed the one spontaneous opening in the case comparison picked out by the reviewer is in the condition where both the H87 interaction network and D342-R206 are perturbed. In noting this we do not intend to make statistically significant statements from the limited dataset. Instead, we write that “these simulations show a large amount of stochasticity and drawing clean conclusions from the data is difficult”. We do however stand by our assessment that from this limited data we can “already appreciate a possible mechanism where protons move down the transporter pore” – a hypothesis we investigate more rigorously with enhanced sampling in the rest of the paper. We will revise the section in question to make clearer that the unbiased MD is only meant to give an initial hypothesis here to be investigated in more detail in the following sections. In doing so, we will also incorporate, as we had not done before, the case (not picked out by the reviewer here but concerning the same figure) of S321A & H87 prot. In the third replicate, this shows partial gate opening towards the end of the unbiased trajectory (despite D342 not being affected), highlighting further the stochastic nature that makes even clear correlative conclusions difficult to draw.
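
      The point that 1/3 versus 0/3 spontaneous openings carries no statistical weight can be checked with a small worked example. The sketch below is an illustration only and was not part of the original analysis. It assumes that each replicate is treated as an independent opened/not-opened outcome and applies a two-sided Fisher's exact test written with the standard library; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions, columns are (opened, not opened).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x openings in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# "D342A H87-prot": 1 opening in 3 replicates; "WT H87-prot": 0 openings in 3.
p_value = fisher_exact_two_sided(1, 2, 0, 3)
```

      For this table the test returns a p-value of 1, consistent with treating the unbiased trajectories only as a source of hypotheses to be examined with enhanced sampling.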

      (3) While the MEMENTO methodology is novel and interesting, the method is presented as flawless in the manuscript, which is not true at all. It is stated on Page 5 with regards to the path generated by MEMENTO that "These paths are then by definition non-hysteretic." I think this is too big of a claim to say the paths generated by MEMENTO are non-hysteretic by definition. This claim is not even mentioned in the original MEMENTO paper. What is mentioned is that linear interpolation generates a hysteresis-free path by definition. There are two important problems here: (a) MEMENTO uses the linear interpolation as an initial step but modifies the intermediates significantly later so they are no longer linearly interpolated structures and thus the path is no longer hysteresis-free; (b) a more serious problem is the attribution of by-definition hysteresis-free features to the linearly interpolated states. This is based on conflating the hysteresis-free and unique concepts. The hysteresis in MD-based enhanced sampling is related to the presence of barriers in orthogonal space. For instance, one may use a non-linear interpolation of any type and get a unique pathway, which could be substantially different from the one coming from the linear interpolation. None of these paths will be hysteresis-free necessarily once subjected to MD-based enhanced sampling techniques.

      We certainly do not intend to claim that the MEMENTO method is flawless. The concern the reviewer raises around the statement "These paths are then by definition non-hysteretic" is perhaps best addressed by a clarification of the language used and considering how MEMENTO is applied in this work.

      Hysteresis in the most general sense denotes the dependence of a system on its history, or – more specifically – the lagging behind of the system state with regards to some physical driver (for example the external field in magnetism, whence the term originates). In the context of biased MD and enhanced sampling, hysteresis commonly denotes the phenomenon where a path created by a biased dynamics method along a certain collective variable lags behind in phase space in slow orthogonal degrees of freedom (see Figure 1 in Lichtinger and Biggin 2023, https://doi.org/10.1021/acs.jctc.3c00140). When used to generate free energy profiles, this can manifest as starting state bias, where the conformational state that was used to seed the biased dynamics appears lower in free energy than alternative states. Figure S6 shows this effect on the PepT2 system for both steered MD (heavy atom RMSD CV) + umbrella sampling (tip CV) and metadynamics (tip CV). There is, in essence, a coupled problem: without an appropriate CV (which we did not have to start with here), path generation that is required for enhanced sampling displays hysteresis, but the refinement of CVs is only feasible when paths connecting the true phase space basins of the two conformations are available. MEMENTO helps solve this issue by reconstructing protein conformations along morphing paths which perform much better than steered MD paths with respect to giving consistent free energy profiles (see Figure S7 and the validation cases in the MEMENTO paper), even if the same CV is used in umbrella sampling.

      There are still differences between replicates in those PMFs, indicating slow conformational flexibility propagated from end-state sampling through MEMENTO. We use this to refine the CVs further with dimensionality reduction (see the Method section and Figure S8), before moving to 2D-umbrella sampling (Figure 3). Here, we think, the reviewer’s point has some bearing. The MEMENTO paths are ‘non-hysteretic by definition’ with respect to given end states in the sense that they connect (by definition) the correct conformations at both end-states (unlike steered MD), which in enhanced sampling manifests as the absence of the strong starting-state bias we had previously observed (Figure S7 vs S6). They are not, however, hysteresis-free with regards to how representative of the end-state conformational flexibility the structures given to MEMENTO really were, which is where the iterative CV design and combination of several MEMENTO paths in 2D-PMFs comes in.

      We also cannot make a direct claim about whether in the transition region the MEMENTO paths might be separated from the true (lower free energy) transition paths by slow orthogonal degrees of freedom, which may conceivably result in overestimated barrier heights separating two free energy basins. We cannot guarantee that this is not the case, but neither in our MEMENTO validation examples nor in this work have we encountered any indications of a problem here.

      We hope that the reviewer will be satisfied by our revision, where we will replace the wording in question by a statement that the MEMENTO paths do not suffer from hysteresis that is otherwise incurred as a consequence of not reaching the correct target state in the biased run (in some orthogonal degrees of freedom).

    1. Author Response

      Response to the Reviews

      We are grateful for these balanced, nuanced evaluations of our work concerning the observed epistatic trends and our interpretations of their mechanistic origins. Overall, we think the reviewers have done an excellent job at recognizing the novel aspects of our findings while also discussing the caveats associated with our interpretations of the biophysical effects of these mutations. We believe it is important to consider both of these aspects of our work in order to appreciate these advances and what sorts of pertinent questions remain.

      Notably, both reviewers suggest that a lack of experimental approaches to compare the conformational properties of GnRHR variants weakens our claims. We would first humbly suggest that this constitutes a more general caveat that applies to nearly all investigations of the cellular misfolding of α-helical membrane proteins. Whether or not any current in vitro folding measurements report on conformational transitions that are relevant to cellular protein misfolding reactions remains an active area of debate (discussed further below). Nevertheless, while we concede that our structural and/or computational evaluations of various mutagenic effects remain speculative, prevailing knowledge on the mechanisms of membrane protein folding suggests our mutations of interest (V276T and W107A) are highly unlikely to promote misfolding in precisely the same way. Thus, regardless of whether or not we were able to experimentally compare the relevant folding energetics of GnRHR variants, we are confident that the distinct epistatic interactions formed by these mutations reflect variations in the misfolding mechanism and that they are distinct from the interactions that are observed in the context of stable proteins. In the following, we provide detailed considerations concerning these caveats in relation to the reviewers’ specific comments.

      Reviewer #1 (Public Review):

      The paper carries out an impressive and exhaustive non-sense mutagenesis using deep mutational scanning (DMS) of the gonadotropin-releasing hormone receptor for the WT protein and two single point mutations that (i) influence TM insertion (V276T) and (ii) influence protein stability (W107A), and then measures the effect of these mutants on correct plasma membrane expression (PME).

      Overall, most mutations decreased mGnRHR PME levels in all three backgrounds, indicating poor mutational tolerance under these conditions. The W107A variant wasn't really recoverable with low levels of plasma membrane localisation. For the V276T variant, most additional mutations were more deleterious than WT based on correct trafficking, indicating a synergistic effect. As one might expect, there was a higher degree of positive correlation between V276T/W107A mutants and other mutants located in TM regions, confirming that improper trafficking was a likely consequence of membrane protein co-translational folding. Nevertheless, context is important, as positive synergistic mutants in the V276T background could be negative in the W107A background and vice versa. Taken together, this important study highlights the complexity of membrane protein folding in dissecting the mechanism-dependent impact of disease-causing mutations related to improper trafficking.

      Strengths

      This is a novel and exhaustive approach to dissecting how receptor mutations under different mutational backgrounds related to co-translational folding, could influence membrane protein trafficking.

      Weaknesses

      The premise for the study requires an in-depth understanding of how the single-point mutations analysed affect membrane protein folding, but the single-point mutants used seem to lack proper validation.

      Given our limited understanding of the structural properties of misfolded membrane proteins, it is unclear whether the relevant conformational effects of these mutations can be unambiguously validated using current biochemical and/or biophysical folding assays. X-ray crystallography, cryo-EM, and NMR spectroscopy measurements have demonstrated that many purified GPCRs retain native-like structural ensembles within certain detergent micelles, bicelles, and/or nanodiscs. However, helical membrane protein folding measurements typically require titration with denaturing detergents to promote the formation of a denatured state ensemble (DSE), which will invariably retain considerable secondary structure. Given that the solvation provided by mixed micelles is clearly distinct from that of native membranes, it remains unclear whether these DSEs represent a reasonable proxy for the misfolded conformations recognized by cellular quality control (QC, see https://doi.org/10.1021/acs.chemrev.8b00532). Thus, the use and interpretation of these systems for such purposes remains contentious in the membrane protein folding community. In addition to this theoretical issue, we are unaware of any instances in which GPCRs have been found to undergo reversible denaturation in vitro, a practical requirement for equilibrium folding measurements (https://doi.org/10.1146/annurev-biophys-051013-022926). We note that, while the resistance of GPCRs to aggregation, proteolysis, and/or mechanical unfolding has also been probed in micelles, it is again unclear whether the associated thermal, kinetic, and/or mechanical stability should necessarily correspond to their resistance to cotranslational and/or posttranslational misfolding. Thus, even if we had attempted to validate the computational folding predictions employed herein, we suspect that any resulting correlations with cellular expression may have justifiably been viewed by many as circumstantial. Simply put, we know very little about the non-native conformations that are generally involved in the cellular misfolding of α-helical membrane proteins, much less how to measure their relative abundance. From a philosophical standpoint, we prefer to let cells tell us what sorts of broken protein variants are degraded by their QC systems, then do our best to surmise what this tells us about the relevant properties of cellular DSEs.

      Despite this fundamental caveat, we believe that the chosen mutations and our interpretation of their relevant conformational effects are reasonably well-informed by current modeling tools and by prevailing knowledge on the physicochemical drivers of membrane protein folding and misfolding. Specifically, the mechanistic constraints of translocon-mediated membrane integration provide an understanding of the types of mutations that are likely to disrupt cotranslational folding. Though we are still learning about the protein complexes that mediate membrane translocation (https://doi.org/10.1038/s41586-022-05336-2), it is known that this underlying process is fundamentally driven by the membrane depth-dependent amino acid transfer free energies (https://doi.org/10.1146/annurev.biophys.37.032807.125904). This energetic consideration suggests introducing polar side chains near the center of nascent TMDs should almost invariably reduce the efficiency of topogenesis. To confirm this in the context of TMD6 specifically, we utilized a well-established biochemical reporter system to confirm that V276T attenuates its translocon-mediated membrane integration (Fig. S1), at least in the context of a chimeric protein. We also constructed a glycosylation-based topology reporter for full-length GnRHR, but ultimately found its in vitro expression to be insufficient to detect changes in the nascent topological ensemble. In contrast to V276T, the W107A mutation is predicted to preserve the native topological energetics of GnRHR due to its position within a soluble loop region. W107A is also unlike V276T in that it clearly disrupts tertiary interactions that stabilize the native structure. This mutation should preclude the formation of a structurally conserved hydrogen bonding network that has been observed in the context of at least 25 native GPCR structures (https://doi.org/10.7554/eLife.5489). However, without a relevant folding assay, the extent to which this network stabilizes the native GnRHR fold in cellular membranes remains unclear. Overall, we admit that these limitations have prevented us from measuring how much V276T alters the efficiency of GnRHR topogenesis, how much W107A destabilizes the native fold, or vice versa. Nevertheless, given these design principles and the fact that both reduce the plasma membrane expression of GnRHR, as expected, we are highly confident that the structural defects generated by these mutations do, in fact, promote misfolding in their own ways. We also concede that the degree to which these mutagenic perturbations are indeed selective for specific folding processes is somewhat uncertain. However, it seems exceedingly unlikely that these mutations should disrupt topogenesis and/or the folding of the native topomer to the exact same extent. From our perspective, this is the most important consideration with respect to the validity of the conclusions we have made in this manuscript.

      Furthermore, plasma membrane expression has been used as a proxy for incorrect membrane protein folding, but this may not necessarily be the case, as even correctly folded membrane proteins may not be trafficked correctly, at least under heterologous expression conditions. In addition, mutations can affect trafficking and potential post-translational modifications, like glycosylation.

      While the reviewer is correct that the sorting of folded proteins within the secretory pathway is generally inefficient, it is also true that the maturation of nascent proteins within the ER generally bottlenecks the plasma membrane expression of most α-helical membrane proteins. Our group and several others have demonstrated that the efficiency of ER export generally appears to scale with the propensity of membrane proteins to achieve their correct topology and/or to achieve their native fold (see https://doi.org/10.1021/jacs.5b03743 and https://doi.org/10.1021/jacs.8b08243). Notably, these investigations all involved proteins that contain native glycosylation and various other post-translational modification sites. While we cannot rule out that certain specific combinations of mutations may alter expression through their perturbation of post-translational GnRHR modifications, we feel confident that the general trends we have observed across hundreds of variants predominantly reflect changes in folding and cellular QC. This interpretation is supported by the relationship between observed trends in variant expression and Rosetta-based stability calculations, which we identified using unbiased unsupervised machine learning approaches (compare Figs. 6B & 6D).

      Reviewer #2 (Public Review):

      Summary:

      In this paper, Chamness and colleagues make a pioneering effort to map epistatic interactions among mutations in a membrane protein. They introduce thousands of mutations to the mouse GnRH Receptor (GnRHR), either under wild-type background or two mutant backgrounds, representing mutations that destabilize GnRHR by distinct mechanisms. The first mutant background is W107A, destabilizing the tertiary fold, and the second, V276T, perturbing the efficiency of cotranslational insertion of TM6 to the membrane, which is essential for proper folding. They then measure the surface expression of these three mutant libraries, using it as a proxy for protein stability, since misfolded proteins do not typically make it to the plasma membrane. The resulting dataset is then used to shed light on how diverse mutations interact epistatically with the two genetic background mutations. Their main conclusion is that epistatic interactions vary depending on the degree of destabilization and the mechanism through which they perturb the protein. The mutation V276T forms primarily negative (aggravating) epistatic interactions with many mutations, as is common to destabilizing mutations in soluble proteins. Surprisingly, W107A forms many positive (alleviating) epistatic interactions with other mutations. They further show that the locations of secondary mutations correlate with the types of epistatic interactions they form with the above two mutants.

      Strengths:

      Such a high throughput study for epistasis in membrane proteins is pioneering, and the results are indeed illuminating. Examples of interesting findings are that: (1) No single mutation can dramatically rescue the destabilization introduced by W107A. (2) Epistasis with a secondary mutation is strongly influenced by the degree of destabilization introduced by the primary mutation. (3) Misfolding caused by mis-insertion tends to be aggravated by further mutations. The discussion of how protein folding energetics affects epistasis (Fig. 7) makes a lot of sense and lays out an interesting biophysical framework for the findings.

      Weaknesses:

      The major weakness comes from the potential limitations in the measurements of surface expression of severely misfolded mutants. This point is discussed quite fairly in the paper, in statements like "the W107A variant already exhibits marginal surface immunostaining" and many others. It seems that only about 5% of the W107A makes it to the plasma membrane compared to wild-type (Figures 2 and 3). This might be a low starting point from which to accurately measure the effects of secondary mutations.

      The reviewer raises an excellent point that we considered at length during the analysis of these data and the preparation of the manuscript. Though we remain confident in the integrity of these measurements and the corresponding analyses, we now realize this aspect of the data merits further discussion and documentation in our forthcoming revision, in which we will outline the following specific lines of reasoning.

      Still, the authors claim that measurements of W107A double mutants "still contain cellular subpopulations with surface immunostaining intensities that are well above or below that of the W107A single mutant, which suggests that this fluorescence signal is sensitive enough to detect subtle differences in the PME of these variants". I was not entirely convinced that this was true.

      We made this statement based on the simple observation that the distribution of surface immunostaining intensities across the population of recombinant cells expressing the library of W107A double mutants was consistently broader than that of recombinant cells expressing W107A GnRHR alone (see Author response image 1 for reference). Given that the recombinant cellular library represents a mix of cells expressing ~1600 individual variants that are each present at low abundance, the pronounced tails within this distribution presumably represent the composite staining of many small cellular subpopulations that express collections of variants that deviate from the expression of W107A to an extent that is significant enough to be visible on a log intensity plot.

      Author response image 1.

      Firstly, I think it would be important to test how much noise these measurements have and how much surface immunostaining the W107A mutant displays above the background of cells that do not express the protein at all.

      For reference, the average surface immunostaining intensity of HEK293T cells transiently expressing W107A GnRHR was 2.2-fold higher than that of the IRES-eGFP negative, untransfected cells within the same sample; by comparison, the WT immunostaining intensity was 9.5-fold over background. Similarly, recombinant HEK293T cells expressing the W107A double mutant library had an average surface immunostaining intensity that was 2.6-fold over background across the two DMS trials. Thus, while the surface immunostaining of this variant is certainly diminished, we were still able to reliably detect W107A at the plasma membrane even under distinct expression regimes. We will include these and other signal-to-noise metrics for each experiment in a new table in the revised version of this manuscript.

      Beyond considerations related to intensity, we also previously noticed the relative intensity values for W107A double mutants exhibited considerable precision across our two biological replicates. If the signal were too poor to detect changes in variant expression, we would have expected a plot of the intensity values across these two replicates to form a scatter. Instead, we found DMS intensity values for individual variants to be highly correlated from one replicate to the next (Pearson’s R = 0.97, see Author response image 2 for reference). This observation empirically demonstrates that this assay consistently differentiated variants that exhibit slightly enhanced immunostaining from those that have even lower immunostaining than W107A GnRHR.
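
      For readers who wish to reproduce this kind of replicate comparison on their own data, a minimal sketch of the Pearson correlation is given below. It is not the code used to obtain the value reported above. The function name and the assumption that each replicate is supplied as a list of per-variant intensity values in matching order are illustrative.

```python
from math import sqrt


def pearson_r(replicate_1, replicate_2):
    """Pearson correlation between matched per-variant intensity values."""
    if len(replicate_1) != len(replicate_2) or len(replicate_1) < 2:
        raise ValueError("replicates must have the same length (at least 2)")
    mean_1 = sum(replicate_1) / len(replicate_1)
    mean_2 = sum(replicate_2) / len(replicate_2)
    covariance = sum(
        (a - mean_1) * (b - mean_2) for a, b in zip(replicate_1, replicate_2)
    )
    variance_1 = sum((a - mean_1) ** 2 for a in replicate_1)
    variance_2 = sum((b - mean_2) ** 2 for b in replicate_2)
    return covariance / sqrt(variance_1 * variance_2)
```

      A value near 1 indicates that variants rank consistently between replicates, whereas measurements dominated by noise would give a value near 0.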

      Author response image 2.

      But more importantly, it is not clear if under this regimen surface expression still reports on stability/protein fitness. It is unknown if the W107A retains any function or folding at all. For example, it is possible that the low amount of surface protein represents misfolded receptors that escaped the ER quality control.

      While we believe that such questions are outside the scope of this work, we certainly agree that it is entirely possible that some of these variants bypass QC without achieving their native fold. This topic is quite interesting to us but is quite challenging to assess in the context of GPCRs, which have complex fitness landscapes that involve their propensity to distinguish between different ligands, engage specific components associated with divergent downstream signaling pathways, and navigate between endocytic recycling/degradation pathways following activation. In light of the inherent complexity of GPCR function, we humbly suggest our choice of a relatively simple property of an otherwise complex protein may be viewed as a virtue rather than a shortcoming. Protein fitness is typically cast as the product of abundance and activity. Rather than measuring an oversimplified, composite fitness metric, we focused on one variable (plasma membrane expression) and its dominant effector (folding). We believe restraining the scope in this manner was key for the elucidation of clear mechanistic insights.

      The differential clustering of epistatic mutations (Fig. 6) provides some interesting insights as to the rules that dictate epistasis, but these too are dominated by the magnitude of destabilization caused by one of the mutations. In this case, the secondary mutations that had the most interesting epistasis were exceedingly destabilizing. With this in mind, it is hard to interpret the results that emerge regarding the epistatic interactions of W107A. Furthermore, the most significant positive epistasis is observed when W107A is combined with additional mutations that almost completely abolish surface expression. It is likely that either mutation destabilizes the protein beyond repair. Therefore, what we can learn from the fact that such mutations have positive epistasis is not clear to me. Based on this, I am not sure that another mutation that disrupts the tertiary folding more mildly would not yield different results. With that said, I believe that the results regarding the epistasis of V276T with other mutations are strong and very interesting on their own.

      We agree with the reviewer. In light of our results, we believe it is virtually certain that the secondary mutations characterized herein would form distinct epistatic interactions with mutations that are only mildly destabilizing. Indeed, this insight reflects one of the key takeaway messages from this work: stability-mediated epistasis is difficult to generalize because it should depend on the extent to which each mutation changes the stability (ΔΔG) as well as the initial stability of the WT/reference sequence (ΔG, see Figure 7). Frankly, we are not so sure we would have pieced this together as clearly had we not had the fortune (or misfortune?) of including such a destructive mutation as W107A as a point of reference.

      Additionally, the study draws general conclusions from the characterization of only two mutations, W107A and V276T. At this point, it is hard to know if other mutations that perturb insertion or tertiary folding would behave similarly. This should be emphasized in the text.

      We agree and will be sure to emphasize this point in the revised manuscript.

      Some statistical aspects of the study could be improved:

      1. It would be nice to see the level of reproducibility of the biological replicates in a plot, such as scatter or similar, with correlation values that give a sense of the noise level of the measurements. This should be done before filtering out the inconsistent data.

      We thank the reviewer for this suggestion and will include scatters for each genetic background like the one shown above in the supplement of the revised version of the manuscript.

      2. The statements "Variants bearing mutations within the C-terminal region (ICL3-TMD6-ECL3-TMD7) fare consistently worse in the V276T background relative to WT (Fig. 4 B & E)." and "In contrast, mutations that are better tolerated in the context of W107A mGnRHR are located throughout the structure but are particularly abundant among residues in the middle of the primary structure that form TMD4, ICL2, and ECL2 (Fig. 4 C & F)." are both hard to judge. Inspecting Figures 4B and C does not immediately show these trends, and importantly, a solid statistical test is missing here. In Figures 4E and F the locations of the different loops and TMs are not indicated on the structure, making these statements hard to judge.

      We apologize for this oversight and thank the reviewer for pointing this out. We will include additional statistical tests to reinforce these conclusions in the revised version of the manuscript.

      3. The following statement lacks a statistical test: "Notably, these 98 variants are enriched with TMD variants (65% TMD) relative to the overall set of 251 variants (45% TMD)." Is this enrichment significant? Further in the same paragraph, the claim that "In contrast to the sparse epistasis that is generally observed between mutations within soluble proteins, these findings suggest a relatively large proportion of random mutations form epistatic interactions in the context of unstable mGnRHR variants" needs to be backed by relevant data and statistics, or at least a reference.

      We will include additional statistical tests for this in the revised manuscript and will ensure the language we use is consistent with the strength of the indicated statistical enrichment.
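
      One way such an enrichment could be tested is with a one-sided hypergeometric test. The following is a minimal sketch, not the analysis used in the manuscript. It assumes the 98 variants are a subset of the 251, and the counts are back-calculated from the quoted percentages (roughly 113 TMD variants overall and 64 among the 98), so they are illustrative rather than taken from the data. The function name is hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(pop_size, pop_hits, sample_size, sample_hits):
    """One-sided P(X >= sample_hits) for a draw of sample_size items
    without replacement from a population containing pop_hits 'hits'."""
    upper = min(pop_hits, sample_size)
    total = comb(pop_size, sample_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, sample_size - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total


# Assumed counts, rounded from the quoted percentages (not from the raw data):
# 45% of 251 variants ~ 113 TMD variants; 65% of 98 variants ~ 64 TMD variants.
p_value = hypergeom_enrichment_p(
    pop_size=251, pop_hits=113, sample_size=98, sample_hits=64
)
print(f"one-sided enrichment p-value: {p_value:.3g}")
```

      The exact counts from the dataset should replace the rounded values before any p-value is reported.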

    1. Author response:

      Reviewer #1 (Public review):

      This work regards the role of Aurora Kinase A (AurA) in trained immunity. The authors claim that AurA is essential to the induction of trained immunity. The paper starts with a series of experiments showing the effects of suppressing AurA on beta-glucan-trained immunity. This is followed by an account of how AurA inhibition changes the epigenetic and metabolic reprogramming that are characteristic of trained immunity. The authors then zoom in on specific metabolic and epigenetic processes (regulation of S-adenosylmethionine metabolism & histone methylation). Finally, an inhibitor of AurA is used to reduce beta-glucan's anti-tumour effects in a subcutaneous MC-38 model.

      Strengths:

      With the exception of my confusion around the methods used for relative gene expression measurements, the experimental methods are generally well-described. I appreciate the authors' broad approach to studying different key aspects of trained immunity (from comprehensive transcriptome/chromatin accessibility measurements to detailed mechanistic experiments). Approaching the hypothesis from many different angles inspires confidence in the results (although not completely - see weaknesses section). Furthermore, the large drug-screening panel is a valuable tool as these drugs are readily available for translational drug-repurposing research.

      We thank the reviewer for the positive and encouraging comments.

      Weaknesses:

      (1) The manuscript contains factual inaccuracies such as: (a) Intro: the claim that trained cells display a shift from OXPHOS to glycolysis based on the paper by Cheng et al. in 2014; this was later shown to be dependent on the dose of stimulation and actually both glycolysis and OXPHOS are generally upregulated in trained cells (pmid 32320649).

      We thank the reviewer for pointing out this inaccuracy, and we will revise our statement to ensure an accurate and up-to-date description. We are aware that trained immunity involves different metabolic pathways, including both glycolysis and oxidative phosphorylation[1, 2]. We also measured the oxygen consumption rate (OCR, as detailed in comment#8) but observed no increase in oxygen consumption in trained BMDMs, whereas a previous study reported decreased oxidative phosphorylation[3]. We will discuss the potential reasons underlying these different results.

      (b) Discussion: Trained immunity was first described as such in 2011, not decades ago.

      We are sorry for the inaccurate description, and we will correct the statement in our revised manuscript as “Despite the fact that the concept of “trained immunity” has been proposed since 2011, the mechanisms that regulate trained immunity are still not completely understood.”

      (2) The authors approach their hypothesis from different angles, which inspires a degree of confidence in the results. However, the statistical methods and reporting are underwhelming.

      (a) Graphs depict mean +/- SEM, whereas mean +/- SD is almost always more informative. (b) The use of 1-tailed tests is dubious in this scenario. Furthermore, in many experiments/figures the case could be made that the comparisons should be considered paired (the responses of cells from the same animal are inherently not independent due to their shared genetic background and, up until cell isolation, the same host factors like serum composition/microbiome/systemic inflammation etc). (c) It could be explained a little more clearly how multiple testing correction was done and why specific tests were chosen in each instance.

      Thank you for the suggestions. We will revise all data presented as mean ± SEM in the manuscript to mean ± SD, provide a detailed description of how multiple comparisons were performed, and explain the rationale behind the different comparison methods used. Previous studies have shown that knockdown of GNMT increases the intracellular SAM level, and knockdown of GNMT is commonly used as a method to upregulate SAM[4-6]. Thus, we used a one-tailed test in Figure 3J.
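
      The relationship between the two summary statistics discussed here can be illustrated with a short sketch. The replicate values are hypothetical and the helper name is an assumption; the point is only that the SEM is the sample SD divided by the square root of n, so it shrinks with n while the SD describes the spread of the data itself.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(values):
    """Return n, mean, sample SD (n - 1 denominator) and SEM for one group."""
    n = len(values)
    sd = stdev(values)
    return {"n": n, "mean": mean(values), "sd": sd, "sem": sd / sqrt(n)}


# Hypothetical replicate measurements, for illustration only.
replicates = [1.0, 2.0, 3.0]
summary = summarize(replicates)
print(summary)
```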

      (d) Most experiments are done with n = 3, some experiments are done with n = 5. This is not a lot. While I don't think power analyses should be required for simple in vitro experiments, I would be wary of drawing conclusions based on n = 3. It is also not indicated if the data points were acquired in independent experiments. ATAC-seq/RNA-seq was, judging by the figures, done on only 2 mice per group. No power calculations were done for the in vivo tumor model.

      We are sorry for the confusion in our description in figure legends. As for in vitro studies, we performed at least three independent experiments (BMs isolated from different mice), but we display only technical-replicate data from one experiment in our manuscript. As for the sequencing data, we acknowledge the reviewer's concern regarding the small sample size (n=2) in our RNA-seq/ATAC-seq experiment. We consider the sequencing experiment mainly as an exploratory approach, and performed rigorous quality control and normalization of the sequencing data to ensure the reliability of our findings. While we understand that a larger sample size would be ideal for drawing more definitive conclusions, we believe that the current data offer valuable preliminary insights that can inform future studies with larger cohorts. As a complementary method, we conducted ChIP-PCR to detect enrichment of active histone modifications in the Il6 and Tnf regions, to further verify the increased accessibility of trained-immunity-induced inflammatory genes and the reliability of our conclusions despite the small sample size. We hope this clarifies our approach, and we would be happy to further acknowledge and discuss the limitations of the current study.

      For the in vivo experiment, we determined the sample size by referring to the animal numbers used for similar experiments in the literature. According to a reported resource equation approach for calculating sample size in animal studies[7], n=5-7 is suitable for most of our mouse experiments. We will describe the details in the revised Methods section.
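
      As a rough illustration of how a resource equation calculation works, the sketch below assumes the one-way ANOVA form of the approach, in which the error degrees of freedom, k(n - 1) for k groups of n animals, should fall between 10 and 20. The number of groups is hypothetical, and the function name is an assumption rather than anything taken from the cited reference.

```python
from math import ceil, floor


def resource_equation_group_size(n_groups, df_min=10, df_max=20):
    """Range of animals per group that keeps the error degrees of freedom,
    n_groups * (n - 1), within [df_min, df_max] (one-way ANOVA design)."""
    if n_groups < 2:
        raise ValueError("a group comparison needs at least two groups")
    n_min = ceil(df_min / n_groups + 1)   # round the lower bound up
    n_max = floor(df_max / n_groups + 1)  # round the upper bound down
    return n_min, n_max


# Hypothetical design with three groups.
print(resource_equation_group_size(3))
```

      Other designs (repeated measures, paired comparisons) use different degrees-of-freedom formulas and would give different ranges.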

      (e) Furthermore, the data spread in many experiments (particularly BMDM experiments) is extremely small. I wonder if these are true biological replicates, meaning each point represents BMDMs from a different animal? (disclaimer: I work with human materials where the spread is of course always much larger than in animal experiments, so I might be misjudging this.).

      We are sorry for the confusion in our description in figure legends. In the in vivo experiments, individual mice serve as biological replicates; the exact values of n are reported in the figure legends, and each point represents data from a different animal (Figure 1I and Figure 6). The in vitro cell assays were performed in triplicate, each experiment was independently replicated at least three times, and points represent technical replicates.

      (3) Maybe the authors are reserving this for a separate paper, but it would be fantastic if the authors would report the outcomes of the entire drug screening instead of only a selected few. The field would benefit from this as it would save needless repeat experiments. The list of drugs contains several known inhibitors of training (e.g. mTOR inhibitors) so there must have been more 'hits' than the reported 8 Aurora inhibitors.

      Thank you for your suggestion and we will report the outcomes of the entire drug screening in the revised manuscript.

      (4) Relating to the drug screen and subsequent experiments: it is unclear to me in supplementary figure 1B which concentrations belong to secondary screens #1/#2 - the methods mention 5 µM for the primary screen and "0.2 and 1 µM" for secondary screens, is it in this order or in order of descending concentration?

      Thank you for your comments, and we are sorry for the unclearly labelled results in Supplementary Figure 1B. We performed the secondary drug screen at two concentrations; the drug concentrations corresponding to secondary screens #1 and #2 are 0.2 and 1 μM, respectively. That is, they are listed in this order, not in order of descending concentration.

      (a) It is unclear if the drug screen was performed with technical replicates or not - the supplementary figure 1B suggests no replicates and quite a large spread (in some cases lower concentration works better?)

      Thank you for your question. The drug screen was performed without technical replicates. Indeed, we observed that a lower concentration works better in some cases. This might be because a drug's effect correlates positively with its concentration only within a specific range (as seen in comment#4).

      (5) The methods for (presumably) qPCR for measuring gene expression in Figure 1C are missing. Which reference gene was used and is this a suitably stable gene?

      We are sorry for the omission of the qPCR method. The mRNA expression of Il6 and Tnf in trained BMDMs was normalized to that in untrained BMDMs, and β-actin served as the reference gene. We will describe this in detail in our revised manuscript.
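
      The normalization described here can be sketched with the widely used 2^-ΔΔCt calculation. The response does not state which calculation was used, so this is an assumption, as are the Ct values and the function name; the sketch also assumes roughly 100% amplification efficiency for both the target and the reference gene.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample relative to a control,
    normalized to a reference gene (2^-ddCt, assuming ~100% efficiency)."""
    delta_sample = ct_target_sample - ct_ref_sample      # dCt, e.g. trained cells
    delta_control = ct_target_control - ct_ref_control   # dCt, e.g. untrained cells
    return 2 ** -(delta_sample - delta_control)


# Hypothetical Ct values: target gene vs. reference gene in trained and untrained cells.
fold_change = relative_expression(22.0, 16.0, 25.0, 16.0)
print(fold_change)
```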

      (6) From the complete unedited blot image of Figure 1D it appears that the p-Aurora and total Aurora are not from the same gel (discordant number of lanes and positioning). This could be alright if there are no/only slight technical errors, but I find it misleading as it is presented as if the actin (loading control to account for aforementioned technical errors!) counts for the entire figure.

      Thanks for this comment. In the original data, p-Aurora and total Aurora were from different gels. In this experiment, membrane stripping and reprobing after the p-Aurora antibody did not work well, so we could not obtain all results from one gel and had to run another gel using the same samples to blot with the anti-Aurora antibody. We should have provided separate actin blots as loading controls for this experiment. We will repeat the experiment and provide original data from three biological replicates to confirm the result.

      Figure 2: This figure highlights results that are by far not the strongest ones - I think the 'top hits' deserve some more glory. A small explanation on why the highlighted results were selected would have been fitting.

      We appreciate the valuable suggestion. We will add a discussion of this point to our revised manuscript.

      (7) Figure 3 incl supplement: the carbon tracing experiments show more glucose-carbon going into TCA cycle (suggesting upregulated oxidative metabolism), but no mito stress test was performed on the seahorse.

      We appreciate this question raised by the reviewer. We previously performed a Seahorse XF mito stress test in β-glucan-trained BMDMs in combination with alisertib (data not shown in our submitted manuscript). The results showed no increase in oxidative phosphorylation under β-glucan stimulation.

      Author response image 1.

      (8) Inconsistent use of an 'alisertib-alone' control in addition to 'medium', 'b-glucan', 'b-glucan + alisertib'. This control would be of great added value in many cases, in my opinion.

      Thank you for your comment. We appreciate that including an “alisertib-alone” group throughout all the experiments may add more value to the findings. We set the aim of the current study to investigate the role of Aurora kinase A in trained immunity. Therefore, in most settings, we did not focus on the role of Aurora kinase A without β-glucan stimulation. Initially, we showed in Figure 1B and 1C that alisertib alone at concentrations up to and including 1 μM does not affect the response to a secondary stimulus. In a previous report, the authors showed that an Aurora A inhibitor alone did not affect trained immunity[8]. Thus, we did not include this control group in all of the experiments.

      (9) Figure 4A: looking at the unedited blot images, the blot for H3K36me3 appears in its original orientation, whereas other images appear horizontally mirrored. Please note, I don't think there is any malicious intent but this is quite sloppy and the authors should explain why/how this happened (are they different gels and the loading sequence was reversed?)

      Thank you for pointing out this error. After checking the original data, we found that we indeed misassembled the orientation of several blots. We went through the assembling process and figured out that some orientations were assembled according to the loading sequences but not saved, so that the orientations in Figure 4A were not consistent with the unedited blot image. We are sorry for the careless mistake, and we will double check to make sure all the blots are correctly assembled in the revised manuscript.

      (10) For many figures, for example prominently figure 5, the text describes 'beta-glucan training' whereas the figures actually depict acute stimulation with beta-glucan. While this is partially a semantic issue (technically, the stimulation is 'the training-phase' of the experiment), this could confuse the reader.

      Thanks for the reviewer’s suggestion and we will reorganize our language to ensure clarity and avoid any inconsistencies that might lead to misunderstanding.

      (11) Figure 6: Cytokines, especially IL-6 and IL-1β, can be excreted by tumour cells and have pro-tumoral functions. This is not likely in the context of the other results in this case, but since there is flow cytometry data from the tumour material it would have been nice to see also intracellular cytokine staining to pinpoint the source of these cytokines.

      Thanks for the reviewer’s suggestion. To address the potential concern raised by the reviewer, we will perform intracellular cytokine staining in tumor experiments with mice trained with β-glucan alone or in combination with alisertib, followed by MC38 inoculation.

      Reviewer #2 (Public review):

      Summary:

      This manuscript investigates the inhibition of Aurora A and its impact on β-glucan-induced trained immunity via the FOXO3/GNMT pathway. The study demonstrates that inhibition of Aurora A leads to overconsumption of SAM, which subsequently impairs the epigenetic reprogramming of H3K4me3 and H3K36me3, effectively abolishing the training effect.

      Strengths:

      The authors identify the role of Aurora A through small molecule screening and validation using a variety of molecular and biochemical approaches. Overall, the findings are interesting and shed light on the previously underexplored role of Aurora A in the induction of β-glucan-driven epigenetic change.

      We thank the reviewer for the positive and encouraging comments.

      Weaknesses:

      Given the established role of histone methylations, such as H3K4me3, in trained immunity, it is not surprising that depletion of the methyl donor SAM impairs the training response. Nonetheless, this study provides solid evidence supporting the role of Aurora A in β-glucan-induced trained immunity in murine macrophages. The in vivo evidence for a trained-immunity antitumor effect is insufficient to support the final claim, as alisertib could inhibit Aurora A in cell types other than myeloid cells.

      We appreciate the question raised by the reviewer. Though SAM generally acts as a methyl donor, whether the epigenetic reprogramming in trained immunity is directly linked to SAM metabolism is not known. In our study, we provided evidence suggesting the necessity of SAM maintenance in supporting trained immunity. As for the in vivo tumor model, tumor cells were subcutaneously inoculated 24 h after oral administration of alisertib. Previous studies showed that orally administered alisertib had a half-life of 10 h and a 90% reduction in serum concentration after 24 h [9, 10]. Therefore, we suppose that tumor cells are more likely to be affected by the long-term effects of the drug on the immune system than directly by alisertib. To further address the reviewer’s concern, we will perform bone marrow transplantation (trained mice as donors and naïve mice as recipients) to clarify the mechanistic contribution of trained immunity versus off-target effects.
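
      For orientation, the fraction of drug expected to remain after a given interval can be estimated from a half-life under idealized assumptions. The sketch below assumes simple first-order, single-compartment elimination, which is a simplification and not a model taken from the cited pharmacokinetic studies; measured serum values can differ from this estimate. The function name is hypothetical.

```python
def fraction_remaining(hours, half_life_hours):
    """Fraction of the initial concentration left after `hours`,
    assuming first-order (exponential) elimination."""
    return 0.5 ** (hours / half_life_hours)


# Values quoted in the response: 10 h half-life, 24 h between dosing and inoculation.
remaining = fraction_remaining(24, 10)
print(f"{remaining:.1%} remaining, {1 - remaining:.1%} eliminated")
```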

      Cited references

      (1) Ferreira, A.V., et al., Metabolic Regulation in the Induction of Trained Immunity. Semin Immunopathol, 2024. 46(3-4): p. 7.

      (2) Keating, S.T., et al., Rewiring of glucose metabolism defines trained immunity induced by oxidized low-density lipoprotein. J Mol Med (Berl), 2020. 98(6): p. 819-831.

      (3) Li, X., et al., Maladaptive innate immune training of myelopoiesis links inflammatory comorbidities. Cell, 2022. 185(10): p. 1709-1727.e18.

      (4) Luka, Z., S.H. Mudd, and C. Wagner, Glycine N-methyltransferase and regulation of S-adenosylmethionine levels. J Biol Chem, 2009. 284(34): p. 22507-11.

      (5) Hughey, C.C., et al., Glycine N-methyltransferase deletion in mice diverts carbon flux from gluconeogenesis to pathways that utilize excess methionine cycle intermediates. J Biol Chem, 2018. 293(30): p. 11944-11954.

      (6) Simile, M.M., et al., Nuclear localization dictates hepatocarcinogenesis suppression by glycine N-methyltransferase. Transl Oncol, 2022. 15(1): p. 101239.

      (7) Arifin, W.N. and W.M. Zahiruddin, Sample Size Calculation in Animal Studies Using Resource Equation Approach. Malays J Med Sci, 2017. 24(5): p. 101-105.

      (8) Benjaskulluecha, S., et al., Screening of compounds to identify novel epigenetic regulatory factors that affect innate immune memory in macrophages. Sci Rep, 2022. 12(1): p. 1912.

      (9) Yang, J.J., et al., Preclinical drug metabolism and pharmacokinetics, and prediction of human pharmacokinetics and efficacious dose of the investigational Aurora A kinase inhibitor alisertib (MLN8237). Drug Metab Lett, 2014. 7(2): p. 96-104.

      (10) Palani, S., et al., Preclinical pharmacokinetic/pharmacodynamic/efficacy relationships for alisertib, an investigational small-molecule inhibitor of Aurora A kinase. Cancer Chemother Pharmacol, 2013. 72(6): p. 1255-64.

    1. Author response:

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility, and clarity):

      The work by Pinon et al. describes the generation of a microvascular model to study Neisseria meningitidis interactions with blood vessels. The model uses a novel and relatively high-throughput fabrication method that allows full control over the geometry of the vessels. The model is well characterized. The authors then study different aspects of Neisseria-endothelial interactions and benchmark the bacterial infection model against the best disease model available, a human skin xenograft mouse model, which is one of the great strengths of the paper. The authors show that Neisseria binds to the 3D model in a geometry similar to that in the animal xenograft model, induces an increase in permeability shortly after bacterial perfusion, and induces endothelial cytoskeleton rearrangements. Finally, the authors show neutrophil recruitment to bacterial microcolonies and phagocytosis of Neisseria. The article is overall well written, and it is a great advancement in the bioengineering and sepsis infection field. I have only a few major comments and some minor ones.

      Major comments:

      Infection-on-chip. I would recommend the authors to change the terminology of "infection on chip" to better reflect their work. The term is vague and it decreases novelty, as there are multiple infection-on-chip models that recapitulate other infections (recently reviewed in https://doi.org/10.1038/s41564-024-01645-6) including Ebola, SARS-CoV-2, Plasmodium and Candida. Maybe the term "sepsis on chip" would be more specific and exemplify better the work and novelty. Also, I would suggest that the authors carefully take a look at the text and consider when they use VoC or the current term IoC, as of now sometimes they are used interchangeably, with VoC being used occasionally in bacteria perfused experiments.

      We thank Reviewer #1 for this suggestion. Indeed, we have chosen to replace the term "Infection-on-Chip" by "infected Vessel-on-chip" to avoid any confusion in the title and the text. Also, we have removed all the terms "IoC" which referred to "Infection-on-Chip" and replaced with "VoC" for "Vessel-on-Chip". We think these terms will improve the clarity of the main text.

      Author response image 1.

      F-actin (red) and ezrin (yellow) staining after 3h of infection with N. meningitidis (green) in 2D (top) and 3D (bottom) vessel-on-chip models.

      Fig 3 and Supplementary 3: Permeability. The authors suggest that early 3h infection with Neisseria do not show increase in vascular permeability in the animal model, contrary to their findings in the 3D in vitro model. However, they show a non-significant increase in permeability of 70 KDa Dextran in the animal xenograft early infection. This seems to point that if the experiment would have been done with a lower molecular weight tracer, significant increases in permeability could have been detected. I would suggest to do this experiment that could capture early events in vascular disruption.

      Comparing permeability under healthy and infected conditions using Dextran smaller than 70 kDa is challenging. Previous research (1) has shown that molecules below 70 kDa already diffuse freely in healthy tissue. Given this high baseline diffusion, we believe that no significant difference would be observed before and after N. meningitidis infection, and these experiments were not carried out. As discussed in the manuscript, bacteria-induced permeability in the mouse occurs at later time points, 16h post infection, as shown previously (2). This difference between the xenograft model and the chip likely reflects the absence in the chip of various cell types present in the tissue parenchyma.

      The authors show the formation of an actin honeycomb structure beneath the bacterial microcolonies. This only occurred in 65% of the microcolonies. Is this result similar to in vitro 2D endothelial cultures in static and under flow? Also, the group has shown in the past positive staining of other cytoskeletal proteins, such as ezrin in the ERM complex. Does this also occur in the 3D system?

      We thank the Reviewer #1 for this suggestion.

      • According to this recommendation, we imaged monolayers of endothelial cells in the flat regions of the chip (the two lateral channels) using the same microscopy conditions (i.e., Obj. 40X N.A. 1.05) that have been used to detect honeycomb structures in the 3D vessels in vitro. We showed that more than 56% of infected cells present these honeycomb structures in 2D, which is 13% less than in 3D, and is not significant due to the distributions of both populations. Thus, we conclude that under both in vitro conditions, 2D and 3D, the amount of infected cells exhibiting cortical plaques is similar. We have added the graph and the confocal images in Figure S4B and lines 418-419 of the revised manuscript.

      • We recently performed staining of ezrin in the chip and imaged both the 3D and 2D regions. Although ezrin staining was visible in 3D (Fig. 1 of this response), it was not as obvious as other markers under these infected conditions and we did not include it in the main text. Interpretation of this result is not straightforward because, for instance, the substrate of the cells is different, and it would require further studies on the behaviour of ERM proteins in these different contexts.

      One of the most novel things of the manuscript is the use of a relatively quick photoablation system. I would suggest that the authors add a more extensive description of the protocol in methods. Could this technique be applied in other laboratories? If this is a major limitation, it should be listed in the discussion.

      Following the Reviewer’s comment, we introduced more detailed explanations regarding the photoablation:

      • L157-163 (Results): "Briefly, the chosen design is digitalized into a list of positions to ablate. A pulsed UV-LASER beam is injected into the microscope and shaped to cover the back aperture of the objective. The laser is then focused on each position that needs ablation. After introducing endothelial cells (HUVEC) in the carved regions,…"

      • L512-516 (Discussion): "The speed capabilities drastically improve with the pulsing repetition rate. Given that our laser source emits pulses at 10kHz, as compared to other photoablation lasers with repetitions around 100 Hz, our solution could potentially gain a factor of 100."

      • L1082-1087 (Materials and Methods): "…, and imported in a python code. The control of the various elements is embedded and checked for this specific set of hardware. The code is available upon request." Adding these three paragraphs gives more details on how photoablation works thus improving the manuscript.
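
      The first step described above, turning a design into a list of positions to ablate, together with the repetition-rate argument, can be sketched as follows. This is an illustration only and not the authors' control code, which is stated to be available upon request. The binary mask, the pitch, the serpentine scan order, and the number of pulses per position are all hypothetical, and the time estimate ignores stage travel between positions.

```python
def design_to_positions(mask, pitch_um):
    """Convert a binary design (rows of 0/1) into (x_um, y_um) positions,
    visiting rows in serpentine order (an assumed scan strategy)."""
    positions = []
    for row_idx, row in enumerate(mask):
        cols = range(len(row))
        if row_idx % 2:  # reverse every other row to shorten travel
            cols = reversed(cols)
        for col_idx in cols:
            if row[col_idx]:
                positions.append((col_idx * pitch_um, row_idx * pitch_um))
    return positions


def exposure_time_s(n_positions, pulses_per_position, repetition_rate_hz):
    """Pure laser-on time; stage movement is not included."""
    return n_positions * pulses_per_position / repetition_rate_hz


# Hypothetical 3 x 4 design: 1 marks a spot to ablate.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]
positions = design_to_positions(mask, pitch_um=2.0)
fast = exposure_time_s(len(positions), pulses_per_position=50, repetition_rate_hz=10_000)
slow = exposure_time_s(len(positions), pulses_per_position=50, repetition_rate_hz=100)
print(len(positions), fast, slow, slow / fast)
```

      With the same number of pulses per position, the ratio of the two exposure times equals the ratio of the repetition rates, which is where the potential factor of 100 comes from.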

      Minor comments:

      Supplementary Fig 2. The reference to subpanels H and I is swapped.

      The references to subpanels H and I have been corrected in the revised version.

      Line 203: I would suggest to delete this sentence. Although a strength of the submitted paper is the direct comparison of the VoC model with the animal model to better replicate Neisseria infection, a direct comparison with animal permeability is not needed in all vascular engineering papers, as vascular permeability measurements in animals have been well established in the past.

      The sentence "While previously developed VoC platforms aimed at replicating physiological permeability properties, they often lack direct comparisons with in vivo values." has been removed from the revised text.

      Fig 3: Bacteria binding experiments. I would suggest the addition of more methodological information in the main results text to guarantee a good interpretation of the experiment. First, it would be better that wall shear stress rather than flow rate is described in the main text, as flow rate is dependent on the geometry of the vessel being used. Second, how long was the perfusion of Neisseria in the binding experiment performed to quantify colony doubling or elongation? As per figure 1C, I would guess that it was 100 min, but it would be better if this information is directly given to the readers.

      We thank Reviewer #1 for these two suggestions, which improve the clarity of the text (e.g., L316). (i) We now express the flow rate in terms of shear stress. (ii) We have also normalized the quantification of the colony doubling time to the first time point at which a single bacterium is attached to the vessel wall. Thus, early-adhering bacteria are represented by longer curves and late-adhering bacteria by shorter curves. In total, the experiment lasted for 3 hours (modifications appear in L318 and L321-326).
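
      The reviewer's point that a flow rate only becomes comparable across geometries once converted to wall shear stress can be illustrated with the Poiseuille relation for a straight cylindrical vessel, τ = 4μQ/(πr³). The sketch below assumes steady laminar flow of a Newtonian fluid; the radius and the water-like viscosity are hypothetical values, not taken from the manuscript, and the function name is an assumption.

```python
from math import pi


def wall_shear_stress_pa(flow_ul_per_min, radius_um, viscosity_pa_s=1e-3):
    """Wall shear stress (Pa) in a cylindrical channel under Poiseuille flow."""
    q = flow_ul_per_min * 1e-9 / 60.0  # µl/min -> m^3/s
    r = radius_um * 1e-6               # µm -> m
    return 4.0 * viscosity_pa_s * q / (pi * r ** 3)


# Hypothetical vessel of 50 µm radius perfused at 1 µl/min.
tau = wall_shear_stress_pa(flow_ul_per_min=1.0, radius_um=50.0)
print(f"wall shear stress: {tau:.3f} Pa")
```

      Because the stress scales with 1/r³, the same flow rate in a vessel of half the radius produces an eightfold higher wall shear stress.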

      Fig 4: The honeycomb structure is not visible in the 3D rendering of panel D. I would recommend to show the actin staining in the absence of Neisseria staining as well.

      According to this suggestion, a zoom of the 3D rendering of the cortical plaque without the colony has been added to Figure 4 of the revised manuscript.

      Line 421: E-selectin is referred as CD62E in this sentence. I would suggest to use the same terminology everywhere.

      We have replaced the "CD62E" term with "E-selectin" to improve clarity.

      Line 508: "This difference is most likely associated with the presence of other cell types in the in vivo tissues and the onset of intravascular coagulation". Do the authors refer to the presence of perivascular cells, pericytes or fibroblasts? If so, it could be good to mention them, as well as those future iterations of the model could include the presence of these cell types.

      By "other cell types", we refer to pericytes (3), fibroblasts (4), and perivascular macrophages (5), which surround endothelial cells and contribute to vessel stability. The main text was modified to include this information (Lines 548 and 555-570), and their potential roles during infection are discussed.

      Discussion: The discussion covers very well the advantages of the model over in vitro 2D endothelial models and the animal xenograft but fails to include limitations. This would include the choice of HUVEC cells, an umbilical vein cell line to study microcirculation, the lack of perivascular cells or limitations on the fabrication technique regarding application in other labs (if any).

      We thank Reviewer #1 for this suggestion. Indeed, our manuscript did not sufficiently explain its limitations, and adding them to the text will help improve it:

      • The perspectives of our model include introducing perivascular cells surrounding the vessel and fibroblasts into the collagen gel as discussed previously and added in the discussion part (L555-570).

      • Our choice of HUVECs aimed at recapitulating key features of venules, such as the overexpression of CD62E and the adhesion of neutrophils during inflammation. Using microvascular endothelial cells originating from different tissues would be very interesting. This possibility is now mentioned in the discussion (lines 567-568).

      • Photoablation is a homemade fabrication technique that can be implemented in any lab equipped with an epifluorescence microscope. This method is now described in more detail in the revised manuscript (L1085-1087).

      Line 576: The authors state that the model could be applied to other systemic infections but failed to mention that some infections have already been modelled in 3D bioengineered vascular models (examples found in https://doi.org/10.1038/s41564-024-01645-6). This includes a capillary photoablated vascular model to study malaria (DOI: 10.1126/sciadv.aay724).

      These two important references have been added to the main text (L84, 647, 648).

      Line 1213: Is the 6M neutrophil solution in 10 µl under flow? Also, I would suggest rewriting the sentence in the following line: "After, the flow has been then added to the system at 0.7-1 µl/min."

      We now specify that neutrophils are circulated in the chip under flow conditions (lines 1321-1322).

      Significance

      The manuscript is comprehensive, complete and represents the first bioengineered model of sepsis. One of the major strengths is the careful characterization and benchmarking against the animal xenograft model. Its main limitation is the brief description of the photoablation methodology, and more clarity is needed in the description of bacteria perfusion experiments, given their complexity. The manuscript will be of interest for the general infection community and to the tissue engineering community if more details on fabrication methods are included. My expertise is on infection bioengineered models.

      Reviewer #2 (Evidence, reproducibility, and clarity):

      Summary:

      The authors develop a Vessel-on-Chip model, which has geometrical and physical properties similar to the murine vessels used in the study of systemic infections. The vessel was created via highly controllable laser photoablation in a collagen matrix, subsequent seeding of human endothelial cells and flow perfusion to induce mechanical cues. This vessel could be infected with Neisseria meningitidis, as a model of systemic infection. In this model, microcolony formation and dynamics, and effects on the host were very similar to those described for the human skin xenograft mouse, which is the current gold standard for these studies, and were consistent with observations made in patients. The model could also recapitulate the neutrophil response upon N. meningitidis systemic infection.

      Major comments:

      I have no major comments. The claims and the conclusions are supported by the data, the methods are properly presented and the data is analyzed adequately. Furthermore, I would like to propose an optional experiment that could improve the manuscript. In the discussion it is stated that the vascular geometry might contribute to bacterial colonization in areas of lower velocity. It would be interesting to recapitulate this experimentally. It is of course optional but it would be of great interest, since this is something that can only be proven in the organ-on-chip (where flow speed can be tuned) and not as much in animal models. Besides, it would increase impact, demonstrating the superiority of the chip in this area rather than proving to be equal to current models.

      We have conducted additional experiments on infection in different vascular geometries and have now added these results in Figure 3/S3 and lines 288-305. We compared shear stress levels, as determined by Comsol simulation, with experimentally determined bacterial adhesion sites. In the conditions used, the range of shear stress generated by the tested geometries does not appear to change the efficiency of bacterial adhesion. These results are consistent with a previous study from our group, which shows that in this range of shear stresses the effect on adhesion is limited (6). Furthermore, qualitative observations in the animal model indicate that bacteria do not have an obvious preference in terms of binding site.
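
      For readers who want a rough sense of how a flow rate translates into wall shear stress, the sketch below uses the Hagen-Poiseuille relation for steady laminar flow of a Newtonian fluid in a straight rigid cylinder, τ = 4μQ/(πr³). This is a simplified textbook estimate, not the Comsol simulation used in the manuscript, and it does not capture complex geometries. The radius and viscosity values are illustrative assumptions and are not taken from the manuscript.

```python
import math

def wall_shear_stress_pa(flow_ul_per_min, radius_um, viscosity_pa_s):
    """Wall shear stress (Pa) for Poiseuille flow in a straight cylinder.

    tau = 4 * mu * Q / (pi * r^3), with Q in m^3/s, r in m and mu in Pa.s.
    """
    q_m3_per_s = flow_ul_per_min * 1e-9 / 60.0  # 1 ul = 1e-9 m^3
    r_m = radius_um * 1e-6
    return 4.0 * viscosity_pa_s * q_m3_per_s / (math.pi * r_m ** 3)

# Assumed values: 1 ul/min, 50 um radius, viscosity close to water at 37 C.
tau = wall_shear_stress_pa(1.0, 50.0, 0.7e-3)
print(f"{tau:.3f} Pa = {10 * tau:.2f} dyn/cm^2")  # 1 Pa = 10 dyn/cm^2
```

      Because the radius enters as r³, halving the vessel radius at constant flow rate increases the estimated wall shear stress eightfold, which is why local geometry can matter for adhesion.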

      Minor comments:

      I have a series of suggestions which, in my opinion, would improve the discussion. They are further elaborated in the following section, in the context of the limitations.

      • How to recapitulate the vessels in the context of a specific organ or tissue? If the pathogen is often found in the luminal space of other organs after disseminating from the blood, how can this process be recapitulated with this model, if at all?

      For reasons that are not fully understood, postmortem histological studies reveal bacteria only inside blood vessels but rarely, if ever, in the organ parenchyma. The presence of intravascular bacteria could nevertheless alter cells in the tissue parenchyma. The notable exception is the brain, where bacteria exit the vascular lumen to access the cerebrospinal fluid. The chip we describe is fully adapted to develop a blood-brain barrier model and more specific organ environments. This implies the addition of more cell types in the hydrogel. A paragraph on this topic has been added (Lines 548 and 552-570).

      • Similarly, could other immune responses related to systemic infection be recapitulated? The authors could discuss the potential of including other immune cells that might be found in the interstitial space, for example.

      This important discussion point has been added to the manuscript (L623-636). As suggested by Reviewer #2, other immune cells respond to N. meningitidis and can be explored using our model. For instance, macrophages and dendritic cells are activated upon N. meningitidis infection, eliminate the bacteria through phagocytosis, and produce pro-inflammatory cytokines and chemokines, potentially activating lymphocytes (7). Such an immune response, although complex, would be interesting to study in our model, as skin-xenograft mice are deprived of B and T lymphocytes to ensure acceptance of human skin grafts.

      • A minor correction: in line 467 it should probably be "aspects" instead of "aspect", and the authors could consider rephrasing that sentence slightly for increased clarity.

      We have revised the sentence to read: "we demonstrated that our VoC strongly replicates key aspects of the in vivo human skin xenograft mouse model, the gold standard for studying meningococcal disease under physiological conditions." (lines 499-503).

      Strengths and limitations

      The most important strength of this manuscript is the technology they developed to build this model, which is impressive and very innovative. The Vessel-on-Chip can be tuned to acquire complex shapes and, according to the authors, the process has been optimized to produce models very quickly. This is a great advancement compared with the technologies used to produce other equivalent models. This model proves to be equivalent to the most advanced model used to date, but allows microscopy to be performed with higher resolution and ease, which can in turn allow more complex and precise image-based analysis. However, the authors do not seem to present any new mechanistic insights obtained using this model. All the findings obtained in the infection-on-chip demonstrate that the model is equivalent to the human skin xenograft mouse model, and can offer superior resolution for microscopy. However, the advantages of the model do not seem to be exploited to obtain more insights on the pathogenicity mechanisms of N. meningitidis, host-pathogen interactions or potential applications in the discovery of potential treatments. For example, experiments to elucidate the role of certain N. meningitidis genes on infection could enrich the manuscript and prove the superiority of the model. However, I understand these experiments are time-consuming and out of the scope of the current manuscript. In addition, the model lacks the multicellularity that characterizes other similar models. The authors mention that the pathogen can be found in the luminal space of several organs; however, this luminal space has not been recapitulated in the model. Even though this would be a new project, it would be interesting for the authors to hypothesize about the possibilities of combining this model with other organ models. The inclusion of circulating neutrophils is a great asset; however, it would also be interesting to hypothesize about how to recapitulate other immune responses related to systemic infection.

      We thank Reviewer #2 for their comments on the strengths and limitations of our work. The difficulty is that our study opens many future research directions and applications. We hope that the work serves as the basis for many future studies, but one can only address a limited set of experiments in a single manuscript.

      • Experiments investigating the role of N. meningitidis genes require significant optimization of the system. Multiplexing is a potential avenue for future development, which would allow the testing of many mutants. The fast photoablation approach is particularly amenable to such adaptation.

      • Cells and bacteria inside the chambers could be isolated and analyzed at the transcriptomic level or by flow cytometry. This would imply optimizing a protocol for collecting cells from the device via collagenase digestion, for instance. This type of approach would also benefit from multiplexing to enhance the number of cells.

      • As mentioned above, the revised manuscript discusses the multicellular capabilities of our model, including the integration of additional immune cells and potential connections to other organ systems. We believe that these approaches are feasible and valuable for studying various aspects of N. meningitidis infection.

      Advance

      The most important advance of this manuscript is technical: the development of a model that proves to be equivalent to the most complex model used to date to study meningococcal systemic infections. The human skin xenograft mouse model requires complex surgical techniques and has the practical and ethical limitations associated with the use of animals. However, the Infection-on-chip model is completely in vitro, can be produced quickly, and allows to precisely tune the vessel’s geometry and to perform higher resolution microscopy. Both models were comparable in terms of the hallmarks defining the disease, suggesting that the presented model can be an effective replacement of the animal use in this area.

      Other vessel-on-chip models can recapitulate an endothelial barrier in a tube-like morphology, but do not recapitulate other complex geometries, that are more physiologically relevant and could impact infection (in addition to other non-infectious diseases). However, in the manuscript it is not clear whether the different morphologies are necessary to study or recapitulate N. meningitidis infection, or if the tubular morphologies achieved in other similar models would suffice.

      Audience

      This manuscript might be of interest for a specialized audience focusing on the development of microphysiological models. The technology presented here can be of great interest to researchers whose main area of interest is the endothelium and the blood vessels, for example, researchers on the study of systemic infections, atherosclerosis, angiogenesis, etc. Thus, the tool presented (vessel-on-chip) can have great applications for a broad audience. However, even when the method might be faster and easier to use than other equivalent methods, it could still be difficult to implement in another laboratory, especially if it lacks expertise in bioengineering. Therefore, the method could be more of interest for laboratories with expertise in bioengineering looking to expand or optimize their toolbox. Alternatively, this paper presents itself as an opportunity to begin collaborations, since the model could be used to test other pathogens or conditions.

      Field of expertise:

      Infection biology, organ-on-chip, fungal pathogens.

      I lack the expertise to evaluate the image-based analysis.

      References

      (1) Gyohei Egawa, Satoshi Nakamizo, Yohei Natsuaki, Hiromi Doi, Yoshiki Miyachi, and Kenji Kabashima. Intravital analysis of vascular permeability in mice using two-photon microscopy. Scientific Reports, 3(1):1932, Jun 2013. ISSN 2045-2322. doi: 10.1038/srep01932.

      (2) Valeria Manriquez, Pierre Nivoit, Tomas Urbina, Hebert Echenique-Rivera, Keira Melican, Marie-Paule Fernandez-Gerlinger, Patricia Flamant, Taliah Schmitt, Patrick Bruneval, Dorian Obino, and Guillaume Duménil. Colonization of dermal arterioles by Neisseria meningitidis provides a safe haven from neutrophils. Nature Communications, 12(1):4547, Jul 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-24797-z.

      (3) Mats Hellström, Holger Gerhardt, Mattias Kalén, Xuri Li, Ulf Eriksson, Hartwig Wolburg, and Christer Betsholtz. Lack of pericytes leads to endothelial hyperplasia and abnormal vascular morphogenesis. Journal of Cell Biology, 153(3):543–554, Apr 2001. ISSN 0021-9525. doi: 10.1083/jcb.153.3.543.

      (4) Arsheen M. Rajan, Roger C. Ma, Katrinka M. Kocha, Dan J. Zhang, and Peng Huang. Dual function of perivascular fibroblasts in vascular stabilization in zebrafish. PLOS Genetics, 16(10):1–31, 10 2020. doi: 10.1371/journal.pgen.1008800.

      (5) Huanhuan He, Julia J. Mack, Esra Güç, Carmen M. Warren, Mario Leonardo Squadrito, Witold W. Kilarski, Caroline Baer, Ryan D. Freshman, Austin I. McDonald, Safiyyah Ziyad, Melody A. Swartz, Michele De Palma, and M. Luisa Iruela-Arispe. Perivascular macrophages limit permeability. Arteriosclerosis, Thrombosis, and Vascular Biology, 36(11):2203–2212, 2016. doi: 10.1161/ATVBAHA.116.307592.

      (6) Emilie Mairey, Auguste Genovesio, Emmanuel Donnadieu, Christine Bernard, Francis Jaubert, Elisabeth Pinard, Jacques Seylaz, Jean-Christophe Olivo-Marin, Xavier Nassif, and Guillaume Dumenil. Cerebral microcirculation shear stress levels determine Neisseria meningitidis attachment sites along the blood–brain barrier. Journal of Experimental Medicine, 203(8):1939–1950, 07 2006. ISSN 0022-1007. doi: 10.1084/jem.20060482.

      (7) Riya Joshi and Sunil D. Saroj. Survival and evasion of Neisseria meningitidis from macrophages. Medicine in Microecology, 17:100087, 2023. ISSN 2590-0978. doi: 10.1016/j.medmic.2023.100087.

    1. Author Response:

      Assessment note: “Whereas the results and interpretations are generally solid, the mechanistic aspect of the work and conclusions put forth rely heavily on in vitro studies performed in cultured L6 myocytes, which are highly glycolytic and generally not viewed as a good model for studying muscle metabolism and insulin action.”

      While we acknowledge that in vitro models may not fully recapitulate the complexity of in vivo systems, we believe that our use of L6 myotubes is appropriate for studying the mechanisms underlying muscle metabolism and insulin action. As mentioned below (reviewer 2, point 1), L6 myotubes possess many important characteristics relevant to our research, including high insulin sensitivity and a similar mitochondrial respiration sensitivity to primary muscle fibres. Furthermore, several studies have demonstrated the utility of L6 myotubes as a model for studying insulin sensitivity and metabolism, including our own previous work (PMID: 19805130, 31693893, 19915010).

      In addition, we have provided evidence of the similarities, at the protein level, between L6 cells overexpressing SMPD5 and human muscle biopsies. We have also shown that the negative correlation between ceramide and Coenzyme Q observed in L6 cells is reproduced in vivo, specifically in the skeletal muscle of mice fed a chow diet. These findings support the relevance of our in vitro results to in vivo muscle metabolism.

      Finally, we will supplement our findings by demonstrating a comparable relationship between ceramide and Coenzyme Q in mice exposed to a high-fat diet, to be shown in Supplementary Figure 4 H-I. Further animal experiments will be performed to validate our cell-line based conclusions. We hope that these additional results address the concerns raised by the reviewer and further support the relevance of our in vitro findings to in vivo muscle metabolism and insulin action.

      Points from reviewer 1:

      1. Although the authors' results suggest that higher mitochondrial ceramide levels suppress cellular insulin sensitivity, they rely solely on a partial inhibition (i.e., 30%) of insulin-stimulated GLUT4-HA translocation in L6 myocytes. It would be critical to examine how much the increased mitochondrial ceramide would inhibit insulin-induced glucose uptake in myocytes using radiolabel deoxy-glucose.

      Response: The primary impact of insulin is to facilitate the translocation of glucose transporter type 4 (GLUT4) to the cell surface, which effectively enhances the maximum rate of glucose uptake into cells. Therefore, assessing the quantity of GLUT4 present at the cell surface in non-permeabilized cells is widely regarded as the most reliable measure of insulin sensitivity (PMID: 36283703, 35594055, 34285405). Additionally, plasma membrane GLUT4 and glucose uptake are highly correlated. Whilst we have routinely measured glucose uptake with radiolabelled glucose in the past, we do not believe that evaluating glucose uptake provides a better assessment of insulin sensitivity than GLUT4.

      We will clarify the use of GLUT4 translocation in the Results section:

      “...For this reason, several in vitro models have been employed involving incubation of insulin sensitive cell types with lipids such as palmitate to mimic lipotoxicity in vivo. In this study we will use cell surface GLUT4-HA abundance as the main readout of insulin response...”

      2. Another important question to be addressed is whether glycogen synthesis is affected in myocytes under these experimental conditions. Results demonstrating reductions in insulin-stimulated glucose transport and glycogen synthesis in myocytes with dysfunctional mitochondria due to ceramide accumulation would further support the authors' claim.

      Response: We have carried out supplementary experiments to investigate glycogen synthesis in our insulin-resistant models. Our approach involved L6-myotubes overexpressing the mitochondrial-targeted construct ASAH1 (as described in Fig. 3). We then challenged them with palmitate and measured glycogen synthesis using 14C radiolabeled glucose. Our observations indicated that palmitate suppressed insulin-induced glycogen synthesis, which was effectively prevented by the overexpression of ASAH1 (N = 5, * p<0.05). These results provide additional evidence highlighting the role of dysfunctional mitochondria in muscle cell glucose metabolism.

      These data will be added to Supplementary Figure 4K and the results modified as follows:

      “Notably, mtASAH1 overexpression protected cells from palmitate-induced insulin resistance without affecting basal insulin sensitivity (Fig. 3E). Similar results were observed using insulin-induced glycogen synthesis as an orthogonal technique for Glut4 translocation. These results provide additional evidence highlighting the role of dysfunctional mitochondria in muscle cell glucose metabolism (Sup. Fig. 5K). Importantly, mtASAH1 overexpression did not rescue insulin sensitivity in cells depleted…”

      We will add to the method section:

      “L6 myotubes overexpressing ASAH1 were grown and differentiated in 12-well plates, as described in the Cell lines section, and stimulated for 16 h with palmitate-BSA or EtOH-BSA, as detailed in the Induction of insulin resistance section.

      On day seven of differentiation, myotubes were serum starved in plain DMEM for 3.5 hours. After incubation for 1 hour at 37 °C with 2 µCi/ml D-[U-14C]-glucose in the presence or absence of 100 nM insulin, the glycogen synthesis assay was performed as previously described (Zarini S. et al., J Lipid Res, 63(10): 100270, 2022).”

      3. In addition, it would be critical to assess whether the increased mitochondrial ceramide and consequent lowering of energy levels affect all exocytic pathways in L6 myoblasts or just the GLUT4 trafficking. Is the secretory pathway also disrupted under these conditions?

      Response: As the secretory pathway primarily involves the synthesis and transportation of soluble proteins that are secreted into the extracellular space, and given that the majority of cellular transmembrane proteins (excluding those of the mitochondria) use this pathway to arrive at their ultimate destination, we believe that the question posed by the reviewer is highly challenging and beyond the scope of our research. We will add this to the discussion:

      “...the abundance of mPTP associated proteins suggesting a role of this pore in ceramide induced insulin resistance (Sup. Fig. 6E). In addition, it is yet to be determined whether the trafficking defect is specific to Glut4 or if it affects the exocytic-secretory pathway more broadly…”

      Points from reviewer 2:

      1. The mechanistic aspect of the work and conclusions put forth rely heavily on studies performed in cultured myocytes, which are highly glycolytic and generally viewed as a poor model for studying muscle metabolism and insulin action. Nonetheless, the findings provide a strong rationale for moving this line of investigation into mouse gain/loss of function models.

      Response: The relative contributions of anaerobic (glycolysis) and aerobic (mitochondrial) metabolism in L6 cells can change depending on the differentiation stage. For instance, Serrage et al. (PMID: 30701682) demonstrated that L6 myotubes have a higher mitochondrial abundance and aerobic metabolism than L6 myoblasts. Others have used elegant transcriptomic analysis and metabolic characterisation to compare different skeletal muscle models for studying insulin sensitivity. For instance, Abdelmoez et al. in 2020 (PMID: 31825657) reported that L6 myotubes exhibit greater insulin-stimulated glucose uptake and oxidative capacity compared with C2C12 and human skeletal muscle cells (HSMC). Overall, L6 cells exhibit higher metabolic rates and primarily rely on aerobic metabolism, while C2C12 and HSMC cells rely on anaerobic glycolysis. It is worth noting that L6 myotubes are the cell line most closely related to adult human muscle when compared with other muscle cell lines (PMID: 31825657). Our results in Figure 6 H and I provide evidence for the similarities between L6 cells overexpressing SMPD5 and human muscle biopsies. Additionally, in Figure 3J-K, we demonstrate that the negative correlation between ceramide and Coenzyme Q observed in L6 cells is reproduced in vivo, specifically in the skeletal muscle of mice fed a chow diet. Furthermore, we have supplemented these findings by demonstrating a comparable relationship in mice exposed to a high-fat diet, as shown in Supplementary Figure 4 H-I (refer to point 4). We will clarify these points in the Discussion:

      “In this study, we mainly utilised L6-myotubes, which share many important characteristics with primary muscle fibres relevant to our research. Both types of cells exhibit high sensitivity to insulin and respond similarly to maximal doses of insulin, with Glut4 translocation stimulated between 2 to 4 times over basal levels in response to 100 nM insulin (as shown in Fig. 1-4 and (46,47)). Additionally, mitochondrial respiration in L6-myotubes has a similar sensitivity to mitochondrial poisons, as observed in primary muscle fibres (as shown in Fig. 5 (48)). Finally, inhibiting ceramide production increases CoQ levels in both L6-myotubes and adult muscle tissue (as shown in Fig. 2-3). Therefore, L6-myotubes possess the necessary metabolic features to investigate the role of mitochondria in insulin resistance, and this relationship is likely applicable to primary muscle fibres”.

      We will also add additional data - in point 2 - from differentiated human myocytes that are consistent with our observations from the L6 models. Additional experiments are in progress to further extend these findings.

      2. One caveat of the approach taken is that exposure of cells to palmitate alone is not reflective of in vivo physiology. It would be interesting to know if similar effects on CoQ are observed when cells are exposed to a more physiological mixture of fatty acids that includes a high ratio of palmitate, but better mimics in vivo nutrition.

      Response: Palmitate is widely recognized as a trigger for insulin resistance and ceramide accumulation, which mimics diet-induced insulin resistance in rodents and humans. Previous studies have compared the effects of a lipid mixture versus palmitate on inducing insulin resistance in skeletal muscle, and have found that the strong disruption in insulin sensitivity caused by palmitate exposure was lessened with physiologic mixtures of fatty acids, even with a high proportion of saturated fatty acids. This was attributed, in part, to the selective partitioning of fatty acids into neutral lipids (such as TAG) when muscle cells are exposed to physiologic lipid mixtures (Newsom et al., PMID: 25793412). Hence, we think that using palmitate is a better strategy to study lipid-induced insulin resistance in vitro. We will add to the results:

      “In vitro, palmitate conjugated with BSA is the preferred strategy for inducing insulin resistance, as lipid mixtures tend to partition into triacylglycerides (33)”.

      We are also performing additional in vivo experiments to add to the physiological relevance of the findings.

      3. While the utility of targeting SMPD5 to the mitochondria is appreciated, the results in Figure 5 suggest that this manoeuvre caused a rather severe form of mitochondrial dysfunction. This could be more representative of toxicity rather than pathophysiology. It would be helpful to know if these same effects are observed with other manipulations that lower CoQ to a similar degree. If not, the discrepancies should be discussed.

      Response: We conducted a staining procedure using the mitochondrial marker mitoDsRED to observe the effect of SMPD5 overexpression on cell toxicity. The resulting images, displayed in the figure below (Author response image 1), demonstrate that the overexpression of SMPD5 did not result in any significant changes in cell morphology or impact the differentiation potential of our myoblasts into myotubes.

      Author response image 1.

      In addition, we evaluated cell viability in HeLa cells following exposure to SACLAC (2 µM) to induce CoQ depletion (left panel). Specifically, we measured cell death by monitoring the uptake of propidium iodide (PI), as shown in the right panel. Our results demonstrated that SACLAC did not lead to cell death at the doses used for CoQ depletion (Author response image 2).

      Author response image 2.

      Therefore, we deem it improbable that the observed effect is caused by cellular toxicity; rather, it represents a pathological condition induced by elevated levels of ceramides. We will add to the discussion:

      “...downregulation of the respirasome induced by ceramides may lead to CoQ depletion. Despite the significant impact of ceramide on mitochondrial respiration, we did not observe any indications of cell damage in any of the treatments, suggesting that our models are not explained by toxic/cell death events.”

      4. The conclusions could be strengthened by more extensive studies in mice to assess the interplay between mitochondrial ceramides, CoQ depletion and ETC/mitochondrial dysfunction in the context of a standard diet versus HF diet-induced insulin resistance. Does P053 affect mitochondrial ceramide, ETC protein abundance, mitochondrial function, and muscle insulin sensitivity in the predicted directions?

      Response: We would like to note that the metabolic characterization and assessment of ETC/mitochondrial function in these mice (fed either a high-fat (HF) or a chow diet, with or without P053) were previously published (Turner N, PMID: 30131496). In addition, we have conducted targeted metabolomic and lipidomic analyses to investigate the impact of P053 on ceramide and CoQ levels in HF-fed mice. As illustrated in the figures below (Author response image 3), the administration of P053 led to a reduction in ceramide levels (left panel) and an increase in CoQ levels (right panel) in HF-fed mice, which is consistent with our in vitro findings.

      Author response image 3.

      We will add to results:

      “…similar effect was observed in mice exposed to a high fat diet for 5 wks (Supp. Fig. 4H-I; further phenotypic and metabolic characterization of these animals can be found in (41))”

      We will perform further in vivo studies to corroborate these findings.

    1. Author response:

      eLife assessment

      This useful study reports how neuronal activity in the prefrontal cortex maps time intervals during which animals have to wait until reaching a reward and how this mapping is preserved across days. However, the evidence supporting the claims is incomplete as these sequential neuronal patterns do not necessarily represent time but instead may be correlated with stereotypical behavior and restraint from impulsive decision, which would require further controls (e.g. behavioral analysis) to clarify the main message. The study will be of interest to neuroscientists interested in decision making and motor control. 

      We thank the editors and reviewers for the constructive comments. In light of the questions mentioned by the reviewers, we plan to perform additional analyses in our revision, particularly aiming to address issues related to single-cell scalability, and effects of motivation and movement. We believe these additional data will greatly improve the rigor and clarity of our study. We are grateful for the review process of eLife.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper investigates the neural population activity patterns of the medial frontal cortex in rats performing a nose poking timing task using in vivo calcium imaging. The results showed neurons that were active at the beginning and end of the nose poking and neurons that formed sequential patterns of activation that covaried with the timed interval during nose poking on a trial-by-trial basis. The former were not stable across sessions, while the latter tended to remain stable over weeks. The analysis on incorrect trials suggests the shorter non-rewarded intervals were due to errors in the scaling of the sequential pattern of activity. 

      Strengths:

      This study measured stable signals using in vivo calcium imaging during experimental sessions that were separated by many days in animals performing a nose poking timing task. The correlation analysis on the activation profile to separate the cells in the three groups was effective and the functional dissociation between beginning and end, and duration cells was revealing. The analysis on the stability of decoding of both the nose poking state and poking time was very informative. Hence, this study dissected a neural population that formed sequential patterns of activation that encoded timed intervals. 

      We thank the reviewer for the positive comments.

      Weaknesses: 

      It is not clear whether animals had enough simultaneously recorded cells to perform the analyses of Figures 2-4. In fact, rat 3 had 18 responsive neurons, which is probably not enough to get robust neural sequences for the trial-by-trial analysis and the correct and incorrect trial analysis. 

      We thank the reviewer for the comment. We would like to mention that the 18 cells plotted in Supplementary figure 1 were only from the duration cell category. To improve the clarity of our results, we are going to provide information regarding the number of cells from each rat in our revision. In general, we imaged more than 50 cells from each rat. We would also like to point to the data from individual trials in Supplementary figure 1B showing robust sequentiality.

      In addition, the analysis of behavioral errors could be improved. The analysis in Figure 4A could be replaced by a detailed analysis on the speed, and the geometry of neural population trajectories for correct and incorrect trials.

      We thank the reviewer for the suggestions. We are going to conduct the analysis as the reviewer recommended. We agree with the reviewer that better presentation of the neural activity will be helpful for the readers.

      In the case of Figure 4G, it is not clear why the density of errors formed two clusters instead of having a linear relation with the produced duration. It would be recommendable to compute the scaling factor on neuronal population trajectories and single cell activity, or to compute the center of mass, to test the type III errors. 

      We would like to mention that the prediction errors plotted in this graph were calculated from two types of trials. The correct trials tended to show positive time estimation errors while the incorrect trials showed negative time estimation errors. We believe that the polarity switch between these two types suggested a possible use of this neural mechanism to time the action of the rats.

      In addition, we are going to perform the analysis suggested by the reviewer in our revision. We agree that different ways of analyzing the data would provide better characterization of the scaling effect.

      Due to the slow time resolution of calcium imaging, it is difficult to perform robust analysis on ramping activity. Therefore, I recommend downplaying the conclusion that: "Together, our data suggest that sequential activity might be a more relevant coding regime than the ramping activity in representing time under physiological conditions." 

      We agree with the reviewer and we have mentioned this caveat in our original manuscript. We are going to rephrase the sentence as the reviewer suggested during our revision.

      Reviewer #2 (Public Review):

      In this manuscript, Li and collaborators set out to investigate the neuronal mechanisms underlying "subjective time estimation" in rats. For this purpose, they conducted calcium imaging in the prefrontal cortex of water-restricted rats that were required to perform an action (nosepoking) for a short duration to obtain drops of water. The authors provided evidence that animals progressively improved in performing their task. They subsequently analyzed the calcium imaging activity of neurons and identified start, duration, and stop cells associated with the nose poke. Specifically, they focused on duration cells and demonstrated that these cells served as a good proxy for timing on a trial-by-trial basis, scaling their pattern of activity in accordance with changes in behavioral performance. In summary, as stated in the title, the authors claim to provide mechanistic insights into subjective time estimation in rats, a function they deem important for various cognitive conditions. 

      This study aligns with a wide range of studies in system neuroscience that presume that rodents solve timing tasks through an explicit internal estimation of duration, underpinned by neuronal representations of time. Within this framework, the authors performed complex and challenging experiments, along with advanced data analysis, which undoubtedly merits acknowledgement. However, the question of time perception is a challenging one, and caution should be exercised when applying abstract ideas derived from human cognition to animals. Studying so-called time perception in rats has significant shortcomings because, whether acknowledged or not, rats do not passively estimate time in their heads. They are constantly in motion. Moreover, rats do not perform the task for the sake of estimating time but to obtain their rewards, as they are water restricted. Their behavior will therefore reflect their motivation and urgency to obtain rewards. Unfortunately, it appears that the authors are not aware of these shortcomings. These alternative processes (motivation, sensorimotor dynamics) that occur during task performance are likely to influence neuronal activity. Consequently, my review will be rather critical. It is not, however, intended to be dismissive. I acknowledge that the authors may have been influenced by numerous published studies that already draw similar conclusions. Unfortunately, all the data presented in this study can be explained without invoking the concept of time estimation. Therefore, I hope the authors will find my comments constructive and understand that as scientists, we cannot ignore alternative interpretations, even if they conflict with our a priori philosophical stance (e.g., duration can be explicitly estimated by reading neuronal representation of time) and anthropomorphic assumptions (e.g., rats estimate time as humans do). While space is limited in a review, if the authors are interested, they can refer to a lengthy review I recently published on this topic, which demonstrates that my criticism is supported by a wide range of timing experiments across species (Robbe, 2023). In addition to this major conceptual issue that casts doubt on most of the conclusions of the study, there are also several major statistical issues. 

      Main Concerns 

      (1) The authors used a task in which rats must poke for a minimal amount of time (300 ms and then 1500 ms) to be able to obtain a drop of water delivered a few centimeters right below the nosepoke. They claim that their task is a time estimation task. However, they forget that they work with thirsty rats that are eager to get water sooner rather than later (there is a reason why they start with a short duration!). This task is mainly probing the animals' ability to wait (that is, impulse control) rather than time estimation per se. Second, the task does not require estimating time precisely because there appear to be no penalties when the nosepokes are too short or when they exceed the required duration. So it will be unclear if the variation in nosepoke reflects motivational changes rather than time estimation changes. The fact that this behavioral task is a poor assay for time estimation and rather reflects impulse control is shown by the tendency of animals to perform nose-pokes that are too short, the very slow improvement in their performance (Figure 1, with most of the mice making short responses), and the huge variability. Not only do the behavioral data not support the claim of the authors in terms of what the animals are actually doing (estimating time), but this also completely annihilates the interpretation of the Ca++ imaging data, which can be explained by motivational factors (changes in neuronal activity occurring while the animals nose poke may reflect a growing sense of urgency to check if water is available). 

      We would like to respond to the reviewer’s comments 1, 2 and 4 together since they all focus on the same issue. We thank the reviewer for the very thoughtful comments and for sharing his detailed reasoning from a recently published review (Robbe, 2023). A lot of the discussion goes beyond the scope of this study and we agree that whether there is an explicit representation of time (an internal clock) in the brain is a difficult question to answer, particularly by using animal behaviors. In fact, even with fully conscious humans and elaborated task design, we think it is still questionable to clearly dissociate the neural substrate of “timing” from “motor”. In the end, it may as well be that as the reviewer cited from Bergson’s article, the experience of time cannot be measured.

      Studying the neural representation of any internal state may suffer from the same ambiguity. With all due respect, however, we would like to limit our response to the scope of our results. According to the reviewer, two alternative interpretations of the task-related sequential activity exist: (1) duration cells may represent fidgeting or orofacial movements, and (2) duration cells may represent the motivation or motion plan of the rats. To test the first alternative interpretation, we will perform a more comprehensive analysis of the behavior data for all the limbs and visible body parts of the rat during the nose poke and analyze its periodicity among different trials, although the orofacial movements may not be visible to us.

      Regarding the second alternative interpretation, we think our data in the original Figure 4G argue against it. In this graph, we plotted the decoding error of time using the duration cells’ activity against the actual duration of the trials. If the sequential activity of duration cells only represents motivation, then the errors should be distributed evenly across different trial times, or be linearly modulated by trial durations. The unimodal distribution we observed (Figure 4G and see Author response image 1 below for a re-plot without signs) suggests that the scaling factor of the sequential activity represents information related to time. And the fact that this unimodal distribution is centered at the time threshold of the task provides strong evidence for the active use of the scaling factor for time estimation. In order to further test the relationship to motivation, we will measure the time interval between exiting the nose poke and the start of licking the water reward as an independent measurement of motivation for each trial. We will analyze and report whether this measurement correlates with the nose poking durations in our data in the revision.

      Author response image 1.
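
      The planned motivation check described above (relating the exit-to-lick interval to the nose poke duration on each trial) could be sketched as a simple per-trial correlation. This is only an illustration under stated assumptions: the function name, the variable names, and the numbers are hypothetical and are not data or code from the study, and the authors may choose a different statistic.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-trial values in seconds (illustrative only, not study data).
poke_duration = [0.9, 1.2, 1.6, 1.8, 2.1]
exit_to_lick_latency = [0.50, 0.42, 0.47, 0.55, 0.44]

r = pearson_r(poke_duration, exit_to_lick_latency)
print(f"correlation between poke duration and lick latency: {r:.2f}")
```

      A value of r near zero would be consistent with poke duration varying independently of this motivation proxy, whereas a strong correlation would support the reviewer's alternative interpretation.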

      Furthermore, whether the scaling sequential activity we report represents behavioral timing or true time estimation, the reviewer would agree that these activities correlate with the animal’s nose poking durations, and a previous study has shown that PFC silencing led to disruption of the mouse’s timing behavior (PMID: 24367075). The main surprising finding of the paper is that these duration cells are different from the start and end cells in terms of their coding stability. Thus, future studies dissecting the anatomical microcircuit of these duration cells may provide further clues regarding whether they receive inputs from thirst or reward-related brain regions. This may help partially resolve the “time” vs. “motor” debate the reviewer mentioned.

      (2) A second issue is that the authors seem to assume that rats are perfectly immobile and perform like some kind of robots that would initiate nose pokes, maintain them, and remove them in a very discretized manner. However, in this kind of task, rats are constantly moving from the reward magazine to the nose poke. They also move while nose-poking (either their body or their mouth), and when they come out of the nose poke, they immediately move toward the reward spout. Thus, there is a continuous stream of movements, including fidgeting, that will covary with timing. Numerous studies have shown that sensorimotor dynamics influence neural activity, even in the prefrontal cortex. Therefore, the authors cannot rule out that what the records reflect are movements (and the scaling of movement) rather than underlying processes of time estimation (some kind of timer). Concretely, start cells could represent the ending of the movement going from the water spout to the nosepoke, and end cells could be neurons that initiate (if one can really isolate any initiation, which I doubt) the movement from the nosepoke to the water spout. Duration cells could reflect fidgeting or orofacial movements combined with an increasing urgency to leave the nose pokes.

      (3) The statistics should be rethought for both the behavioral and neuronal data. They should be conducted separately for all the rats, as there is likely interindividual variability in the impulsivity of the animals.

      We thank the reviewer for the comment, yet we are not quite sure what specifically was asked by the reviewer. There is undoubtedly variance among individual animals. One of the core reasons for statistical comparison is to compare the group difference with the variance due to sampling. It appears that the reviewer would like us to conduct our analysis using each rat individually. We will conduct and report analyses for individual rats in Figure 1C, Figure 2C, G, K, and Figure 4F in our revised manuscript.

      (4) The fact that neuronal activity reflects an integration of movement and motivational factors rather than some abstract timing appears to be well compatible with the analysis conducted on the error trials (Figure 4), considering that the sensorimotor and motivational dynamics will rescale with the durations of the nose poke. 

      (5) The authors should mention upfront in the main text (result section) the temporal resolution allowed by their Ca++ probe and discuss whether it is fast enough with regard to the behavioral dynamics occurring in the task. 

      We thank the reviewer for the suggestion. We mentioned the caveat of calcium imaging in the interpretation of our results in the original manuscript. We will incorporate more text for this purpose during our revision. In terms of behavioral dynamics (start and end of nose poke in this case), we think calcium imaging could provide sufficient kinetics. However, the more refined dynamics related to the reproducibility of the sequential activity or the precise representation of individual cells on the scaled duration may benefit from improved time resolution.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Please refer explicitly to the three types of cells in the abstract. 

      We will modify the abstract as suggested during revision.

      (2) Please refer to the work of Betancourt et al., 2023 Cell Reports, where a trial-by-trial analysis on the correlation between neural trajectory dynamics in MPC and timing behavior is reported. In that same paper the stability of neural sequences across task parameters is reported. 

      We will cite and discuss this study in our revised paper.

      (3) Please state the number of studied animals at the beginning of the results section. 

      We will provide this information as requested. The number of animals was also plotted in Figure 1D for each analysis.

      (4) Why do the middle and right panels of Figure 2E show duration cells? 

      Figure 2E was intended to show examples of duration cells’ activity. We included different examples of cells that peak at different points in the scaled duration. We believe these multiple examples would give the readers a straightforward impression of these cells’ activity patterns.

      (5) Which behavioral sessions of Figure 1B were analyzed further? 

      We will label the analyzed sessions in Figure 1B during our revision.

      (6) In Figure 3A-C please increase the time before the beginning of the trial in order to visualize properly the activation patterns of the start cells. 

      We thank the reviewer for the suggestion and will modify the figure accordingly during revision.

      (7) Please state what could be the behavioral and functional effect of the ablation of the cortical tissue on top of mPFC. 

      We thank the reviewer for the question. In our experience, mice with a lens implanted in mPFC did not show observable differences from mice without surgery regarding the acquisition of the task and the distribution of the nose-poke durations. Although we could not rule out an effect on other cognitive processes, the mice appeared to be intact in the scope of our task. We will provide these behavior data during our revision.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.

      Thanks for this nice summary of our paper.

      The following points could be addressed in a revision:

      (1) The authors conclude that much of the person-to-person and strain-to-strain variation seems idiosyncratic to individual sera rather than age groups. This point is not yet fully convincing. While the mean titer of an individual may be idiosyncratic to the individual sera, the strain-to-strain variation still reveals some patterns that are consistent across individuals (the authors note the effects of substitutions at sites 145 and 275/276). A more detailed analysis, removing the individual-specific mean titer, could still show shared patterns in groups of individuals that are not necessarily defined by the birth cohort.

      As the reviewer suggests, we normalized the titers for all sera to the geometric mean titer for each individual in the US-based pre-vaccination adults and children. This is only for the 2023-circulating viral strains. We then faceted these normalized titers by the same age groups we used in Figure 6, and the resulting plot is shown below. Although there are differences among virus strains (some are better neutralized than others), there are not obvious age group-specific patterns (eg, the trends in the two facets are similar). To us this suggests that at least for these relatively closely related recent H3N2 strains, the strain-to-strain variation does not obviously segregate by age group. Obviously, it is possible (we think likely) that there would be more obvious age-group specific trends if we looked at a larger swath of viral strains covering a longer time range (eg, over decades of influenza evolution). We plan to add the new plots shown below to a supplemental figure in the revised manuscript.

      Author response image 1.

      Author response image 2.
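
      The normalization described in this response (dividing each serum's titers by that individual's geometric mean titer) can be sketched as follows. This is a minimal illustration, assuming one dictionary of positive titers per serum; the function name, strain labels, and titer values are hypothetical and do not come from the study.

```python
import math

def normalize_to_geometric_mean(titers):
    """Divide each titer by the geometric mean titer of the same serum.

    titers: dict mapping strain name -> positive neutralization titer
            for ONE serum (hypothetical input layout).
    """
    if not titers:
        raise ValueError("at least one titer is required")
    if any(t <= 0 for t in titers.values()):
        raise ValueError("titers must be positive")
    mean_log = sum(math.log(t) for t in titers.values()) / len(titers)
    geometric_mean = math.exp(mean_log)
    return {strain: t / geometric_mean for strain, t in titers.items()}

# Two hypothetical sera with a tenfold difference in overall titer
# but the same strain-to-strain pattern.
serum_a = {"strain_1": 100.0, "strain_2": 400.0}
serum_b = {"strain_1": 1000.0, "strain_2": 4000.0}

norm_a = normalize_to_geometric_mean(serum_a)
norm_b = normalize_to_geometric_mean(serum_b)
print(norm_a)
print(norm_b)
```

      After normalization, the two sera have matching profiles, so any remaining differences between individuals reflect strain-to-strain variation rather than the overall magnitude of each person's titers.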

      (2) The authors show that the fraction of sera with a titer below 138 correlates strongly with the inferred growth rate using MLR. However, the authors also note that there exists a strong correlation between the MLR growth rate and the number of HA1 mutations. This analysis does not yet show that the titers provide substantially more information about the evolutionary success. The actual relation between the measured titers and fitness is certainly more subtle than suggested by the correlation plot in Figure 5. For example, the clades A/Massachusetts and A/Sydney both have a positive fitness at the beginning of 2023, but A/Massachusetts has substantially higher relative fitness than A/Sydney. The growth inference in Figure 5b does not appear to map that difference, and the antigenic data would give the opposite ranking. Similarly, the clades A/Massachusetts and A/Ontario have both positive relative fitness, as correctly identified by the antigenic ranking, but at quite different times (i.e., in different contexts of competing clades). Other clades, like A/St. Petersburg are assigned high growth and high escape but remain at low frequency throughout. Some mention of these effects not mapped by the analysis may be appropriate.

      Thanks for the nice summary of our findings in Figure 5. However, the reviewer is misreading the growth charts when they say that A/Massachusetts/18/2022 has a substantially higher fitness than A/Sydney/332/2023. Figure 5a shows the frequency trajectory of different variants over time. While A/Massachusetts/18/2022 reaches a higher frequency than A/Sydney/332/2023, the trajectory is similar and the reason that A/Massachusetts/18/2022 reached a higher max frequency is that it started at a higher frequency at the beginning of 2023. The MLR growth rate estimates differ from the maximum absolute frequency reached: instead, they reflect how rapidly each strain grows relative to others. In fact, A/Massachusetts/18/2022 and A/Sydney/332/2023 have similar growth rates, as shown in Supplementary Figure 6b. Similarly, A/Saint-Petersburg/RII-166/2023 starts at a low initial frequency but then grows even as A/Massachusetts/18/2022 and A/Sydney/332/2023 are declining, and so has a higher growth rate than both of those. In the revised manuscript, we will clarify how viral growth rates are estimated from frequency trajectories, and how growth rate differs from max frequency.
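
      The distinction between growth rate and maximum frequency can be illustrated with a toy calculation. The sketch below is not the authors' MLR implementation: it assumes a simplified multinomial-logistic model in which the log ratio of each variant's frequency to a reference grows linearly in time, and all names and parameter values are hypothetical.

```python
import math

def simulate_frequencies(times, intercepts, slopes):
    """Softmax frequencies for variants competing with a reference (log-weight 0)."""
    trajectories = {name: [] for name in intercepts}
    trajectories["reference"] = []
    for t in times:
        weights = {n: math.exp(intercepts[n] + slopes[n] * t) for n in intercepts}
        total = 1.0 + sum(weights.values())
        for name, w in weights.items():
            trajectories[name].append(w / total)
        trajectories["reference"].append(1.0 / total)
    return trajectories

def log_ratio_slope(times, variant_freqs, reference_freqs):
    """Least-squares slope of log(variant / reference) against time."""
    y = [math.log(v / r) for v, r in zip(variant_freqs, reference_freqs)]
    mean_t = sum(times) / len(times)
    mean_y = sum(y) / len(y)
    num = sum((t - mean_t) * (yi - mean_y) for t, yi in zip(times, y))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den

# Variants A and B share a growth rate but start at different frequencies.
times = list(range(30))
freqs = simulate_frequencies(
    times,
    intercepts={"A": -1.0, "B": -3.0},  # hypothetical starting log-ratios
    slopes={"A": 0.1, "B": 0.1},        # identical relative growth rates
)
slope_a = log_ratio_slope(times, freqs["A"], freqs["reference"])
slope_b = log_ratio_slope(times, freqs["B"], freqs["reference"])
print(f"growth rate A: {slope_a:.3f}, max frequency A: {max(freqs['A']):.3f}")
print(f"growth rate B: {slope_b:.3f}, max frequency B: {max(freqs['B']):.3f}")
```

      In this toy example both variants have the same estimated growth rate, yet A reaches a higher maximum frequency simply because it started higher, which mirrors the argument made above for A/Massachusetts/18/2022 and A/Sydney/332/2023.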

      (3) For the protection profile against the vaccine strains, the authors find for the adult cohort that the highest titer is always against the oldest vaccine strain tested, which is A/Texas/50/2012. However, the adult sera do not show an increase in titer towards older strains, but only a peak at A/Texas. Therefore, it could be that this is a virus-specific effect, rather than a property of the protection profile. Could the authors test with one older vaccine virus (A/Perth/16/2009?) whether this really can be a general property?

      We are interested in studying immune imprinting more thoroughly using sequencing-based neutralization assays, but we note that the adults in the cohorts we studied would have been imprinted with much older strains than included in this library. As this paper focuses on the relative fitness of contemporary strains with minor secondary points regarding imprinting, these experiments are beyond the scope of this study. We’re excited for future work (from our group or others) to explore these points by making a new virus library with strains from multiple decades of influenza evolution.

      Reviewer #2 (Public review):

      This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, which will be relevant across pathogens (assuming the assay can be appropriately adapted). I only have a few comments, focused on maximising the information provided by the sera.

      Thanks very much!

      Firstly, one of the major findings is that there is wide heterogeneity in responses across individuals. However, we could expect that individuals' responses should be at least correlated across the viruses considered, especially when individuals are of a similar age. It would be interesting to quantify the correlation in responses as a function of the difference in ages between pairs of individuals. I am also left wondering what the potential drivers of the differences in responses are, with age being presumably key. It would be interesting to explore individual factors associated with responses to specific viruses (beyond simply comparing adults versus children).

      We’re excited by this idea! We plan to include these analyses in our revised pre-print.

      Relatedly, is the phylogenetic distance between pairs of viruses associated with similarity in responses?

      As above, we like this idea and our revised pre-print will include this analysis.

      Figure 5C is also a really interesting result. To be able to predict growth rates based on titers in the sera is fascinating. As touched upon in the discussion, I suspect it is really dependent on the representativeness of the sera of the population (so, e.g., if only elderly individuals provided sera, it would be a different result than if only children provided samples). It may be interesting to compare different hypotheses - so e.g., see if a population-weighted titer is even better correlated with fitness - so the contribution from each individual's titer is linked to a number of individuals of that age in the population. Alternatively, maybe only the titers in younger individuals are most relevant to fitness, etc.

      We’re very interested in these analyses, but suggest they may be better explored in subsequent works that could sample more children, teenagers and adults across age groups. Our sera set, as the reviewer suggests, may be under-powered to perform the proposed analysis on subsetted age groups of our larger age cohorts.

      In Figure 6, the authors lump together individuals within 10-year age categories - however, this is potentially throwing away the nuances of what is happening at individual ages, especially for the children, where the measured viruses cross different groups. I realise the numbers are small and the viruses only come from a small number of years, however, it may be preferable to order all the individuals by age (y-axis) and the viral responses in ascending order (x-axis) and plot the response as a heatmap. As currently plotted, it is difficult to compare across panels.

      This is a good suggestion, and a revised pre-print will include heatmaps of the different cohorts, ordered by ages of individuals.

      Reviewer #3 (Public review):

      The authors use high-throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. However, there are some areas where I thought the work could be more strongly motivated and linked together. In particular, how the vaccine responses in US and Australia in Figures 6-7 relate to the earlier analysis around growth rates, and what we would expect the relationship between growth rate and population immunity to be based on epidemic theory.

      Thank you for this nice summary. This reviewer also notes that the text related to figures 6 and 7 is more secondary to the main story presented in figures 3-5. The main motivation for including figures 6 and 7 was to demonstrate the wide-ranging applications of sequencing-based neutralization data, and this can certainly be clarified in minor text revisions.

    1. Author Response

      Public Reviews

      We thank both reviewers for taking the time and effort to think critically about our paper and point out areas where it can be improved. In this document, we do our best to clarify any misunderstandings with the hope that further consideration about the strengths and weaknesses of our approach will be possible. Our responses are in bold.

      Reviewer #1 (Public Review):

      Summary:

      In their manuscript, Schmidlin, Apodaca, et al try to answer fundamental questions about the evolution of new phenotypes and the trade-offs associated with this process. As a model, they use yeast resistance to two drugs, fluconazole and radicicol. They use barcoded libraries of isogenic yeasts to evolve thousands of strains in 12 different environments. They then measure the fitness of evolved strains in all environments and use these measurements to examine patterns in fitness trade-offs. They identify only six major clusters corresponding to different trade-off profiles, suggesting the vast genotypic landscape of evolved mutants translates to a highly constrained phenotypic space. They sequence over a hundred evolved strains and find that mutations in the same gene can result in different phenotypic profiles.

      Overall, the authors deploy innovative methods to scale up experimental evolution experiments, and in many aspects of their approach tried to minimize experimental variation.

      We thank the reviewer for this positive assessment of our work. We are happy that the reviewer noted what we feel is a unique strength of our approach: we scaled up experimental evolution by using DNA barcodes and by exploring 12 related selection pressures. Despite this scaling up, we still see phenotypic convergence among the 774 adaptive mutants we study.

      The environments we study represent 12 different concentrations or combinations of two drugs, radicicol and fluconazole. Our hope is that this large dataset (774 mutants x 12 environments) will be useful, both to scientists who are generally interested in the genetic and phenotypic underpinnings of adaptation, and to scientists specifically interested in the evolution of drug resistance.

      Weaknesses:

      (1) One of the objectives of the authors is to characterize the extent of phenotypic diversity in terms of resistance trade-offs between fluconazole and radicicol. To minimize noise in the measurement of relative fitness, the authors only included strains with at least 500 barcode counts across all time points in all 12 experimental conditions, resulting in a set of 774 lineages passing this threshold. This corresponds to a very small fraction of the starting set of ~21 000 lineages that were combined after experimental evolution for fitness measurements.

      This is a misunderstanding that we will work to clarify in the revision. Our starting set did not include 21,000 adaptive lineages. The total number of unique adaptive lineages in this starting set is much lower than 21,000 for two reasons.

      First, ~21,000 represents the number of single colonies we isolated in total from our evolution experiments. Many of these isolates possess the same barcode, meaning they are duplicates. Second, and more importantly, most evolved lineages do not acquire adaptive mutations, meaning that many of the 21,000 isolates are genetically identical to their ancestor. In our revised manuscript, we will explicitly state that these 21,000 isolated lineages do not all represent unique, adaptive lineages. In figure 2 and all associated text, we will change the word “lineages” to “isolates,” where relevant.

      More broadly speaking, several previous studies have demonstrated that diverse genetic mutations converge at the level of phenotype, and have suggested that this convergence makes adaptation more predictable (PMID33263280, PMID37437111, PMID22282810, PMID25806684). Our study captures mutants that are overlooked in previous studies, such as those that emerge across subtly different selection pressures (e.g., 4 µg/ml vs. 8 µg/ml flu) and those that are undetectable in evolutions lacking DNA barcodes. Thus, while our experimental design misses some mutants (see next comment), it captures many others. Note that 774 adaptive lineages is more than most previous studies. Thus, we feel that “our work – showing that 774 mutants fall into a much smaller number of groups” is important because it “contributes to growing literature suggesting that the phenotypic basis of adaptation is not as diverse as the genetic basis (lines 161 - 162).”

      As the authors briefly remark, this will bias their datasets for lineages with high fitness in all 12 environments, as all these strains must be fit enough to maintain a high abundance.

      The word “briefly” feels a bit unfair because we discuss this bias on 3 separate occasions (on lines 146 - 147, 260 - 264, and in more detail on 706 - 714). We even walk through an example of a class of mutants that our study misses. We say, “our study is underpowered to detect adaptive lineages that have low fitness in any of the 12 environments. This is bound to exclude large numbers of adaptive mutants. For example, previous work has shown some FLU resistant mutants have strong tradeoffs in RAD (Cowen and Lindquist 2005). Perhaps we are unable to detect these mutants because their barcodes are at too low a frequency in RAD environments, thus they are excluded from our collection of 774.”

      In our revised version, we will add more text to the first mention of these missing mutants (lines 146 - 147) so that the implications are more immediately made apparent.

      While we “miss” some classes of mutants, we “catch” other classes that may have been missed in previous studies of convergence. For example, we observe a unique class of FLU-resistant mutants that primarily emerged in evolution experiments that lack FLU (Figure 3). Thus, we think that the unique design of our study, surveying 12 environments, allows us to make a novel contribution to the study of phenotypic convergence.

      One of the main observations of the authors is phenotypic space is constrained to a few clusters of roughly similar relative fitness patterns, giving hope that such clusters could be enumerated and considered to design antimicrobial treatment strategies. However, by excluding all lineages that fit in only one or a few environments, they conceal much of the diversity that might exist in terms of trade-offs and set up an inclusion threshold that might present only a small fraction of phenotypic space with characteristics consistent with generalist resistance mechanisms or broadly increased fitness. This has important implications regarding the general conclusions of the authors regarding the evolution of trade-offs.

      We discussed these implications in some detail in the 16 lines mentioned above (146 - 147, 260 - 264, 706 - 714). To add to this discussion, we will also add the following sentence to the end of the paragraph on lines 697 - 714: “This could complicate (or even make impossible) endeavors to design antimicrobial treatment strategies that thwart resistance”.

      We will also add a new paragraph that discusses these implications earlier in our manuscript. This paragraph will highlight the strengths of our method (e.g., that we “catch” classes of mutants that are often overlooked) while being transparent about the weaknesses of our approach (e.g., that we “miss” mutants with strong tradeoffs).

      (2) Most large-scale pooled competition assays using barcodes are usually stopped after ~25 generations to avoid noise due to the emergence of secondary mutations.

      The rate at which new mutations enter a population is driven by various factors such as the mutation rate and population size, so choosing an arbitrary threshold like 25 generations is difficult.

      We conducted our fitness competition following previous work using the Levy/Blundell yeast barcode system, in which the number of generations reported varies from 32 to 40 (PMID33263280, PMID27594428, PMID37861305, see PMID27594428 for detailed calculation of the fraction of lineages biased by secondary mutations in this system).

      The authors measure fitness across ~40 generations, which is almost the same number of generations as in the evolution experiment. This raises the possibility of secondary mutations biasing abundance values, which would not have been detected by the whole genome sequencing as it was performed before the competition assay.

      We understand how the reviewer came to this misunderstanding and will adjust our revised manuscript accordingly. Previous work has demonstrated that, in this particular evolution platform, most of the mutations actually occur during the transformation that introduces the DNA barcodes (PMID25731169). In other words, these mutations do not accumulate during the 40 generations of evolution, they are already there. So the observation that we collect a genetically diverse pool of adaptive mutants after 40 generations of evolution is not evidence that 40 generations is enough time for secondary mutations to bias abundance values.

      (3) The approach used by the authors to identify and visualize clusters of phenotypes among lineages does not seem to consider the uncertainty in the measurement of their relative fitness. As can be seen from Figure S4, the inter-replicate difference in measured fitness can often be quite large. From these graphs, it is also possible to see that some of the fitness measurements do not correlate linearly (ex.: Med Flu, Hi Rad Low Flu), meaning that taking the average of both replicates might not be the best approach.

      This concern, and all subsequent concerns, seem to be driven by either (a) general concerns about the noisiness of fitness measurements obtained from large-scale barcode fitness assays or (b) general concerns about whether the clusters obtained from our dimensional reduction approach capture this noise as opposed to biologically meaningful differences.

      We will respond to each concern point-by-point, but want to start by generally stating that (a) our particular large-scale barcode fitness assay has several features that diminish noise, and (b) we devote 4 figures and 200 lines of text to demonstrating that these clusters capture biologically meaningful differences between mutants (and not noise).

      In terms of this specific concern, we performed an analysis of noise in the submitted manuscript: Our noisiest fitness measurements correspond to barcodes that are the least abundant and thus suffer the most from stochastic sampling noise. These are also the barcodes that introduce the nonlinearity the reviewer mentions. We removed these from our dataset by increasing our coverage threshold from 500 reads to 5,000 reads. The clusters did not collapse, which suggests that they were not capturing noise (Figure S7 panel B). But we agree with the reviewer that this analysis alone is not sufficient to conclude that the clusters distinguish groups of mutants with unique fitness tradeoffs.
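
      The coverage filter described above can be illustrated with a minimal sketch. It assumes that a lineage is kept only if its reads, summed over time points, meet the threshold in every condition; the barcode names, condition names, and counts are hypothetical, and this is not the pipeline used in the manuscript.

```python
def filter_by_coverage(counts, min_reads):
    """Keep lineages whose summed reads meet min_reads in every condition.

    counts maps lineage -> condition -> list of read counts per time point.
    """
    kept = {}
    for lineage, by_condition in counts.items():
        # A lineage passes only if no condition falls below the threshold.
        if all(sum(reads) >= min_reads for reads in by_condition.values()):
            kept[lineage] = by_condition
    return kept


# Hypothetical toy counts for two conditions (the study surveyed 12).
toy_counts = {
    "bc_A": {"no_drug": [3000, 2500, 2000], "low_flu": [2800, 2600, 2400]},
    "bc_B": {"no_drug": [400, 300, 200], "low_flu": [250, 200, 150]},
    "bc_C": {"no_drug": [900, 800, 700], "low_flu": [100, 60, 20]},
}

lenient = filter_by_coverage(toy_counts, 500)   # original threshold
strict = filter_by_coverage(toy_counts, 5000)   # stricter threshold
print(sorted(lenient), sorted(strict))
```

      In this toy example, bc_C is dropped at either threshold because it is rare in one condition, which mirrors the bias toward lineages that remain abundant in all environments.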

      Because the clustering approach used does not seem to take this variability into account, it becomes difficult to evaluate the strength of the clustering, especially because the UMAP projection does not include any representation of uncertainty around the position of lineages.

      To evaluate the strength of the clustering, we performed numerous analyses including whole genome sequencing, growth experiments, reclustering, and tracing the evolutionary origins of each cluster (Figures 5 - 8). All of these analyses suggested that our clusters capture groups of mutants that have different fitness tradeoffs. We will adjust our revised manuscript to make clear that we do not rely on the results of a clustering algorithm alone to draw conclusions about phenotypic convergence.

      We are also grateful to the reviewer for helping us realize that, as written, our manuscript is not clear with regard to how we perform clustering. We are not using UMAP to decide which mutant belongs to which cluster. Recent work highlights the importance of using an independent clustering method (PMID37590228). Although this recent work addresses the challenge of clustering much higher dimensional data than we survey here, we did indeed use an independent clustering method (gaussian mixture model). In other words, we use UMAP for visualization but not clustering. We also confirm our clustering results using a second independent method (hierarchical clustering; Figure S8). And in our revised manuscript, we will confirm with a third method (PCA, see below). We will adjust the main text and the methods section to make these choices clearer.

      This might paint a misleading picture where clusters appear well separated and well defined but are in fact much fuzzier, which would impact the conclusion that the phenotypic space is constricted.

      The salient question is whether the clusters are so “fuzzy” that they are not meaningful. That interpretation seems unreasonable. Our clusters group mutants with similar genotypes, evolutionary histories, and fitness tradeoffs (Figures 5 - 8). Clustering mutants with similar behaviors is important and useful. It improves phenotypic prediction by revealing which mutants are likely to have at least some phenotypic effects in common. And it also suggests that the phenotypic space is constrained, at least to some degree, which previous work suggests is helpful in predicting evolution (PMID33263280, PMID37437111, PMID22282810, PMID25806684).

      (4) The authors make the decision to use UMAP and a gaussian mixed model to cluster and represent the different fitness landscapes of their lineages of interest. Their approach has many caveats. First, compared to PCA, the axis does not provide any information about the actual dissimilarities between clusters. Using PCA would have allowed a better understanding of the amount of variance explained by components that separate clusters, as well as more interpretable components.

      The components derived from PCA are often not interpretable. It’s not obvious that each one, or even the first one, will represent some intuitive phenotype, like resistance to fluconazole.

      Moreover, we see many non-linearities in our data. For example, fitness in a double drug environment is not predicted by adding up fitness in the relevant single drug environments. Also, there are mutants that have high fitness when fluconazole is absent or abundant, but low fitness when mild concentrations are present. These types of nonlinearities can make the axes in PCA very difficult to interpret, plus these nonlinearities can be missed by PCA, thus we prefer other clustering methods.

      We will adjust our revised manuscript to explain these reasons why we chose UMAP and GMM over PCA.

      Also, we will include PCA in the supplement of our revised manuscript. Please find below PC1 vs PC2, with points colored according to the cluster assignment in figure 4 (i.e. using a gaussian mixture model). It appears the clusters are largely preserved.

      Author response image 1.

      Second, the advantages of dimensional reduction are not clear. In the competition experiment, 11/12 conditions (all but the no drug, no DMSO conditions) can be mapped to only three dimensions: concentration of fluconazole, concentration of radicicol, and relative fitness. Each lineage would have its own fitness landscape as defined by the plane formed by relative fitness values in this space, which can then be examined and compared between lineages.

      We worry that the idea stems from a priori notions of what the important dimensions should be. It also seems like this would miss important nonlinearities such as our observation that low fluconazole behaves more like a novel selection pressure than a dialed down version of high fluconazole.

      Also, we believe the reviewer meant “fitness profile” and not “fitness landscape”. A fitness landscape imagines a walk where every “step” is a mutation. Most lineages in barcoded evolution experiments possess only a single adaptive mutation. A single-step walk is not enough to build a landscape, though others are expanding barcoded evolution experiments beyond the first step (PMID34465770, PMID31723263), so maybe one day this will be possible.

      Third, the choice of 7 clusters as the cutoff for the multiple Gaussian model is not well explained. Based on Figure S6A, BIC starts leveling off at 6 clusters, not 7, and going to 8 clusters would provide the same reduction as going from 6 to 7. This choice also appears arbitrary in Figure S6B, where BIC levels off at 9 clusters when only highly abundant lineages are considered.

      We agree. We did not rely on the results of BIC alone to make final decisions about how many clusters to include. We thank the reviewer for pointing out this gap in our writing. We will adjust our revised manuscript to explain that we ultimately chose to describe 6 clusters that we were able to validate with follow-up experiments. In figures 5, 6, 7, and 8, we use external information to validate the clusters that we report in figure 4. And in lines 697 – 714, we explain that there may be additional clusters beyond those we tease apart in this study.
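
      For readers less familiar with BIC, the following minimal sketch shows how the criterion trades goodness of fit against model size for a Gaussian mixture. It assumes full covariance matrices (the covariance structure is not stated here), and the dimensionality and log-likelihood values are invented for illustration; they are not taken from Figure S6.

```python
import math


def gmm_param_count(n_components, n_dims):
    """Free parameters of a Gaussian mixture with full covariance matrices."""
    means = n_components * n_dims
    covariances = n_components * n_dims * (n_dims + 1) // 2
    weights = n_components - 1  # mixing weights sum to one
    return means + covariances + weights


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Hypothetical maximized log-likelihoods for fits with 5, 6, and 7 clusters.
n_samples, n_dims = 774, 2
hypothetical_fits = {5: -2100.0, 6: -2000.0, 7: -1990.0}

scores = {
    k: bic(ll, gmm_param_count(k, n_dims), n_samples)
    for k, ll in hypothetical_fits.items()
}
best_k = min(scores, key=scores.get)
print(best_k, scores)
```

      With these invented values, the small gain in likelihood from a seventh cluster does not offset the penalty for its extra parameters. When BIC differences between neighboring cluster counts are small, external validation of the kind described above is a reasonable complement to the criterion.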

      This directly contradicts the statement in the main text that clusters are robust to noise, as a more stringent inclusion threshold appears to increase and not decrease the optimal number of clusters. Additional criteria to BIC could have been used to help choose the optimal number of clusters or even if mixed Gaussian modeling is appropriate for this dataset.

      We are under the following impression: If our clustering method was overfitting, i.e. capturing noise, the optimal number of clusters should decrease when we eliminate noise. It increased. In other words, the observation that our clusters did not collapse (i.e. merge) when we removed noise suggests these clusters were not capturing noise.

      More generally, our validation experiments, described below, provide additional evidence that our clusters capture meaningful differences between mutants (and not noise).

      (5) Large-scale barcode sequencing assays can often be noisy and are generally validated using growth curves or competition assays.

      Some types of bar-seq methods, in particular those that look at fold change across two time points, are noisier than others that look at how frequency changes across multiple timepoints (PMID30391162). Here, we use the less noisy method. We also reduce noise by using a stricter coverage threshold than previous work (e.g., PMID33263280), and by excluding batch effects by performing all experiments simultaneously (PMID37237236).
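
      The difference between the two kinds of estimate can be sketched as follows. This is a deliberately simplified illustration, assuming that relative fitness is approximated by the slope of log barcode frequency over generations; it ignores the correction for changing mean population fitness used in published inference methods, and the sampling times and frequencies are hypothetical.

```python
import math


def log_frequency_slope(generations, frequencies):
    """Least-squares slope of ln(frequency) against generations."""
    ys = [math.log(f) for f in frequencies]
    n = len(generations)
    mean_x = sum(generations) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(generations, ys))
    sxx = sum((x - mean_x) ** 2 for x in generations)
    return sxy / sxx


def two_point_slope(generations, frequencies):
    """Slope from the first and last time points only (fold change)."""
    span = generations[-1] - generations[0]
    return math.log(frequencies[-1] / frequencies[0]) / span


# Hypothetical lineage growing at 0.05 per generation, sampled five times.
true_slope = 0.05
generations = [0, 8, 16, 24, 32]
clean = [1e-4 * math.exp(true_slope * g) for g in generations]

# Same lineage with a 1.5-fold sampling error at the final time point.
noisy = clean[:-1] + [clean[-1] * 1.5]

multi_error = abs(log_frequency_slope(generations, noisy) - true_slope)
pair_error = abs(two_point_slope(generations, noisy) - true_slope)
print(multi_error, pair_error)
```

      In this example the error at one time point shifts the two-point estimate more than the multi-point estimate, because the intermediate time points constrain the fitted slope.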

      The main assay we use to measure fitness has been previously validated (PMID27594428). No subsequent study using this assay validates using the methods suggested by the reviewer (see PMID37861305, PMID33263280, PMID31611676, PMID29429618, PMID37192196, PMID34465770, PMID33493203).

      More to the point, bar-seq has been used, without the reviewer’s suggested validation, to demonstrate that the way some mutant’s fitness changes across environments is different from other mutants (PMID33263280, PMID37861305, PMID31611676, PMID33493203, PMID34596043). This is the same thing that we use bar-seq to demonstrate.

      For all of these reasons, we are hesitant to confirm bar-seq itself as a valid way to infer fitness. It seems this is already accepted as a standard in our field.

      Having these types of results would help support the accuracy of the main assay in the manuscript and thus better support the claims of the authors.

      We don’t agree that fitness measurements obtained from this bar-seq assay generally require validation. But we do agree that it is important to validate whether the mutants in each of our 6 clusters indeed are different from one another in meaningful ways, in particular, in that they have different fitness tradeoffs. We have four figures (5 - 8) and 200 lines of text dedicated to validating whether our clusters capture reproducible and biologically meaningful differences between mutants. Happily, one of these figures (Fig 7) includes growth curves, which are exactly the type of validation experiment asked for by the reviewer.

      Below, we walk through the different types of validation experiments that are present in our original manuscript, and additional validation experiments that we plan to include in the revised version. We are hopeful that these validation experiments are sufficient, or at the very least, that this list empowers reviewers to point out where more work is needed.

      (1) Mutants from different clusters have different growth curves: In our original manuscript, we measured growth curves corresponding to a fitness tradeoff that we thought was surprising. Mutants in clusters 4 and 5 both have fitness advantages in single drug conditions. While mutants from cluster 4 also are advantageous in the double drug conditions, mutants from cluster 5 are not! We validated these different behaviors by studying growth curves for a mutant from each cluster (Figures 7 and S10).

      (2) Mutants from different clusters have different evolutionary origins: In our original manuscript, we came up with a novel way to ask whether the clusters capture different types of adaptive mutants. We asked whether the mutants in each cluster originate from different evolution experiments. Indeed they often do (see pie charts in Figures 6, 7, 8). This method also provides evidence supporting each cluster’s differing fitness tradeoffs.

      For example, mutants in cluster 5 appear to have a tradeoff in a double drug condition (described above). They rarely originate from that evolution condition, unlike mutants in nearby cluster 4 (see Figure 7).

      (3) Mutants from each cluster often fall into different genes: In our original manuscript, we sequenced many of these mutants and show that mutants in the same gene are often found in the same cluster. For example, all 3 IRA1 mutants are in cluster 6 (Fig 8), both GPB2 mutants are in cluster 4 (Figs 7 & 8), and 35/36 PDR mutants are in either cluster 2 or 3 (Figs 5 & 6).

      (4) Mutants from each cluster have behaviors previously observed in the literature: In our original manuscript, we compared our sequencing results to the literature and found congruence. For example, PDR mutants are known to provide a fitness benefit in fluconazole and are found in clusters that have high fitness in fluconazole (lines 457 - 462). Previous work suggests that some mutations to PDR have different tradeoffs than others, which is what we see (lines 540 - 542). IRA1 mutants were previously observed to have high fitness in our “no drug” condition, and are found in the cluster that has the highest fitness in the “no drug” condition (lines 642 - 646). Previous work even confirms the unusual fitness tradeoff we observe where IRA1 and other cluster 6 mutants have low fitness only in low concentrations of fluconazole (lines 652 - 657).

      (5) Mutants largely remain in their clusters when we use alternate clustering methods: In our original manuscript, we performed various different reclustering and/or normalization approaches on our data (Fig 6, S5, S7, S8, S9). The clusters of mutants that we observe in figure 4 do not change substantially when we recluster the data. We will add PCA (see above) to these analyses in our revised manuscript.

      (6) We will include additional data showing that mutants in different clusters have different evolutionary origins: Cluster 1 is defined by high fitness in low fluconazole that declines with increasing fluconazole (see Fig 4E and Fig 5C). In our revised manuscript, we will show that cluster 1 lineages were overwhelmingly sampled from evolutions conducted in our lowest concentration of fluconazole (see figure panel A below). No other cluster’s evolutionary history shows this pattern (figures 6, 7, and 8).

      (7) We will include additional data showing that mutants in different clusters have different growth curves: Cluster 1 lineages are unique in that their fitness advantage is specific to low flu and trades off in higher concentrations of fluconazole. We obtained growth curves for three cluster 1 mutants (2 SUR1 mutants and 1 UPC2 mutant). We compared them to growth curves for three PDR mutants (from clusters 2 and 3). Cluster 1 mutants appear to have the highest growth rates and reach the higher carrying capacity in low fluconazole (see red and green lines in Author response image 2 panel B below). But the cluster 1 mutants are negatively affected by higher concentrations of fluconazole, much more so than the mutants from clusters 2 and 3 (see Author response image 2 panel C below). This is consistent with the different fitness tradeoffs we observe for each cluster (figures 4 and 5). We will include a more detailed version of this analysis and the figures below in our revised manuscript.

      Author response image 2.

      Validation experiments demonstrate that cluster 1 mutants have uniquely high fitness in only the lowest concentration of fluconazole. (A) The mutant lineages in cluster 1 were largely sampled from evolution experiments performed in low flu. This is not true of other clusters (see pie charts in main manuscript). (B) In low flu (4 𝜇g/ml), Cluster 1 lineages (red/UPC2 and green/SUR1) grow faster and achieve higher density than lineages from clusters 2 and 3 (blue/PDR). This is consistent with barseq measurements demonstrating that cluster 1 mutants have the highest fitness in low flu. (C) Cluster 1 lineages are sensitive to increasing flu concentrations (SUR1 and UPC2 mutants, middle and rightmost graphs). This is apparent in that the gray (8 𝜇g/ml flu) and light blue (32 𝜇g/ml flu) growth curves rise more slowly and reach lower density than the dark blue curves (4 𝜇g/ml flu). But this is not the case for the PDR mutants from clusters 2 and 3 (leftmost graph). These observations are consistent with the bar-seq fitness data presented in the main manuscript (Fig 4E).

      With all of these validation efforts combined, we are hopeful that the reviewer is now more convinced that our clusters capture groups of mutants with different fitness tradeoffs (as opposed to noise). We want to conclude by saying that we are grateful to the reviewer for making us think deeply about areas where we can include additional validation efforts as well as areas where we can make our manuscript clearer.

      Reviewer #2 (Public Review):

      Summary:

      Schmidlin & Apodaca et al. aim to distinguish mutants that resist drugs via different mechanisms by examining fitness tradeoffs across hundreds of fluconazole-resistant yeast strains. They barcoded a collection of fluconazole-resistant isolates and evolved them in different environments with a view to having relevance for evolutionary theory, medicine, and genotype-phenotype mapping.

      Strengths:

      There are multiple strengths to this paper, the first of which is pointing out how much work has gone into it; the quality of the experiments (the thought process, the data, the figures) is excellent. Here, the authors seek to induce mutations in multiple environments, which is a really large-scale task. I particularly like the attention paid to isolates which are resistant to low concentrations of FLU. So often these are overlooked in favour of those conferring MIC values >64/128 etc. What was seen is different genotype and fitness profiles. I think there's a wealth of information here that will actually be of interest to more than just the fields mentioned (evolutionary medicine/theory).

      We are very grateful for this positive review. This was indeed a lot of work! We are happy that the reviewer noted what we feel is a unique strength of our manuscript: that we survey adaptive isolates across multiple environments, including low drug concentrations.

      Weaknesses:

      Not picking up low fitness lineages - which the authors discuss and provide a rationale as to why. I can completely see how this has occurred during this research, and whilst it is a shame I do not think this takes away from the findings of this paper. Maybe in the next one!

      We thank the reviewer for these words of encouragement and will work towards catching more low fitness lineages in our next project.

      In the abstract the authors focus on 'tradeoffs' yet in the discussion they say the purpose of the study is to see how many different mechanisms of FLU resistance may exist (lines 679-680), followed up by "We distinguish mutants that likely act via different mechanisms by identifying those with different fitness tradeoffs across 12 environments". Whilst I do see their point, and this is entirely feasible, I would like a bit more explanation around this (perhaps in the intro) to help lay-readers make this jump. The remainder of my comments on 'weaknesses' are relatively fixable, I think:

      We think that phrasing the “jump” as a question might help lay readers get from point A to point B. So, in the introduction of our revised manuscript, we will add a paragraph roughly similar to this one: “If two groups of drug-resistant mutants have different fitness tradeoffs, does it mean that they provide resistance through different underlying mechanisms? Alternatively, it could mean that both provide drug resistance via the same mechanism, but some mutations come with a cost that others don’t pay. However, another way to phrase this alternative is to say that both groups of mutants affect fitness through different suites of mechanisms that are only partially overlapping. And so, by identifying groups of mutants with different fitness tradeoffs, we argue that we will be uncovering sets of mutations that impact fitness through different underlying mechanisms. The ability to do so would be useful for genotype-phenotype mapping endeavors.”

      In the introduction I struggle to see how this body of research fits in with the current literature, as the literature cited is a hodge-podge of bacterial and fungal evolution studies, which are very different! For example, the authors state "previous work suggests that mutants with different fitness tradeoffs may affect fitness through different molecular mechanisms" (lines 129-131) and then cite three papers, only one of which is a fungal research output. However, the next sentence focuses solely on literature from fungal research. Citing bacterial work as a foundation is fine, but as you're using yeast for this I think tailoring the introduction more to what is and isn't known in fungi would be more appropriate. It would also be great to then circle back around and mention monotherapy vs combination drug therapy for fungal infections as a rationale for this study. The study seems to be focused on FLU-resistant mutants, which is the first-line drug of choice, but many (yeast) infections have acquired resistance to this and combination therapy is the norm.

      In our revised manuscript, we will carefully review all citations. The issue may stem from our attempt to reach two different groups of scientists. We ourselves are broadly interested in the structure of the genotype-phenotype-fitness map (PMID33263280, PMID32804946). Though the 3 papers the reviewer mentions on lines 132 - 133 all pertain to yeast, we cite them because they are studies about the complexity of this map. Their conclusions, in theory, should apply broadly, beyond yeast. Similarly, the reason we cite papers from yeast, as well as bacteria and cancer, is that we believe general conclusions about the genotype-phenotype-fitness map should apply broadly. For example, the sentence the reviewer highlights, “previous work suggests that mutants with different fitness tradeoffs may affect fitness through different molecular mechanisms” is a general observation about the way genotype maps to fitness. So we cited papers from across the tree of life to support this sentence.

      On the other hand, because we study drug resistant mutations, we also hope that our work is of use to scientists studying the evolution of resistance. We agree with the reviewer that in this regard, some of our findings may be especially pertinent to the evolution of resistance to antifungal drugs. We will consider this when reviewing the citations in our revised manuscript and add some text to clarify these points.

      Methods: Line 769 - which yeast? I haven't even seen mention of which species is being used in this study; different yeast employ different mechanisms of adaptation for resistance, so could greatly impact the results seen. This could help with some background context if the species is mentioned (although I assume S. cerevisiae).

      In the revised manuscript, we will make clear that we study S. cerevisiae.

      In which case, should aneuploidy be considered as a mechanism? This is mentioned briefly on line 556, but with all the sequencing data acquired this could be checked quickly?

      We like this idea and we are working on it, but it is not straightforward. The reviewer is correct in that we can use the sequencing data that we already have. But calling aneuploidy with certainty is tough because its signal can be masked by noise. In other words, some regions of the genome may be sequenced more than others by chance. Given this is not straightforward, at least not for us, this analysis will likely have to wait for a subsequent paper.

      I think the authors could be bolder and try and link this to other (pathogenic) yeasts. What are the implications of this work on say, Candida infections?

      Perhaps because our background lies in general study of the genotype-phenotype map, we did not want to make bold assertions about how our work might apply to pathogenic yeasts. But we see how this could be helpful and will add some discussion points about this. Specifically, we will discuss which of the genes and mutants we observe are also found in Candida. We will also investigate whether our observation that low fluconazole represents a seemingly unique challenge, not just a milder version of high fluconazole, has any corollary in the Candida literature.

    1. Author response:

      We thank the reviewers for their thorough reading and thoughtful feedback. Below, we provisionally address each of the concerns raised in the public reviews, and outline our planned revision that aims to further clarify and strengthen the manuscript.

      In our response, we clarify our conceptualization of elasticity as a dimension of controllability, formalizing it within an information-theoretic framework, and demonstrating that controllability and its elasticity are partially dissociable. Furthermore, we provide clarifications and additional modeling results showing that our experimental design and modeling approach are well-suited to dissociating elasticity inference from more general learning processes, and are not inherently biased to find overestimates of elasticity. Finally, we clarify the advantages and disadvantages of our canonical correlation analysis (CCA) approach for identifying latent relationships between multidimensional data sets, and provide additional analyses that strengthen the link between elasticity estimation biases and a specific psychopathology profile.

      Reviewer 1:

      This research takes a novel theoretical and methodological approach to understanding how people estimate the level of control they have over their environment, and how they adjust their actions accordingly. The task is innovative and both it and the findings are well-described (with excellent visuals). They also offer thorough validation for the particular model they develop. The research has the potential to theoretically inform the understanding of control across domains, which is a topic of great importance.

      We thank the reviewer for their favorable appraisal and valuable suggestions, which have helped clarify and strengthen the study’s conclusion. 

      An overarching concern is that this paper is framed as addressing resource investments across domains that include time, money, and effort, and the introductory examples focus heavily on effort-based resources (e.g., exercising, studying, practicing). The experiments, though, focus entirely on the equivalent of monetary resources - participants make discrete actions based on the number of points they want to use on a given turn. While the same ideas might generalize to decisions about other kinds of resources (e.g., if participants were having to invest the effort to reach a goal), this seems like the kind of speculation that would be better reserved for the Discussion section rather than using effort investment as a means of introducing a new concept (elasticity of control) that the paper will go on to test.

      We thank the reviewer for pointing out a lack of clarity regarding the kinds of resources tested in the present experiment. Investing additional resources in the form of extra tickets did not only require participants to pay more money. It also required them to invest additional time – since each additional ticket meant making another attempt to board the vehicle, extending the duration of the trial, and attentional effort – since every attempt required precisely timing a spacebar press as the vehicle crossed the screen. Given this involvement of money, time, and effort resources, we believe it would be imprecise to present the study as concerning monetary resources in particular. That said, we agree with the Reviewer that results might differ depending on the resource type that the experiment or the participant considers most. Thus, in our revision of the manuscript, we will make sure to clarify the kinds of resources the experiment involved, and highlight the open question of whether inferences concerning the elasticity of control generalize across different resource domains.

      Setting aside the framing of the core concepts, my understanding of the task is that it effectively captures people's estimates of the likelihood of achieving their goal (Pr(success)) conditional on a given investment of resources. The ground truth across the different environments varies such that this function is sometimes flat (low controllability), sometimes increases linearly (elastic controllability), and sometimes increases as a step function (inelastic controllability). If this is accurate, then it raises two questions.

      First, on the modeling front, I wonder if a suitable alternative to the current model would be to assume that the participants are simply considering different continuous functions like these and, within a Bayesian framework, evaluating the probabilistic evidence for each function based on each trial's outcome. This would give participants an estimate of the marginal increase in Pr(success) for each ticket, and they could then weigh the expected value of that ticket choice (Pr(success)*150 points) against the marginal increase in point cost for each ticket. This should yield similar predictions for optimal performance (e.g., opt-out for lower controllability environments, i.e., flatter functions), and the continuous nature of this form of function approximation also has the benefit of enabling tests of generalization to predict changes in behavior if there was, for instance, changes in available tickets for purchase (e.g., up to 4 or 5) or changes in ticket prices. Such a model would of course also maintain a critical role for priors based on one's experience within the task as well as over longer timescales, and could be meaningfully interpreted as such (e.g., priors related to the likelihood of success/failure and whether one's actions influence these). It could also potentially reduce the complexity of the model by replacing controllability-specific parameters with multiple candidate functions (presumably learned through past experience, and/or tuned by experience in this task environment), each of which is being updated simultaneously.

      Second, if the reframing above is apt (regardless of the best model for implementing it), it seems like the taxonomy being offered by the authors risks a form of "jangle fallacy," in particular by positing distinct constructs (controllability and elasticity) for processes that ultimately comprise aspects of the same process (estimation of the relationship between investment and outcome likelihood). Which of these two frames is used doesn't bear on the rigor of the approach or the strength of the findings, but it does bear on how readers will digest and draw inferences from this work. It is ultimately up to the authors which of these they choose to favor, but I think the paper would benefit from some discussion of a common-process alternative, at least to prevent too strong of inferences about separate processes/modes that may not exist. I personally think the approach and findings in this paper would also be easier to digest under a common-construct approach rather than forcing new terminology but, again, I defer to the authors on this.

      We thank the reviewer for suggesting this interesting alternative modeling approach. We agree that a Bayesian framework evaluating different continuous functions could offer advantages, particularly in its ability to generalize to other ticket quantities and prices. We will attempt to implement this as an alternative model and compare it with the current model.  
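
The alternative suggested by the Reviewer can be sketched as Bayesian updating over a small set of candidate success functions, followed by a value comparison across ticket counts. This is only an illustration under stated assumptions: the candidate shapes, their probabilities, and the ticket price (`ticket_cost`) are hypothetical placeholders, and only the 150-point reward and the 20% walking success rate come from the text above.

```python
# Candidate functions: P(reaching the goal) for 0-3 tickets.
# All shapes and values are illustrative assumptions, not task parameters;
# 0 tickets means walking, which reaches the treasure 20% of the time.
CANDIDATES = {
    "flat":   {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2},   # low controllability
    "linear": {0: 0.2, 1: 0.3, 2: 0.6, 3: 0.9},   # elastic controllability
    "step":   {0: 0.2, 1: 0.9, 2: 0.9, 3: 0.9},   # inelastic controllability
}


def update(posterior, tickets, success):
    """Apply Bayes' rule over the candidate functions after one trial."""
    unnormalised = {}
    for name, belief in posterior.items():
        p = CANDIDATES[name][tickets]
        unnormalised[name] = belief * (p if success else 1.0 - p)
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}


def predicted_success(posterior, tickets):
    """Posterior-weighted probability of success for a given ticket count."""
    return sum(belief * CANDIDATES[name][tickets]
               for name, belief in posterior.items())


def best_ticket_count(posterior, reward=150, ticket_cost=20):
    """Pick the ticket count with the highest expected net points.

    ticket_cost is a hypothetical price used only for illustration.
    """
    values = {c: predicted_success(posterior, c) * reward - c * ticket_cost
              for c in (0, 1, 2, 3)}
    return max(values, key=values.get)


# Usage: start from a uniform prior and observe a few trial outcomes.
belief = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}
for tickets, success in [(1, False), (1, False), (3, True), (3, True)]:
    belief = update(belief, tickets, success)
```

Because each candidate is a full function of resource investment, the same posterior could in principle be queried for ticket counts or prices that were never experienced, which is the generalization benefit the Reviewer points to.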

      We also acknowledge the importance of avoiding a potential "jangle fallacy". We entirely agree with the Reviewer that elasticity and controllability inferences are not distinct processes. Specifically, we view resource elasticity as a dimension of controllability, hence the name of our ‘elastic controllability’ model. In response to this and other Reviewers’ comments, we now offer a formal definition of elasticity as the reduction in uncertainty about controllability due to knowing the amount of resources the agent is able and willing to invest (see further details in response to Reviewer 3 below).  

      With respect to how this conceptualization is expressed in the modelling, we note that the representation in our model of maximum controllability and its elasticity via different variables is analogous to how a distribution may be represented by separate mean and variance parameters. Ultimately, even in the model suggested by the Reviewer, there would need to be a dedicated variable representing elasticity, such as the probability of sloped controllability functions. A single-process account thus allows that different aspects of this process would be differently biased (e.g., one can have an accurate estimate of the mean of a distribution but overestimate its variance). Therefore, our characterization of distinct elasticity and controllability biases (or to put it more accurately, ‘elasticity of controllability bias’ and ‘maximum controllability bias’) is consistent with a common construct account. 

      That said, given the Reviewer’s comments, we believe that some of the terminology we used may have been misleading. In our planned revision, we will modify the text to clarify that we view elasticity as a dimension of controllability that can only be estimated in conjunction with controllability. 

      Reviewer 2:

      This research investigates how people might value different factors that contribute to controllability in a creative and thorough way. The authors use computational modeling to try to dissociate "elasticity" from "overall controllability," and find some differential associations with psychopathology. This was a convincing justification for using modeling above and beyond behavioral output and yielded interesting results. Interestingly, the authors conclude that these findings suggest that biased elasticity could distort agency beliefs via maladaptive resource allocation. Overall, this paper reveals some important findings about how people consider components of controllability.

      We appreciate the Reviewer's positive assessment of our findings and computational approach to dissociating elasticity and overall controllability.

      The primary weakness of this research is that it is not entirely clear what is meant by "elastic" and "inelastic" and how these constructs differ from existing considerations of various factors/calculations that contribute to perceptions of and decisions about controllability. I think this weakness is primarily an issue of framing, where it's not clear whether elasticity is, in fact, theoretically dissociable from controllability. Instead, it seems that the elements that make up "elasticity" are simply some of the many calculations that contribute to controllability. In other words, an "elastic" environment is inherently more controllable than an "inelastic" one, since both environments might have the same level of predictability, but in an "elastic" environment, one can also partake in additional actions to have additional control over achieving the goal (i.e., expend effort, money, time).

      We thank the reviewer for highlighting the lack of clarity in our concept of elasticity. We first clarify that elasticity cannot be entirely dissociated from controllability because it is a dimension of controllability. If no controllability is afforded, then there cannot be elasticity or inelasticity. This is why in describing the experimental environments, we only label high-controllability, but not low-controllability, environments as ‘elastic’ or ‘inelastic’. For further details on this conceptualization of elasticity, and a planned revision of the text, see our response above to Reviewer 1. 

      Second, we now clarify that controllability can also be computed without knowing the amount of resources the agent is able and willing to invest, for instance by assuming infinite resources available or a particular distribution of resource availabilities. However, knowing the agent’s available resources often reduces uncertainty concerning controllability. This reduction in uncertainty is what we define as elasticity. Since any action requires some resources, this means that no controllable environment is entirely inelastic if we also consider agents that do not have enough resources to commit any action. However, even in this case environments can differ in the degree to which they are elastic. For further details on this formal definition, see our response to Reviewer 3 below. We will make these necessary clarifications in the revised manuscript. 

      Importantly, whether an environment is more or less elastic does not determine whether it is more or less controllable. In particular, environments can be more controllable yet less elastic. This is true even if we allow that investing different levels of resources (i.e., purchasing 0, 1, 2, or 3 tickets) constitutes different actions, in conjunction with participants’ vehicle choices. Below, we show this using two existing definitions of controllability.

      Definition 1, reward-based controllability<sup>1</sup>: If control is defined as the fraction of available reward that is controllably achievable, and we assume all participants are in principle willing and able to invest 3 tickets, controllability can be computed in the present task as:

      where P(S' = goal | 𝑆, 𝐴, 𝐶) is the probability of reaching the treasure from present state 𝑆 when taking action 𝐴 and investing 𝐶 resources in executing the action. In any of the task environments, the probability of reaching the goal is maximized by purchasing 3 tickets (𝐶 = 3) and choosing the vehicle that leads to the goal (𝐴 = correct vehicle). Conversely, the probability of reaching the goal is minimized by purchasing 3 tickets (𝐶 = 3) and choosing the vehicle that does not lead to the goal (𝐴 = wrong vehicle). This calculation is thus entirely independent of elasticity, since it only considers what would be achieved by maximal resource investment, whereas elasticity consists of the reduction in controllability that would arise if the maximal available 𝐶 is reduced. Consequently, any environment where the maximum available control is higher yet varies less with resource investment would be more controllable and less elastic.
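
Under this description, the quantity can be sketched as the difference between the best and worst achievable probabilities of reaching the goal. The symbol χ (borrowed from the formal definition given in the response to Reviewer 3) and the max/min form are assumptions inferred from the surrounding text, not a formula taken from the response:

```latex
\[
\chi \;=\; \max_{A,\,C} P(S' = \text{goal} \mid S, A, C) \;-\; \min_{A,\,C} P(S' = \text{goal} \mid S, A, C)
\]
```

On this reading, both extremes are attained at 𝐶 = 3, so the value does not change when the success probabilities at lower ticket counts change.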

      Note that if we also account for ticket costs in calculating reward, this will only reduce the fraction of achievable reward and thus the calculated control in elastic environments.   

      Definition 2, information-theoretic controllability<sup>2</sup>: Here controllability is defined as the reduction in outcome entropy due to knowing which action is taken:

      I(S'; A, C | S) = H(S'|S) - H(S'|S, A, C)

      where H(S'|S) is the conditional entropy of the distribution of outcomes S' given the present state 𝑆, and H(S'|S, A, C) is the conditional entropy of the outcome given the present state, action, and resource investment. 

      To compare controllability, we consider two environments with the same maximum control:

      • Inelastic environment: If the correct vehicle is chosen, there is a 100% chance of reaching the goal state with 1, 2, or 3 tickets. Thus, out of 7 possible action-resource investment combinations, three deterministically lead to the goal state (≥1 tickets and correct vehicle choice), three never lead to it (≥1 tickets and wrong vehicle choice), and one (0 tickets) leads to it 20% of the time (since walking leads to the treasure on 20% of trials).

      • Elastic environment: If the correct vehicle is chosen, the probability of boarding it is 0% with 1 ticket, 50% with 2 tickets, and 100% with 3 tickets. Thus, out of 7 possible action-resource investment combinations, one deterministically leads to the goal state (3 tickets and correct vehicle choice), one never leads to it (3 tickets and wrong vehicle choice), one leads to it 60% of the time (2 tickets and correct vehicle choice: 50% boarding + 50% × 20% when failing to board), one leads to it 10% of the time (2 tickets and wrong vehicle choice), and three lead to it 20% of the time (0-1 tickets).

      Here we assume a uniform prior over actions, which renders the information-theoretic definition of controllability equal to another definition termed ‘instrumental divergence’<sup>3,4</sup>. We note that changing the uniform prior assumption would change the results for the two environments, but that would not change the general conclusion that there can be environments that are more controllable yet less elastic.

      Step 1: Calculating H(S'|S)

      For the inelastic environment:

      P(goal) = (3 × 100% + 3 × 0% + 1 × 20%)/7 = .46, P(non-goal) = .54

      H(S'|S) = – [.46 × log<sub>2</sub>(.46) + .54 × log<sub>2</sub>(.54)] = 1 bit

      For the elastic environment:

      P(goal) = (1 × 100% + 1 × 0% + 1 × 60% + 1 × 10% + 3 × 20%)/7 = .33, P(non-goal) = .67

      H(S'|S) = – [.33 × log<sub>2</sub>(.33) + .67 × log<sub>2</sub>(.67)] = .91 bits

      Step 2: Calculating H(S'|S, A, C)

      Inelastic environment: Six action-resource investment combinations have deterministic outcomes entailing zero entropy, whereas investing 0 tickets has a probabilistic outcome (20%). The entropy for 0 tickets is: H(S'|C = 0) = – [.2 × log<sub>2</sub>(.2) + .8 × log<sub>2</sub>(.8)] = .72 bits. Since this action-resource investment combination is chosen with probability 1/7, the total conditional entropy is approximately .10 bits.

      Elastic environment: 2 actions have deterministic outcomes (3 tickets with correct/wrong vehicle), whereas the other 5 actions have probabilistic outcomes:

      2 tickets and correct vehicle (60% success):

      H(S'|A = correct, C = 2) = – [.6 × log<sub>2</sub>(.6) + .4 × log<sub>2</sub>(.4)] = .97 bits

      2 tickets and wrong vehicle (10% success):

      H(S'|A = wrong, C = 2) = – [.1 × log<sub>2</sub>(.1) + .9 × log<sub>2</sub>(.9)] = .47 bits

      0-1 tickets (20% success):

      H(S'|C = 0-1) = – [.2 × log<sub>2</sub>(.2) + .8 × log<sub>2</sub>(.8)] = .72 bits

      Thus the total conditional entropy of the elastic environment is: H(S'|S, A, C) = (1/7) × .97 + (1/7) × .47 + (3/7) × .72 = .52 bits

      Step 3: Calculating I(S'; A, C | S)

      Inelastic environment: I(S'; A, C | S) = H(S'|S) – H(S'|S, A, C) = 1 – 0.1 = .9 bits 

      Elastic environment: I(S'; A, C | S) = H(S'|S) – H(S'|S, A, C) = .91 – .52 = .39 bits

      Thus, the inelastic environment offers higher information-theoretic controllability (.9 bits) compared to the elastic environment (.39 bits). 
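
The three steps above can be reproduced with a short script. It is a minimal sketch that assumes, as in the text, a uniform prior over the seven action-resource investment combinations and a binary goal / non-goal outcome; the function names are illustrative.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a goal / non-goal outcome with P(goal) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p))


def controllability(goal_probs):
    """I(S'; A, C | S) under a uniform prior over the listed combinations."""
    n = len(goal_probs)
    # Step 1: entropy of the outcome marginalised over combinations.
    h_outcome = binary_entropy(sum(goal_probs) / n)
    # Step 2: average entropy of the outcome given each combination.
    h_conditional = sum(binary_entropy(p) for p in goal_probs) / n
    # Step 3: mutual information is the difference between the two.
    return h_outcome - h_conditional


# P(goal) for each of the 7 combinations, taken from the two bullet points.
inelastic = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.2]
elastic = [1.0, 0.0, 0.6, 0.1, 0.2, 0.2, 0.2]

print(f"inelastic: {controllability(inelastic):.2f} bits")
print(f"elastic:   {controllability(elastic):.2f} bits")
```

Without rounding the intermediate quantities, the results come out at roughly .89 and .40 bits, close to the .9 and .39 bits reported above; the ordering of the two environments is unchanged.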

      Of note, even if each combination of cost and goal reaching is defined as a distinct outcome, then information-theoretic controllability is higher for the inelastic (2.81 bits) than for the elastic (2.30 bits) environment. 

      In sum, for both definitions of controllability, we see that environments can be more elastic yet less controllable. We will amend the manuscript to clarify this distinction between controllability and its elasticity.

      Reviewer 3:

      A bias in how people infer the amount of control they have over their environment is widely believed to be a key component of several mental illnesses including depression, anxiety, and addiction. Accordingly, this bias has been a major focus in computational models of those disorders. However, all of these models treat control as a unidimensional property, roughly, how strongly outcomes depend on action. This paper proposes---correctly, I think---that the intuitive notion of "control" captures multiple dimensions in the relationship between action and outcome. In particular, the authors highlight the degree to which outcome depends on how much *effort* we exert, calling this dimension the "elasticity of control". They additionally propose that this dimension (rather than the more holistic notion of controllability) may be specifically impaired in certain types of psychopathology. This idea thus has the potential to change how we think about mental disorders in a substantial way, and could even help us better understand how healthy people navigate challenging decision-making problems.

      Unfortunately, my view is that neither the theoretical nor empirical aspects of the paper really deliver on that promise. In particular, most (perhaps all) of the interesting claims in the paper have weak empirical support.

      We appreciate the Reviewer's thoughtful engagement with our research and recognition of the potential significance of distinguishing between different dimensions of control in understanding psychopathology. We believe that all the Reviewer’s comments can be addressed with clarifications or additional analyses, as detailed below.  

      Starting with theory, the elasticity idea does not truly "extend" the standard control model in the way the authors suggest. The reason is that effort is simply one dimension of action. Thus, the proposed model ultimately grounds out in how strongly our outcomes depend on our actions (as in the standard model). Contrary to the authors' claims, the elasticity of control is still a fixed property of the environment. Consistent with this, the computational model proposed here is a learning model of this fixed environmental property. The idea is still valuable, however, because it identifies a key dimension of action (namely, effort) that is particularly relevant to the notion of perceived control. Expressing the elasticity idea in this way might support a more general theoretical formulation of the idea that could be applied in other contexts. See Huys & Dayan (2009), Zorowitz, Momennejad, & Daw (2018), and Gagne & Dayan (2022) for examples of generalizable formulations of perceived control.

      We thank the Reviewer for the suggestion that we formalize our concept of elasticity to resource investment, which we agree is a dimension of action. We first note that we have not argued against the claim that elasticity is a fixed property of the environment. We surmise the Reviewer might have misread our statement that “controllability is not a fixed property of the environment”. The latter statement is motivated by the observation that controllability is often higher for agents that can invest more resources (e.g., a richer person can buy more things). We will clarify this in our revision of the manuscript.

      To formalize elasticity, we build on Huys & Dayan’s definition of controllability<sup>1</sup> as the fraction of reward that is controllably achievable, 𝜒 (though using information-theoretic definitions<sup>2,3</sup> would work as well). To the extent that this fraction depends on the amount of resources the agent is able and willing to invest (max 𝐶), controllability can be computed probabilistically without information about the particular agent involved, specifically, by assuming a certain distribution of agents with different amounts of available resources. This would result in a probability distribution over 𝜒. Elasticity can thus be defined as the amount of information obtained about controllability due to knowing the amount of resources available to the agent: I(𝜒; max 𝐶). We will add this formal definition to the manuscript.
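
For readers less familiar with the notation, the definition can be unpacked using the standard identity for mutual information. The expansion below is a generic property of I(·;·) rather than an additional assumption of the model, with H denoting entropy over the assumed distribution of agents:

```latex
\[
I(\chi;\, \max C) \;=\; H(\chi) \;-\; H(\chi \mid \max C)
\]
```

Read this way, elasticity is zero when knowing the agent's available resources leaves the uncertainty about 𝜒 unchanged, and it grows as that knowledge narrows the distribution over 𝜒.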

      Turning to experiment, the authors make two key claims: (1) people infer the elasticity of control, and (2) individual differences in how people make this inference are importantly related to psychopathology. Starting with claim 1, there are three sub-claims here; implicitly, the authors make all three. (1A) People's behavior is sensitive to differences in elasticity, (1B) people actually represent/track something like elasticity, and (1C) people do so naturally as they go about their daily lives. The results clearly support 1A. However, 1B and 1C are not supported. Starting with 1B, the experiment cannot support the claim that people represent or track elasticity because the effort is the only dimension over which participants can engage in any meaningful decision-making (the other dimension, selecting which destination to visit, simply amounts to selecting the location where you were just told the treasure lies). Thus, any adaptive behavior will necessarily come out in a sensitivity to how outcomes depend on effort. More concretely, any model that captures the fact that you are more likely to succeed in two attempts than one will produce the observed behavior. The null models do not make this basic assumption and thus do not provide a useful comparison.

      We appreciate the reviewer's critical analysis of our claims regarding elasticity inference, which as detailed below, has led to an important new analysis that strengthens the study’s conclusions. However, we respectfully disagree with two of the Reviewer’s arguments. First, resource investment was not the only meaningful decision dimension in our task, since participants also needed to choose the correct vehicle to get to the right destination. That this was not trivial is evidenced by our exclusion of over 8% of participants who made incorrect vehicle choices more than 10% of the time. Included participants also occasionally erred in this choice (mean error rate = 3%, range [0-10%]).

      Second, the experimental task cannot be solved well by a model that simply tracks how outcomes depend on effort because 20% of the time participants reached the treasure despite failing to board their vehicle of choice. In such cases, reward outcomes and control were decoupled. Participants could identify when this was the case by observing the starting location, which was revealed together with the outcome (since depending on the starting location, the treasure location was automatically reached by walking). To determine whether participants distinguished between control-related and non-control-related reward, we have now fitted a variant of our model to the data that allows learning from each of these kinds of outcomes by means of a different free parameter. The results show that participants learned considerably more from control-related outcomes. They were thus not merely tracking outcomes, but specifically inferred when outcomes can be attributed to control. We will include this new analysis in the revised manuscript.

      Controllability inference by itself, however, still does not suffice to explain the observed behavior. This is shown by our ‘controllability’ model, which learns to invest more resources to improve control, yet still fails to capture key features of participants’ behavior, as detailed in the manuscript. This means that explaining participants’ behavior requires a model that not only infers controllability—beyond merely outcome probability—but also assumes a priori that increased effort could enhance control. Building this a priori assumption into the model amounts to embedding within it an understanding of elasticity – the idea that control over the environment may be increased by greater resource investment.

      That being said, we acknowledge the value in considering alternative computational formulations of adaptation to elasticity. Thus, in our revision of the manuscript, we will add a discussion concerning possible alternative models.  

      For 1C, the claim that people infer elasticity outside of the experimental task cannot be supported because the authors explicitly tell people about the two notions of control as part of the training phase: "To reinforce participants' understanding of how elasticity and controllability were manifested in each planet, [participants] were informed of the planet type they had visited after every 15 trips." (line 384).

      We thank the reviewer for highlighting this point. We agree that our experimental design does not test whether people infer elasticity spontaneously. Our research question was whether people can distinguish between elastic and inelastic controllability. The results strongly support that they can, and this does have potential implications for behavior outside of the experimental task. Specifically, to the extent that people are aware that in some contexts additional resource investment improves control, whereas in other contexts it does not, then our results indicate that they would be able to distinguish between these two kinds of contexts through trial-and-error learning. That said, we agree that investigating whether and how people spontaneously infer elasticity is an interesting direction for future work. We will clarify the scope of the present conclusions in the revised manuscript.

      Finally, I turn to claim 2, that individual differences in how people infer elasticity are importantly related to psychopathology. There is much to say about the decision to treat psychopathology as a unidimensional construct. However, I will keep it concrete and simply note that CCA (by design) obscures the relationship between any two variables. Thus, as suggestive as Figure 6B is, we cannot conclude that there is a strong relationship between Sense of Agency and the elasticity bias---this result is consistent with any possible relationship (even a negative one). The fact that the direct relationship between these two variables is not shown or reported leads me to infer that they do not have a significant or strong relationship in the data.

      We agree that CCA is not designed to reveal the relationship between any two variables. However, the advantage of this analysis is that it pulls together information from multiple variables. Doing so does not treat psychopathology as unidimensional. Rather, it seeks a particular dimension that most strongly correlates with different aspects of task performance. This is especially useful for multidimensional psychopathology data because such data are often dominated by strong correlations between dimensions, whereas the research seeks to explain the distinctions between the dimensions. Similar considerations hold for the multidimensional task parameters, which although less correlated, may still jointly predict the relevant psychopathological profile better than each parameter does in isolation. Thus, the CCA enabled us to identify a general relationship between task performance and psychopathology that accounts for different symptom measures and aspects of controllability inference. 

      Using CCA can thus reveal relationships that do not readily show up in two-variable analyses. Indeed, the direct correlation between Sense of Agency (SOA) and elasticity bias was not significant – a result that, for completeness, we will now report in the supplementary materials along with all other direct correlations. We note, however, that the CCA analysis was preregistered and its results were replicated. Furthermore, an auxiliary analysis specifically confirmed the contributions of both elasticity bias (Figure 6D, bottom plot) and, although not reported in the original paper, of the Sense of Agency score (SOA; p = .03, permutation test) to the observed canonical correlation. Participants scoring higher on the psychopathology profile also overinvested resources in inelastic environments but did not futilely invest in uncontrollable environments (Figure 6A), providing external validation to the conclusion that the CCA captured meaningful variance specific to elasticity inference. The results thus enable us to safely conclude that differences in elasticity inferences are significantly associated with a profile of control-related psychopathology to which SOA contributed significantly.

      Finally, whereas interpretation of individual CCA loadings that were not specifically tested remains speculative, we note that the pattern of loadings largely replicated across the initial and replication studies (see Figure 6B), and aligns with prior findings. For instance, the positive loadings of SOA and OCD match prior suggestions that a lower sense of control leads to greater compensatory effort<sup>7</sup>, whereas the negative loading for depression scores matches prior work showing reduced resource investment in depression<sup>5-6</sup>.

      We will revise the text to better clarify the advantages and disadvantages of our analytical approach, and the conclusions that can and cannot be drawn from it.

      There is also a feature of the task that limits our ability to draw strong conclusions about individual differences in elasticity inference. As the authors clearly acknowledge, the task was designed "to be especially sensitive to overestimation of elasticity" (line 287). A straightforward consequence of this is that the resulting *empirical* estimate of estimation bias (i.e., the gamma_elasticity parameter) is itself biased. This immediately undermines any claim that references the directionality of the elasticity bias (e.g. in the abstract). Concretely, an undirected deficit such as slower learning of elasticity would appear as a directed overestimation bias. When we further consider that elasticity inference is the only meaningful learning/decision-making problem in the task (argued above), the situation becomes much worse. Many general deficits in learning or decision-making would be captured by the elasticity bias parameter. Thus, a conservative interpretation of the results is simply that psychopathology is associated with impaired learning and decision-making.

      We apologize for our imprecise statement that the task was ‘especially sensitive to overestimation of elasticity’, which justifiably led to the Reviewer’s concern that slower elasticity learning can be mistaken for elasticity bias. To make sure this was not the case, we made use of the fact that our computational model explicitly separates bias direction (λ) from the rate of learning through two distinct parameters, which initialize the prior concentration and mean of the model’s initial beliefs concerning elasticity (see Methods pg. 22). The higher the concentration of the initial beliefs (𝜖), the slower the learning. Parameter recovery tests confirmed that our task enables acceptable recovery of both the bias λ<sub>elasticity</sub> (r = .81) and the concentration 𝝐<sub>elasticity</sub> (r = .59) parameters. Importantly, the level of confusion between the parameters was low (confusion of 0.15 for 𝝐<sub>elasticity</sub> → λ<sub>elasticity</sub> and 0.04 for λ<sub>elasticity</sub> → 𝝐<sub>elasticity</sub>). This result confirms that our task enables dissociating elasticity biases from the rate of elasticity learning.
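
The separation between bias and learning rate can be illustrated with a toy belief update. The sketch below assumes a Beta-distributed belief initialised by a prior mean (playing the role of λ) and a concentration (playing the role of 𝜖); the Beta form and the function name are illustrative assumptions, not a description of the fitted model.

```python
def posterior_mean(prior_mean, concentration, successes, failures):
    """Posterior mean of a Beta belief initialised by mean and concentration.

    prior_mean stands in for the bias parameter and concentration for the
    prior strength; both names are hypothetical labels for this sketch.
    """
    alpha = prior_mean * concentration + successes
    beta = (1.0 - prior_mean) * concentration + failures
    return alpha / (alpha + beta)


# Same optimistic prior mean and same disconfirming evidence (5 failures),
# but different prior concentrations.
weak_prior = posterior_mean(0.8, concentration=2, successes=0, failures=5)
strong_prior = posterior_mean(0.8, concentration=20, successes=0, failures=5)
print(weak_prior, strong_prior)
```

With the same prior mean and the same evidence, the belief with the higher concentration stays closer to its starting point, which is the sense in which concentration governs how fast beliefs are revised while the mean sets their initial direction.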

      Moreover, to validate that the minimal level of confusion existing between bias and the rate of learning did not drive our psychopathology results, we re-ran the CCA while separating concentration from bias parameters. The results (Author response image 1) demonstrate that differences in learning rate (𝜖) had virtually no contribution to our CCA results, whereas the contribution of the pure bias (𝜆) was preserved. 

      We will incorporate these clarifications and additional analysis in our revised manuscript.

      Author response image 1.

      Showing that a model parameter correlates with the data it was fit to does not provide any new information, and cannot support claims like "a prior assumption that control is likely available was reflected in a futile investment of resources in uncontrollable environments." To make that claim, one must collect independent measures of the assumption and the investment.

      We apologize if this and related statements seemed to be describing independent findings. They were merely meant to describe the relationship between model parameters and model-independent measures of task performance. It is inaccurate, though, to say that they provide no new information, since results could have been otherwise. For instance, instead of a higher controllability bias primarily associating with futile investment of resources in uncontrollable environments, it could have been primarily associated with more proper investment of resources in high-controllability environments. Additionally, we believe these analyses are of value to readers who seek to understand the role of different parameters in the model. In our planned revision, we will clarify that the relevant analyses are merely descriptive. 

      Did participants always make two attempts when purchasing tickets? This seems to violate the intuitive model, in which you would sometimes succeed on the first jump. If so, why was this choice made? Relatedly, it is not clear to me after a close reading how the outcome of each trial was actually determined.

      We thank the reviewer for highlighting the need to clarify these aspects of the task in the revised manuscript. 

      When participants purchased two extra tickets, they attempted both jumps, and were never informed about whether either of them succeeded. Instead, after choosing a vehicle and attempting both jumps, participants were told where they had arrived. This outcome was determined based on the cumulative probability of either of the two jumps succeeding. Success meant that participants arrived at where their chosen vehicle goes, whereas failure meant they walked to the nearest location (as determined by where they started from). 

      Though it is unintuitive to attempt a second jump before seeing whether the first succeeded, this design choice served two key objectives. First, that participants would consistently need to invest not only more money but also more effort and time in planets with high elastic controllability. Second, that the task could potentially generalize to the many real-world situations where the amount of invested effort has to be determined prior to seeing any outcome, for instance, preparing for an exam or a job interview. 

      It should be noted that the model is heuristically defined and does not reflect Bayesian updating. In particular, it overestimates control by not using losses with less than 3 tickets (intuitively, the inference here depends on your beliefs about elasticity). I wonder if the forced three-ticket trials in the task might be historically related to this modeling choice.

      We apologize for not making this clear, but in fact losing with less than 3 tickets does reduce the model’s estimate of available control. It does so by increasing the elasticity estimates (a<sub>elastic≥1</sub>, a<sub>elastic2</sub> parameters), signifying that more tickets are needed to obtain the maximum available level of control, thereby reducing the average controllability estimate across ticket investment options. 

      It would be interesting to further develop the model such that losing with less than 3 tickets would also impact inferences concerning the maximum available control, depending on present beliefs concerning elasticity, but the forced three-ticket purchases already expose participants to the maximum available control, and thus, the present data may not be best suited to test such a model. These trials were implemented to minimize individual differences concerning inferences of maximum available control, thereby focusing differences on elasticity inferences. We will discuss the Reviewer’s suggestion for a potentially more accurate model in the revised manuscript. 

      References

      (1) Huys, Q. J. M., & Dayan, P. (2009). A Bayesian formulation of behavioral control. Cognition, 113(3), 314–328.

      (2) Ligneul, R. (2021). Prediction or causation? Towards a redefinition of task controllability. Trends in Cognitive Sciences, 25(6), 431–433.

      (3) Mistry, P., & Liljeholm, M. (2016). Instrumental divergence and the value of control. Scientific Reports, 6, 36295.

      (4) Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151.

      (5) Cohen RM, Weingartner H, Smallberg SA, Pickar D, Murphy DL. Effort and cognition in depression. Arch Gen Psychiatry. 1982 May;39(5):593-7. doi: 10.1001/archpsyc.1982.04290050061012. PMID: 7092490.

      (6) Bi R, Dong W, Zheng Z, Li S, Zhang D. Altered motivation of effortful decision-making for self and others in subthreshold depression. Depress Anxiety. 2022 Aug;39(8-9):633-645. doi: 10.1002/da.23267. Epub 2022 Jun 3. PMID: 35657301; PMCID: PMC9543190.

      (7) Tapal, A., Oren, E., Dar, R., & Eitam, B. (2017). The Sense of Agency Scale: A measure of consciously perceived control over one's mind, body, and the immediate environment. Frontiers in Psychology, 8, 1552.

    1. Author response: 

      We thank the reviewers for their feedback on our paper. We have taken all their comments into account in revising the manuscript. We provide a point-by-point response to their comments, below.

      Reviewer #1:

      Major comments:

      The manuscript is clearly written with a level of detail that allows others to reproduce the imaging and cell-tracking pipeline. Of the 22 movies recorded one was used for cell tracking. One movie seems sufficient for the second part of the manuscript, as this manuscript presents a proof-of-principle pipeline for an imaging experiment followed by cell tracking and molecular characterisation of the cells by HCR. In addition, cell tracking in a 5-10 day time-lapse movie is an enormous time commitment.

      My only major comment is regarding "Suppl_data_5_spineless_tracking". The image file does not load.

      It looks like the wrong file is linked to the mastodon dataset. The "Current BDV dataset path" is set to "Beryl_data_files/BLB mosaic cut movie-02.xml", but this file does not exist in the folder. Please link it to the correct file.

      We have corrected the file path in the updated version of Suppl. Data 5.

      Minor comments:

      The authors state that their imaging settings aim to reduce photo damage. Do they see cell death in the regenerating legs? Is the cell death induced by the light exposure or can they tell if the same cells die between the movies? That is, do they observe cell death in the same phases of regeneration and/or in the same regions of the regenerating legs?

      Yes, we observe cell death during Parhyale leg regeneration. We have added the following sentence to explain this in the revised manuscript: "During the course of regeneration some cells undergo apoptosis (reported in Alwes et al., 2016). Using the H2B-mRFPruby marker, apoptotic cells appear as bright pyknotic nuclei that break up and become engulfed by circulating phagocytes (see bright specks in Figure 2F)."

      We now also document apoptosis in regenerated legs that have not been subjected to live imaging in a new supplementary figure (Suppl. Figure 3),  and we refer to these observations as follows: "While some cell death might be caused by photodamage, apoptosis can also be observed in similar numbers in regenerating legs that have not been subjected to live imaging (Suppl. Figure 3)."

      Based on 22 movies, the authors divide the regeneration process into three phases and they describe that the timing of leg regeneration varies between individuals. Are the phases proportionally the same length between regenerating legs or do the authors find differences between fast/slow regenerating legs? If there is a difference in the proportions, why might this be?

      Both early and late phases contribute to variation in the speed of regeneration, but there is no clear relationship between the relative duration of each phase and the speed of regeneration. We now present graphs supporting these points in a new supplementary figure (Suppl. Figure 2).  

      To clarify this point, we have added the following sentence in the manuscript: "We find that the overall speed of leg regeneration is determined largely by variation in the speed of the early (wound closure) phase of regeneration, and to a lesser extent by variation in later phases when leg morphogenesis takes place (Suppl. Figure 2 A,B). There is no clear relationship between the relative duration of each phase and the speed of regeneration (Suppl. Figure 2 A',B')."

      Based on their initial cell tracing experiment, could the authors elaborate more on what kind of biological information can be extracted from the cell lineages, apart from determining which is the progenitor of a cell? What does it tell us about the cell population in the tissue? Is there indication of multi- or pluripotent stem cells? What does it say about the type of regeneration that is taking place in terms of epimorphosis and morphallaxis, the old concepts of regeneration?

      In the first paragraph of Future Directions we describe briefly the kind of biological information that could be gained by applying our live imaging approach with appropriate cell-type markers (see below). We do not comment further, as we do not currently have this information at hand. Regarding the concepts of epimorphosis and morphallaxis, as we explain in Alwes et al. 2016, these terms describe two extreme conditions that do not capture what we observe during Parhyale leg regeneration. Our current work does not bring new insights on this topic.

      Page 5. The authors mention the possibility of identifying the cell ID based on transcriptomic profiling data. Can they suggest how many and which cell types they expect to find in the last stage based on their transcriptomic data?

      We have added this sentence: "Using single-nucleus transcriptional profiling, we have identified approximately 15 transcriptionally-distinct cell types in adult Parhyale legs (Almazán et al., 2022), including epidermis, muscle, neurons, hemocytes, and a number of still unidentified cell types."

      Page 6. Correction: "..molecular and other makers.." should be "..molecular and other markers.."

      Corrected

      Page 8. The HCR in situ protocol probably has another important advantage over the conventional in situ protocol, which is not mentioned in this study. The hybridisation step in HCR is performed at a lower temperature (37˚C) than in conventional in situ hybridisation (65˚C, Rehm et al., 2009). In other organisms, a high hybridisation temperature affects the overall tissue morphology and cell location (tissue shrinkage). A lower hybridisation temperature has less impact on the tissue and makes manual cell alignment between the live imaging movie and the fixed HCR in situ stained specimen easier and more reliable. If this is also the case in Parhyale, the authors must mention it.

      This may be correct, but all our specimens were treated at 37˚C, so we cannot assess whether hybridisation temperature affects morphological preservation in our specimens.

      Page 9. The authors should include more information on the spineless study. What is spineless? What do the cell lineages tell about the spineless progenitors, apart from them being spread in the tissue at the time of amputation? Do spineless progenitors proliferate during regeneration? Do any spineless expressing cells share a common progenitor cell?

      We now point out that spineless encodes a transcription factor. We provide a summary of the lineages generating spineless-expressing cells in Suppl. Figure 6, and we explain that "These epidermal progenitors undergo 0, 1 or 2 cell divisions, and generate mostly spineless-expressing cells (Suppl. Figure 5)."

      Page 10. Regarding the imaging temperature, the Materials and Methods state "... a temperature control chamber set to 26 or 27˚C..."; however, in Suppl. Data 1, 26˚C and 29˚C are indicated as imaging temperatures. Which is correct?

      We corrected the Methods by adding "with the exception of dataset li51, imaged at 29°C"

      Page 10. Regarding the imaging step size, the Materials and Methods state "...step size of 1-2.46 µm..."; however, Suppl. Data 1 indicate a step size between 1.24 - 2.48 µm. Which is correct?

      We corrected the Methods.

      Page 11. Correct "...as the highest resolution data..." to "...at the highest resolution data..."

      The original text is correct ("standardised to the same dimensions as the highest resolution data").

      Page 11. Indicate which supplementary data set is referred to: "Using Mastodon, we generated ground truth annotations on the original image dataset, consisting of 278 cell tracks, including 13,888 spots and 13,610 links across 55 time points (see Supplementary Data)."

      Corrected

      p. 15. Indicate which supplementary data set is referred to: "In this study we used HCR probes for the Parhyale orthologues of futsch (MSTRG.441), nompA (MSTRG.6903) and spineless (MSTRG.197), ordered from Molecular Instruments (20 oligonucleotides per probe set). The transcript sequences targeted by each probe set are given in the Supplementary Data."

      Corrected

      Figure 3. Suggestion to the overview schematics: The authors might consider adding "molting" as the end point of the red bar (representing differentiation).

      The time of molting is not known in the majority of these datasets, because the specimens were fixed and stained prior to molting. We added the relevant information in the figure legend: "Datasets li-13 and li-16 were recorded until the molt; the other recordings were stopped before molting."

      Figure 4B': Please indicate that the nuclei signal is DAPI.

      Corrected

      Supplementary figure 1A. Word is missing in the figure legend: ...the image also shows weak…

      Corrected

      Supplementary Figure 2: Please indicate the autofluorescence in the granular cells. Does it correspond to the yellow cells?

      Corrected

      Video legend for video 1 and 2. Please correct "H2B-mREFruby" to "H2B-mRFPruby".

      Corrected

      Reviewer #2:

      Major comments:

      MC 1. Given that most of the technical advances necessary to achieve the work described in this manuscript have been published previously, it would be helpful for the authors to more clearly identify the primary novelty of this manuscript. The abstract and introduction to the manuscript focus heavily on the technical details of imaging and analysis optimization and some additional summary of the implications of these advances should be included here to aid the reader.

      This paper describes a technical advance. While previous work (Alwes et al. 2016) established some key elements of our live imaging approach, we were not at that time able to record the entire time course of leg regeneration (the longest recordings were 3.5 days long). Here we present a method for imaging the entire course of leg regeneration (up to 10 days of imaging), optimised to reduce photodamage and to improve cell tracking. We also develop a method of in situ staining in cuticularised adult legs (an important technical breakthrough in this experimental system), which we combine with live imaging to determine the fate of tracked cells. We have revised the abstract and introduction of the paper to point out these novelties, in relation to our previous publications.

      In the abstract we explain: "Building on previous work that allowed us to image different parts of the process of leg regeneration in the crustacean Parhyale hawaiensis, we present here a method for live imaging that captures the entire process of leg regeneration, spanning up to 10 days, at cellular resolution. Our method includes (1) mounting and long-term live imaging of regenerating legs under conditions that yield high spatial and temporal resolution but minimise photodamage, (2) fixing and in situ staining of the regenerated legs that were imaged, to identify cell fates, and (3) computer-assisted cell tracking to determine the cell lineages and progenitors of identified cells. The method is optimised to limit light exposure while maximising tracking efficiency."

      The introduction includes the following text: "Our first systematic study using this approach presented continuous live imaging over periods of 2-3 days, capturing key events of leg regeneration such as wound closure, cell proliferation and morphogenesis of regenerating legs with single-cell resolution (Alwes et al., 2016). Here, we extend this work by developing a method for imaging the entire course of leg regeneration, optimised to reduce photodamage and to improve cell tracking. We also develop a method of in situ staining of gene expression in cuticularised adult legs, which we combine with live imaging to determine the fate of tracked cells."

      MC 2. The description of the regeneration time course is nicely detailed but also very qualitative. A major advantage of continuous recording and automated cell tracking in the manner presented in this manuscript would be to enable deeper quantitative characterization of cellular and tissue dynamics during regeneration. Rather than providing movies and manually annotated timelines, some characterization of the dynamics of the regeneration process (the heterogeneity in this is very very interesting, but not analyzed at all) and correlating them against cellular behaviors would dramatically increase the impact of the work and leverage the advances presented here. For example, do migration rates differ between replicates? Division rates? Division synchrony? Migration orientation? This seems to be an incredibly rich dataset that would be fascinating to explore in greater detail, which seems to me to be the primary advance presented in this manuscript. I can appreciate that the authors may want to segregate some biological findings from the method, but I believe some nominal effort highlighting the quantitative nature of what this method enables would strengthen the impact of the paper and be useful for the reader. Selecting a small number of simple metrics (eg. Division frequency, average cell migration speed) and plotting them alongside the qualitative phases of the regeneration timeline that have already been generated would be a fairly modest investment of effort using tools that already exist in the Mastodon interface, I would roughly estimate on the order of an hour or two per dataset. I believe that this effort would be well worth it and better highlight a major strength of the approach.

      The primary goal of this work was to establish a robust method for continuous long-term live imaging of regeneration, but we do appreciate that a more quantitative analysis would add value to the data we are presenting. We tried to address this request in three steps:

      First, we examined whether clear temporal patterns in cell division, cell movements or other cellular features can be observed in an accurately tracked dataset (li13-t4, tracked in Sugawara et al. 2022). To test this we used the feature extraction functions now available on the Mastodon platform (see link). We could discern a meaningful temporal pattern for cell divisions (see below); the other features showed no interpretable pattern of variation.

      Second, we asked whether we could use automated cell tracking to analyse the patterns of cell division in all our datasets. Using an Elephant deep learning model trained on the tracks of the li13-t4 dataset, we performed automated cell tracking in the same dataset, and compared the pattern of cell divisions from the automated cell track predictions with those coming from manually validated cell tracks. We observed that the automated tracks gave very imprecise results, with a high background of false positives obscuring the real temporal pattern (see images below, with validated data on the left, automated tracking on the right). These results show that the automated cell tracking is not accurate enough to provide a meaningful picture of the pattern of cell divisions.

      Third, we tried to improve the accuracy of detection of dividing cells by additional training of Elephant models on each dataset (to lower the rate of false positives), followed by manual proofreading. Given how labour intensive this is, we could only apply this approach to 4 additional datasets. The results of this analysis are presented in Figure 4.

      Author response image 1.

      MC 3. The authors describe the challenges faced by their described approach:

      Using this mode of semi-automated and manual cell tracking, we find that most cells in the upper slices of our image stacks (top 30 microns) can be tracked with a high degree of confidence. A smaller proportion of cell lineages are trackable in the deeper layers.

      Given that the authors quantify this in Table 1, it would aid the reader to provide metrics in the manuscript text at this point. Furthermore, the metrics provided in Table 1 appear to be for overall performance, but the text describes that performance appears to be heavily depth dependent. Segregating the performance metrics further, for example providing DET, TRA, precision and recall for superficial layers only and for the overall dataset, would help support these arguments and better highlight performance a potential adopter of the method might expect.

      In the revised manuscript we have added data on the tracking performance of Elephant in relation to imaging depth in Suppl. Figure 3. These data confirm our original statement (which was based on manual tracking) that nuclei are more challenging to track in deeper layers.

      We point to these new results in two parts of the paper, as follows: "A smaller proportion of cells are trackable in the deeper layers (see Suppl. Figure 3)", and "Our results, summarised in Table 1A, show that the detection of nuclei can be enhanced by doubling the z resolution at the expense of xy resolution and image quality. This improvement is particularly evident in the deeper layers of the imaging stacks, which are usually the most challenging to track (Suppl. Figure 3)."
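      As an illustration of the depth-segregated metrics discussed in this exchange, the following minimal sketch computes precision and recall separately for superficial and deep nuclei. It assumes that detections have already been matched to ground-truth annotations and share identifiers with them, and that a depth in microns is known for each identifier. The 30 µm cutoff follows the "top 30 microns" mentioned above; the function names and the `depth_of` lookup are hypothetical and are not part of Elephant or Mastodon.

```python
def precision_recall(predicted, truth):
    """Precision and recall of a set of predicted items against ground truth."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

def precision_recall_by_depth(predicted, truth, depth_of, cutoff_um=30.0):
    """Report (precision, recall) for superficial and deep layers separately.

    depth_of maps each item identifier to its depth in microns; items at or
    above cutoff_um from the surface count as superficial.
    """
    def layer(items, superficial):
        return {i for i in items if (depth_of[i] <= cutoff_um) == superficial}

    return {
        "superficial": precision_recall(layer(predicted, True), layer(truth, True)),
        "deep": precision_recall(layer(predicted, False), layer(truth, False)),
    }
```

      Reporting the two layers side by side shows whether an overall score hides weaker performance in the deeper part of the stack.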

      MC 4. Performance characterization in Table 1 appears to derive from a single dataset that is then subsampled and processed in different ways to assess the impact of these changes on cell tracking and detection performance. While this is a suitable strategy for this type of optimization it leaves open the question of performance consistency across datasets. I fully recognize that this type of quantification can be onerous and time consuming, but some attempt to assess performance variability across datasets would be valuable. Manual curation over a short time window over a random sampling of the acquired data would be sufficient to assess this.

      We think that similar trade-offs will apply to all our datasets because tracking performance is constrained by the same features, which are intrinsic to our system; e.g. by the crowding of nuclei in relation to axial resolution, or the speed of mitosis in relation to the temporal resolution of imaging. We therefore do not see a clear rationale for repeating this analysis. On a practical level, our existing image datasets could not be subsampled to generate the various conditions tested in Table 1, so proving this point experimentally would require generating new recordings, and tracking these to generate ground truth data. This would require months of additional work.

      A second, related question is whether Elephant would perform equally well in detecting and tracking nuclei across different datasets. This point has been addressed in the Sugawara et al. 2022 paper, where the performance of Elephant was tested on diverse fluorescence datasets.

      Reviewer #3:

      Major comments:

      • The authors should clearly specify what are the key technical improvements compared to their previous studies (Alwes et al. 2016, Elife; Konstantinides & Averof 2014, Science). There, the approaches for mounting, imaging, and cell tracking are already introduced, and the imaging is reported to run for up to 7 days in some cases.

      In Konstantinides and Averof (2014) we did not present any live imaging at cellular resolution. In Alwes et al. (2016) we described key elements of our live imaging approach, but we were never able to record the entire time course of leg regeneration. The longest recordings in that work were 3.5 days long.

      We have revised the abstract and introduction to clarify the novelty of this work, in relation to our previous publications. Please see our response to comment MC1 of reviewer 2.

      • While the authors mention testing the effect of imaging parameters (such as scanning speed and line averaging) on the imaging/tracking outcome, very little or no information is provided on how this was done beyond the parameters that they finally arrived to.

      Scan speed and averaging parameters were determined by measuring contrast and signal-to-noise ratios in images captured over a range of settings. We have now added these data in Supplementary Figure 1.

      • The authors claim that, using the acquired live imaging data across entire regeneration time course, they are now able to confirm and extend their description of leg regeneration. However, many claims about the order and timing of various cellular events during regeneration are supported only by references to individual snapshots in figures or supplementary movies. Presenting a more quantitative description of cellular processes during regeneration from the acquired data would significantly enhance the manuscript and showcase the usefulness of the improved workflow.

      The events we describe can be easily observed in the maximum projections, available in Suppl. Data 2. Regarding the quantitative analysis, please see our response to comment MC2 of reviewer 2.  

      • Table 1 summarizes the performance of cell tracking using simulated datasets of different quality. However only averages and/or maxima are given for the different metrics, which makes it difficult to evaluate the associated conclusions. In some cases, only 1 or 2 test runs were performed.

      The metrics extracted from each of the three replicates, per dataset, are now included in Suppl. Data 4.

      We consistently used 3 replicates to measure tracking performance with each of the datasets. The "replicates" column label in Table 1 referred to the number of scans that were averaged to generate the image, not to the replicates used for estimating the tracking performance. To avoid confusion, we changed that label to "averaging".

      • OPTIONAL: An imaging approach that allows using the current mounting strategy but could help with some of the tradeoffs is using a spinning-disk confocal microscope instead of a laser scanning one. If the authors have such a system available, it could be interesting to compare it with their current scanning confocal setup.

      Preliminary experiments that we carried out several years ago on a spinning disk confocal (with a 20x objective and the CSU-W1 spinning disk) were not very encouraging, and we therefore did not pursue this approach further. The main problem was bad image quality in deeper tissue layers.

      Minor comments:

      • The presented imaging protocol was optimized for one laser wavelength only (561 nm) - this should be mentioned when discussing the technical limitations since animals tend to react differently to different wavelengths. Same settings might thus not be applicable for imaging a different fluorescent protein.

      In the second paragraph of the Results section, we explain that we perform the imaging at long wavelengths in order to minimise photodamage. It should be clear to the readers that changing the excitation wavelength will have an impact on long-term live imaging.

      • For transferability, it would be useful if the intensity of laser illumination was measured and given in the Methods, instead of just a relative intensity setting from the imaging software. Similarly, more details of the imaging system should be provided where appropriate (e.g., detector specifications).

      We have now measured the intensity of the laser illumination and added this information in the Methods: "Laser power was typically set to 0.3% to 0.8%, which yields 0.51 to 1.37 µW at 561 nm (measured with a ThorLabs Microscope Slide Power Sensor, #S170C)."

      Regarding the imaging system and the detector, we provide all the information that is available to us on the microscope's technical sheets.

      • The versions of analysis scripts associated with the manuscript should be uploaded to an online repository that permanently preserves the respective version.

      The scripts are now available on GitHub and in online repositories. The relevant links are included in the revised manuscript.

    1. Reviewer #2 (Public Review):

      Summary:

      The goal of the authors in this study is to develop a more reliable approach for quantifying codon usage such that it is more comparable across species. Specifically, the authors wish to estimate the degree of adaptive codon usage, which is potentially a general proxy for the strength of selection at the molecular level. To this end, the authors created the Codon Adaptation Index for Species (CAIS) that controls for differences in amino acid usage and GC% across species. Using their new metric, the authors find a previously unobserved negative correlation between the overall adaptiveness of codon usage and body size across 118 vertebrates. As body size is negatively correlated with effective population size and thus the general strength of natural selection, the negative correlation between CAIS and body size is expected. The authors argue this was previously unobserved due to failures of other popular metrics such as Codon Adaptation Index (CAI) and the Effective Number of Codons (ENC) to adequately control for differences in amino acid usage and GC content across species. Most surprisingly, the authors also find a positive relationship between CAIS and the overall "disorderedness" of a species protein domains. As some of these results are unexpected, which is acknowledged by the authors, I think it would be particularly beneficial to work with some simulated datasets. I think CAIS has the potential to be a valuable tool for those interested in comparing codon adaptation across species in certain situations. However, I have certain theoretical concerns about CAIS as a direct proxy for the efficiency of selection when the mutation bias changes across species.

      Strengths:

      (1) I appreciate that the authors recognize the potential issues of comparing CAI when amino acid usage varies and correct for this in CAIS. I think this is sometimes an under-appreciated point in the codon usage literature, as CAI is a relative measure of codon usage bias (i.e. only considers synonyms). However, the strength of natural selection on codon usage can potentially vary across amino acids, such that comparing mean CAI between protein regions with different amino acid biases may result in spurious signals of statistical significance (see Cope et al. Biochimica et Biophysica Acta - Biomembranes 2018 for a clear example of this).
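      To make concrete why CAI only considers synonyms, here is a minimal sketch of the classical Codon Adaptation Index as commonly defined (relative adaptiveness weights normalised within each synonymous family, then a geometric mean over the codons of a sequence). It is not the CAIS metric under review. The three-family `SYNONYMS` table is a toy subset chosen for illustration, and ignoring codons that are absent from the reference set is an assumption of this sketch.

```python
from collections import Counter
from math import exp, log

# Toy subset of synonymous codon families, for illustration only.
SYNONYMS = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
    "Phe": ["TTT", "TTC"],
}

def codons(seq):
    """Split a coding sequence into complete codons, dropping any remainder."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_seqs):
    """Weight of each codon = its count / count of the most used synonym."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    weights = {}
    for family in SYNONYMS.values():
        top = max(counts[c] for c in family)
        for c in family:
            # Codons unseen in the reference get no weight (sketch assumption).
            if top > 0 and counts[c] > 0:
                weights[c] = counts[c] / top
    return weights

def cai(seq, weights):
    """Geometric mean of the weights of the scored codons in seq."""
    logs = [log(weights[c]) for c in codons(seq) if c in weights]
    return exp(sum(logs) / len(logs)) if logs else float("nan")
```

      Because each weight is normalised within its own amino acid, two sequences with different amino acid compositions draw on different families, which is the comparability issue raised in this point.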

      (2) The authors present numerous analysis using both ENC and mean CAI as a comparison to CAIS, helping given a sense of how CAIS corrects for some of the issues with these other metrics. I also enjoyed that they examined the previously unobserved relationship between codon usage bias and body size, which has bugged me ever since I saw Kessler and Dean 2014. The result comparing protein disorder to CAIS was particularly interesting and unexpected.

      (3) The CAIS metric presented here is generally applicable to any species that has an annotated genome with protein-coding sequences.

      Weaknesses:

      (1) The main weakness of this work is that it lacks simulated data to confirm that it works as expected. This would be particularly useful for assessing the relationship between CAIS and the overall effect of protein structure disorder, which the authors acknowledge is an unexpected result. I think simulations could also allow the authors to assess how their metric performs in situations where mutation bias and natural selection act in the same direction vs. opposite directions. Additionally, although I appreciate their comparisons to ENC and mean CAI, a comparison to other popular codon metrics for calculating the overall adaptiveness of a genome (e.g. dos Reis et al.'s statistic, which is a function of the tRNA Adaptation Index (tAI) and ENC) may be more appropriate. Even if results are similar to that statistic, CAIS has a noted advantage in that it doesn't require identifying tRNA gene copy numbers or abundances, which I think are generally less readily available than genomic GC% and protein-coding sequences.

      The authors mention the selection-mutation-drift equilibrium model, which underlies the basic ideas of this work (e.g. higher effective population size results in stronger selection on codon usage), but a more in-depth framing of CAIS in terms of this model is not given. I think this could be valuable, particularly in addressing the question "are we really estimating what we think we're estimating?"

      Let's take a closer look at the formulation for RSCUS. From here on out, subscripts will only be used to denote the codon and it will be assumed that we are only considering the case of for some species

      I think what the authors are attempting to do is "divide out" the effects of mutation bias (as given by , such that only the effects of natural selection remain, i.e. deviations from the expected frequency based on mutation bias alone represent adaptive codon usage. Consider Gilchrist et al. MBE 2015, which says that the expected frequency of codon at selection-mutation-drift equilibrium in gene for an amino acid with synonymous codons is

      where \(\Delta M\) is the mutation bias, \(\Delta\eta\) is the strength of selection scaled by the strength of drift, and \(\phi_g\) is the gene expression level of gene \(g\). In this case, \(\Delta M\) and \(\Delta\eta\) reflect the strength and direction of mutation bias and natural selection relative to a reference codon, for which \(\Delta M = \Delta\eta = 0\). Assuming the selection-mutation-drift equilibrium model is generally adequate to model the true codon usage patterns in a genome (as I do and I think the authors do, too), \(p_{i,g}\) could be considered the expected observed frequency of codon \(i\) in gene \(g\).
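      For readers following the derivation, the equilibrium codon frequency in Gilchrist et al. MBE 2015 is commonly written in the form below. This is a sketch in that paper's usual notation, which is assumed here rather than taken from the reviewer's own equation: \(\Delta M_i\) is the mutation bias term for codon \(i\), \(\Delta\eta_i\) is the selection term scaled by drift, \(\phi_g\) is the expression level of gene \(g\), and \(n_a\) (a label introduced only for illustration) is the number of synonymous codons for the amino acid.

```latex
p_{i,g} \;=\; \frac{\exp\left(-\Delta M_i - \Delta\eta_i\,\phi_g\right)}
                   {\sum_{j=1}^{n_a} \exp\left(-\Delta M_j - \Delta\eta_j\,\phi_g\right)}
```

      Under this form, setting \(\Delta\eta_j = 0\) for every codon leaves frequencies that depend only on the mutation bias terms, which corresponds to the case of no selection on codon usage.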

      Let's re-write the in the form of Gilchrist et al., such that it is a function of mutation bias . For simplicity, we will consider just the two-codon case and assume the amino acid sequence is fixed. Assuming GC% is at equilibrium, the term and can be written as

      where is the mutation rate from nucleotides to. As described in Gilchrist et al. MBE 2015 and Shah and Gilchrist PNAS 2011, the mutation bias . This can be expressed in terms of the equilibrium GC content by recognizing that

      As we are assuming the amino acid sequence is fixed, the probability of observing a synonymous codon at an amino acid becomes just a Bernoulli process.

      If we do this, then

      Recall that in the Gilchrist et al. framework, the reference codon has . Thus, we have recovered the Gilchrist et al. model from the formulation of under the assumption that natural selection has no impact on codon usage and codon NNG is the pre-defined reference codon. To see this, plug in 0 for in equation (1).

      We can then calculate the expected RSCUS using equation (1) (using notation and equation (6) for the two codon case. For simplicity assume, we are only considering a gene of average expression (defined as . Assume in this case that NNG is the reference codon .

      This shows that the expected value of RSCUS for a two-codon amino acid is expected to increase as the strength of selection increases, which is desired. Note that in Gilchrist et al. is formulated in terms of selection against a codon relative to the reference, such that a negative value represents that a codon is favored relative to the reference. If (i.e. selection does not favor either codon), then . Also note that the expected RSCUS does not remain independent of the mutation bias. This means that even if (i.e. the strength of natural selection) does not change between species, changes to the strength and direction of mutation bias across species could impact RSCUS. Assuming my math is right, I think one needs to be cautious when interpreting CAIS as representative of the differences in the efficiency of selection across species except under very particular circumstances. One such case could be when it is known that mutation bias varies little across the species of interest. Looking at the species used in this manuscript, most of them have a GC content ranging around 0.41, so I suspect their results are okay.

      Although I have not done so, I am sure this could be extended to the 4 and 6 codon amino acids.

      Another minor weakness of this work is that although the method is generally applicable to any species with an annotated genome and the code is publicly available, the code itself contains hard-coded values for GC% and amino acid frequencies across the 118 vertebrates. The lack of a more flexible tool may make it difficult for less computationally-experienced researchers to take advantage of this method.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The image analysis pipeline is tested in analysing microscopy imaging data of gastruloids of varying sizes, for which an optimised protocol for in toto image acquisition is established based on whole mount sample preparation using an optimal refractive index matched mounting media, opposing dual side imaging with two-photon microscopy for enhanced laser penetration, dual view registration, and weighted fusion for improved in toto sample data representation. For enhanced imaging speed in a two-photon microscope, parallel imaging was used, and the authors performed spectral unmixing analysis to avoid issues of signal cross-talk.

      In the image analysis pipeline, different pre-treatments are done depending on the analysis to be performed (for nuclear segmentation - contrast enhancement and normalisation; for quantitative analysis of gene expression - corrections for optical artifacts inducing signal intensity variations). StarDist3D was used for the nuclear segmentation. The study analyses properties of gastruloid nuclear density, patterns of cell division, morphology, deformation, and gene expression.

      Strengths:

      The methods developed are sound, well described, and well-validated, using a sample challenging for microscopy, gastruloids. Many of the established methods are very useful (e.g. registration, corrections, signal normalisation, lazy loading bioimage visualisation, spectral decomposition analysis), facilitate the development of quantitative research, and would be of interest to the wider scientific community.

      We thank the reviewer for this positive feedback.

      Weaknesses:

      A recommendation should be added on when or under which conditions to use this pipeline.

      We thank the reviewer for this valuable feedback, which will be addressed in the revision. In general, the pipeline is applicable to any tissue, but it is particularly useful for large and dense 3D samples—such as organoids, embryos, explants, spheroids, or tumors—that are typically composed of multiple cell layers and have a thickness greater than 50 µm.

      The processing and analysis pipeline are compatible with any type of 3D imaging data (e.g. confocal, 2 photon, light-sheet, live or fixed).

      - Spectral unmixing to remove signal cross-talk of multiple fluorescent targets is typically more relevant in two-photon imaging due to the broader excitation spectra of fluorophores compared to single-photon imaging. In confocal or light-sheet microscopy, alternating excitation wavelengths often circumvents the need for unmixing. Spectral decomposition performs even better with true spectral detectors; however, these are usually not non-descanned detectors, which are more appropriate for deep tissue imaging. Our approach demonstrates that simultaneous cross-talk-free four-color two-photon imaging can be achieved in dense 3D specimens with four non-descanned detectors and co-excitation by just two laser lines. Depending on the dispersion in optically dense samples, depth-dependent apparent emission spectra need to be considered.

      - Nuclei segmentation using our trained StarDist3D model is applicable to any system under two conditions: (1) the nuclei exhibit a star-convex shape, as required by the StarDist architecture, and (2) the image resolution is sufficient in XYZ to allow resampling. The exact sampling required is object- and system-dependent, but the goal is to achieve nearly isotropic objects with diameters of approximately 15 pixels while maintaining image quality. In practice, images containing objects that are natively close to or larger than 15 pixels in diameter should segment well after resampling. Conversely, images with objects that are significantly smaller along one or more dimensions will require careful inspection of the segmentation results.

      - Normalization is broadly applicable to multicolor data when at least one channel is expected to be ubiquitously expressed within its domain. Wavelength-dependent correction requires experimental calibration using a ubiquitous signal at each wavelength. Importantly, this calibration only needs to be performed once for a given set of experimental conditions (e.g., fluorophores, tissue type, mounting medium).

      - Multi-scale analysis of gene expression and morphometrics is applicable to any 3D multicolor image. This includes both the 3D visualization tools (Napari plugins) and the various analytical plots (e.g., correlation plots, radial analysis). Multi-scale analysis can be performed even with imperfect segmentation, as long as segmentation errors tend to cancel out when averaged locally at the relevant spatial scale. However, systematic errors—such as segmentation uncertainty along the Z-axis due to strong anisotropy—may accumulate and introduce bias in downstream analyses. Caution is advised when analyzing hollow structures (e.g., curved epithelial monolayers with large cavities), as the pipeline was developed primarily for 3D bulk tissues, and appropriate masking of cavities would be needed.
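      The resampling criterion given above for nuclei segmentation (nearly isotropic objects of roughly 15 pixels in diameter) can be turned into per-axis zoom factors. The following is a minimal sketch of that arithmetic, not the authors' code: the function names, the example voxel size, and the example nucleus diameter are all hypothetical.

```python
def isotropic_zoom_factors(voxel_size_um, nucleus_diameter_um, target_diameter_px=15.0):
    """Per-axis zoom factors (Z, Y, X) that make voxels isotropic and
    bring a typical nucleus to about `target_diameter_px` pixels across.

    voxel_size_um: physical voxel size along (Z, Y, X), in micrometres.
    nucleus_diameter_um: typical nucleus diameter, in micrometres.
    """
    if nucleus_diameter_um <= 0 or target_diameter_px <= 0:
        raise ValueError("diameters must be positive")
    # Voxel size (in micrometres) that yields the target diameter in pixels.
    target_voxel_um = nucleus_diameter_um / target_diameter_px
    # A factor > 1 upsamples the axis, a factor < 1 downsamples it.
    return tuple(size / target_voxel_um for size in voxel_size_um)


def resampled_shape(shape, factors):
    """Array shape expected after applying the zoom factors."""
    return tuple(max(1, round(n * f)) for n, f in zip(shape, factors))


# Hypothetical acquisition: 2 um steps in Z, 0.5 um pixels in XY, 7.5 um nuclei.
factors = isotropic_zoom_factors((2.0, 0.5, 0.5), nucleus_diameter_um=7.5)
print(factors)                                     # (4.0, 1.0, 1.0)
print(resampled_shape((100, 512, 512), factors))   # (400, 512, 512)
```

      Factors like these could then be passed to a generic resampling routine such as `scipy.ndimage.zoom`. A large factor along one axis, as for Z in this example, is the situation in which the text recommends inspecting the segmentation carefully.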

      Reviewer #2 (Public review):

      Summary:

      This study presents an integrated experimental and computational pipeline for high-resolution, quantitative imaging and analysis of gastruloids. The experimental module employs dual-view two-photon spectral imaging combined with optimized clearing and mounting techniques to image whole-mount immunostained gastruloids. This approach enables the acquisition of comprehensive 3D images that capture both tissue-scale and single-cell level information.

      The computational module encompasses both pre-processing of acquired images and downstream analysis, providing quantitative insights into the structural and molecular characteristics of gastruloids. The pre-processing pipeline, tailored for dual-view two-photon microscopy, includes spectral unmixing of fluorescence signals using depth-dependent spectral profiles, as well as image fusion via rigid 3D transformation based on content-based block-matching algorithms. Nuclei segmentation was performed using a custom-trained StarDist3D model, validated against 2D manual annotations, and achieving an F1 score of 85+/-3% at a 50% intersection-over-union (IoU) threshold. Another custom-trained StarDist3D model enabled accurate detection of proliferating cells and the generation of 3D spatial maps of nuclear density and proliferation probability. Moreover, the pipeline facilitates detailed morphometric analysis of cell density and nuclear deformation, revealing pronounced spatial heterogeneities during early gastruloid morphogenesis.
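      To make the reported segmentation metric concrete, the sketch below computes an F1 score at a 50% IoU threshold on toy data. It is an illustration under simplifying assumptions, not the authors' evaluation code: objects are represented as sets of pixel coordinates, matching is greedy by decreasing IoU, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two objects given as sets of coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def f1_at_iou(ground_truth, predictions, threshold=0.5):
    """F1 score where a prediction counts as a true positive only if it is
    matched one-to-one to a ground-truth object with IoU >= threshold."""
    # All candidate pairs, best overlap first.
    pairs = sorted(
        ((iou(g, p), gi, pi)
         for gi, g in enumerate(ground_truth)
         for pi, p in enumerate(predictions)),
        reverse=True,
    )
    matched_gt, matched_pred = set(), set()
    for score, gi, pi in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if gi not in matched_gt and pi not in matched_pred:
            matched_gt.add(gi)
            matched_pred.add(pi)
    tp = len(matched_gt)
    fp = len(predictions) - tp   # predictions without a matching object
    fn = len(ground_truth) - tp  # objects that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: one good match (IoU 0.75), one missed object, one spurious one.
gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
print(f1_at_iou(gt, pred))  # 0.5
```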

      All computational tools developed in this study are released as open-source, Python-based software.

      Strengths:

      The authors applied two-photon microscopy to whole-mount deep imaging of gastruloids, achieving in toto visualization at single-cell resolution. By combining spectral imaging with an unmixing algorithm, they successfully separated four fluorescent signals, enabling spatial analysis of gene expression patterns.

      The entire computational workflow, from image pre-processing to segmentation with a custom-trained StarDist3D model and subsequent quantitative analysis, is made available as open-source software. In addition, user-friendly interfaces are provided through the open-source, community-driven Napari platform, facilitating interactive exploration and analysis.

      We thank the reviewer for this positive feedback.

      Weaknesses:

      The computational module appears promising. However, the analysis pipeline has not been validated on datasets beyond those generated by the authors, making it difficult to assess its general applicability.

      We agree that applying our analysis pipeline to published datasets—particularly those acquired with different imaging systems—would be valuable. However, only a few high-resolution datasets of large organoid samples are publicly available, and most of these either lack multiple fluorescence channels or represent 3D hollow structures. Our computational pipeline consists of several independent modules: spectral filtering, dual-view registration, local contrast enhancement, 3D nuclei segmentation, image normalization based on a ubiquitous marker, and multiscale analysis of gene expression and morphometrics.

      Spectral filtering has already been applied in other systems (e.g. [7] and [8]), but is here extended to account for imaging depth-dependent apparent emission spectra of the different fluorophores. In our pipeline, we provide code to run spectral filtering on multichannel images, integrated in Python. In order to apply the spectral filtering algorithm utilized here, spectral patterns of each fluorophore need to be calibrated as a function of imaging depth, which depend on the specific emission windows and detector settings of the microscope.

      Image normalization using a wavelength-dependent correction also requires calibration on a given imaging setup to measure the difference in signal decay among the different fluorophore species. To our knowledge, the calibration procedures for spectral-filtering and our image-normalization approach have not been performed previously in 3D samples, which is why validation on published datasets is not readily possible. Nevertheless, they are described in detail in the Methods section, and the code used—from the calibration measurements to the corrected images—is available open-source at the Zenodo link in the manuscript.

      Dual-view registration, local contrast enhancement, and multiscale analysis of gene expression and morphometrics are not limited to organoid data or our specific imaging modalities. If we identify suitable datasets to validate these modules, we will include them in the revised manuscript.

      To evaluate our 3D nuclei segmentation model, we plan to test it on diverse systems, including gastruloids stained with the nuclear marker DRAQ5 from Moos et al. [1]; breast cancer spheroids, primary pancreatic ductal adenocarcinoma organoids, human colon organoids, and HCT116 monolayers from Ong et al. [2]; and zebrafish tissues imaged by confocal microscopy from Li et al. [3]. These datasets were acquired using either light-sheet or confocal microscopy, with varying imaging parameters (e.g., objective lens, pixel size, staining method).

      Preliminary results are promising (see Author response image 1). We will provide quantitative comparisons of our model’s performance on these datasets, using annotations or reference predictions provided by the original authors where available.

      Author response image 1.

      Qualitative comparison of our custom StarDist3D segmentation strategy on diverse published 3D nuclei datasets. We show one slice from the XY plane for simplicity. (a) Gastruloid stained with the nuclear marker DRAQ5, imaged with an open-top dual-view and dual-illumination LSM [1]. (b) Breast cancer spheroid [2]. (c) Primary pancreatic ductal adenocarcinoma organoids imaged with confocal microscopy [2]. (d) Human colon organoid imaged with an LSM laser scanning confocal microscope [2]. (e) Monolayer HCT116 cells imaged with an LSM laser scanning confocal microscope [2]. (f) Fixed zebrafish embryo stained for nuclei and imaged with a Zeiss LSM 880 confocal microscope [3].

      Besides, the nuclei segmentation component lacks benchmarking against existing methods.

      We agree with the reviewer that a benchmark against existing segmentation methods would be very useful. We tried different pre-trained models:

      - CellPose, which we tested in a previous paper ([4]) and which performed poorly compared to our trained StarDist3D model.

      - DeepStar3D ([2]) is only available in the software 3DCellScope. We could not benchmark the model on our data, because the free and accessible version of the software is limited to small datasets. An image of a single whole-mount gastruloid with one channel, having dimensions (347,467,477) was too large to be processed, see screenshot below. The segmentation model could not be extracted from the source code and tested externally because the trained DeepStar3D weights are encrypted.

      Author response image 2.

      Screenshot of the 3DCellScope software. We could not perform 3D nuclei segmentation of a whole-mount gastruloid because the image size was too large to be processed.

      - AnyStar ([5]), which is a model trained from the StarDist3D architecture, did not perform well on our data because of the heterogeneous stainings. Basic pre-processing such as median and Gaussian filtering did not improve the results and led to wrong segmentation of touching nuclei. AnyStar was shown to segment colon organoids well in Ong et al., 2025 ([2]), but the nuclei there were more homogeneously stained. Our Hoechst staining displays bright chromatin spots that are incorrectly labeled as individual nuclei.

      - Cellos ([6]), another model trained from StarDist3D, also did not perform well. The objects used for training and for validating the results are sparse and not touching, so the predicted segmentation has many false negatives, even when lowering the probability threshold to detect more objects. Additionally, the network was trained with an anisotropy of (9,1,1), based on images with low z resolution, so it performed poorly on almost isotropic images. Adapting our images to the network’s anisotropy results in an imprecise segmentation that cannot be used to measure 3D nuclei deformations.

      We tried both Cellos and AnyStar predictions on a gastruloid image from Fig. S2 of our main manuscript. Author response image 3 displays the results qualitatively compared to our trained model Stardist-tapenade. For the revision of the paper, we will perform a comprehensive benchmark of these state-of-the-art routines, including quantitative assessment of the performance.

      Author response image 3.

      Qualitative comparison of two published segmentation models versus our model. We show one slice from the XY plane for simplicity. Segmentations are displayed with their contours only. (Top left) Gastruloid stained with Hoechst, image extracted from Fig S2 of our manuscript. (Top right) Same image overlaid with the prediction from the Cellos model, showing many false negatives. (Bottom left) Same image overlaid with the prediction from our Stardist-tapenade model. (Bottom right) Same image overlaid with the prediction from the AnyStar model; false positives are indicated with a red arrow.

      Appraisal:

      The authors set out to establish a quantitative imaging and analysis pipeline for gastruloids using dual-view two-photon microscopy, spectral unmixing, and a custom computational framework for 3D segmentation and gene expression analysis. This aim is largely achieved. The integration of experimental and computational modules enables high-resolution in toto imaging and robust quantitative analysis at the single-cell level. The data presented support the authors' conclusions regarding the ability to capture spatial patterns of gene expression and cellular morphology across developmental stages.

      Impact and utility:

      This work presents a compelling and broadly applicable methodological advance. The approach is particularly impactful for the developmental biology community, as it allows researchers to extract quantitative information from high-resolution images to better understand morphogenetic processes. The data are publicly available on Zenodo, and the software is released on GitHub, making them highly valuable resources for the community.

      We thank the reviewer for this positive feedback.

      Reviewer #3 (Public review):

      Summary

      The paper presents an imaging and analysis pipeline for whole-mount gastruloid imaging with two-photon microscopy. The presented pipeline includes spectral unmixing, registration, segmentation, and a wavelength-dependent intensity normalization step, followed by quantitative analysis of spatial gene expression patterns and nuclear morphometry on a tissue level. The utility of the approach is demonstrated by several experimental findings, such as establishing spatial correlations between local nuclear deformation and tissue density changes, as well as the radial distribution pattern of mesoderm markers. The pipeline is distributed as a Python package, notebooks, and multiple napari plugins.

      Strengths

      The paper is well-written with detailed methodological descriptions, which I think would make it a valuable reference for researchers performing similar volumetric tissue imaging experiments (gastruloids/organoids). The pipeline itself addresses many practical challenges, including resolution loss within tissue, registration of large volumes, nuclear segmentation, and intensity normalization. Especially the intensity decay measurements and wavelength-dependent intensity normalization approach using nuclear (Hoechst) signal as reference are very interesting and should be applicable to other imaging contexts. The morphometric analysis is equally well done, with the correlation between nuclear shape deformation and tissue density changes being an interesting finding. The paper is quite thorough in its technical description of the methods (which are a lot), and their experimental validation is appropriate. Finally, the provided code and napari plugins seem to be well done (I installed a selected list of the plugins and they ran without issues) and should be very helpful for the community.

      We thank the reviewer for this positive feedback and appreciation of our work.

      Weaknesses

      I don't see any major weaknesses, and I would only have two issues that I think should be addressed in a revision:

      (1) The demonstration notebooks lack accompanying sample datasets, preventing users from running them immediately and limiting the pipeline's accessibility. I would suggest including a (selective) demo dataset that can be used to run the notebooks (e.g. for spectral unmixing) and/or providing easily accessible demo input sample data for the napari plugins (I saw that there is some sample data for the processing plugin, so this maybe could already be used for the notebooks?).

      We thank the reviewer for this relevant suggestion. The 7 notebooks were updated to automatically download sample tests. The different parts of the pipeline can now be run immediately: https://github.com/GuignardLab/tapenade/tree/chekcs_on_notebooks/src/tapenade/notebooks

      (2) The results for the morphometric analysis (Figure 4) seem to be only shown in lateral (xy) views without the corresponding axial (z) views. I would suggest adding this to the figure and showing the density/strain/angle distributions for those axial views as well.

      We agree with the reviewer that a morphometric analysis based on the axial views would be informative and plan to perform this analysis for the revision.

      (1) Moos, F., Suppinger, S., de Medeiros, G., Oost, K.C., Boni, A., Rémy, C., Weevers, S.L., Tsiairis, C., Strnad, P. and Liberali, P., 2024. Open-top multisample dual-view light-sheet microscope for live imaging of large multicellular systems. Nature Methods, 21(5), pp.798-803.

      (2) Ong, H.T., Karatas, E., Poquillon, T., Grenci, G., Furlan, A., Dilasser, F., Mohamad Raffi, S.B., Blanc, D., Drimaracci, E., Mikec, D. and Galisot, G., 2025. Digitalized organoids: integrated pipeline for high-speed 3D analysis of organoid structures using multilevel segmentation and cellular topology. Nature Methods, 22(6), pp.1343-1354.

      (3) Li, L., Wu, L., Chen, A., Delp, E.J. and Umulis, D.M., 2023. 3D nuclei segmentation for multi-cellular quantification of zebrafish embryos using NISNet3D. Electronic Imaging, 35, pp.1-9.

      (4) Vanaret, J., Dupuis, V., Lenne, P. F., Richard, F., Tlili, S., & Roudot, P. (2023). A detector-independent quality score for cell segmentation without ground truth in 3D live fluorescence microscopy. IEEE Journal of Selected Topics in Quantum Electronics, 29(4: Biophotonics), 1-12.

      (5) Dey, N., Abulnaga, M., Billot, B., Turk, E. A., Grant, E., Dalca, A. V., & Golland, P. (2024). AnyStar: Domain randomized universal star-convex 3D instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 7593-7603).

      (6) Mukashyaka, P., Kumar, P., Mellert, D. J., Nicholas, S., Noorbakhsh, J., Brugiolo, M., ... & Chuang, J. H. (2023). High-throughput deconvolution of 3D organoid dynamics at cellular resolution for cancer pharmacology with Cellos. Nature Communications, 14(1), 8406.

      (7) Rakhymzhan, A., Leben, R., Zimmermann, H., Günther, R., Mex, P., Reismann, D., ... & Niesner, R. A. (2017). Synergistic strategy for multicolor two-photon microscopy: application to the analysis of germinal center reactions in vivo. Scientific reports, 7(1), 7101.

      (8) Dunsing, V., Petrich, A., & Chiantia, S. (2021). Multicolor fluorescence fluctuation spectroscopy in living cells via spectral detection. Elife, 10, e69687.

    1. Author response:

      Reviewer 1:

      There are no significant weaknesses to signal in the manuscript. However, in order to fully conclude that there is no obvious advantage for the linguistic dimension in neonates, it would have been most useful to test a third condition in which the two dimensions were pitted against each other, that is, in which they provide conflicting information as to the boundaries of the words comprised in the artificial language. This last condition would have allowed us to determine whether statistical learning weighs linguistic and non-linguistic features equally, or whether phonetic content is preferentially processed.

      We appreciate the reviewer's suggestion that a stream with conflicting information would provide valuable insights. In the present study, we started with a simpler case involving two orthogonal features (i.e., phonemes and voices), with one feature being informative and the other uninformative, and we found similar learning capacities for both. Future work should explore whether infants, and humans more broadly, can simultaneously track regularities in multiple speech features. However, creating a stream with two conflicting statistical structures is challenging. To use neural entrainment, the two features must lead to segmentation at different chunk sizes so that their effects lead to changes in power/PLV at different frequencies, for instance, using duplets for the voice dimension and triplets for the linguistic dimension (or vice versa). Consequently, the two dimensions would not be directly comparable within the same participant in terms of the number of distinguishable syllables/voices, memory demand, or SNR, given the 1/f decrease in amplitude of background EEG activity. This would require comparisons between two distinct groups, counterbalancing chunk size and linguistic/non-linguistic dimension. Considering the test phase, words for one dimension would have been part-words for the other dimension. As we are measuring differences and not preferences, interpreting the results would also have been difficult. Additionally, it may be difficult to find a sufficient number of clearly discriminable voices for such a design (triplets imply 12 voices). Therefore, an entirely different experimental paradigm would need to be developed.
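      The point about different chunk sizes producing entrainment at different frequencies follows from simple arithmetic: the chunk rate is the syllable rate divided by the number of syllables per chunk. A minimal sketch, assuming the 4 Hz syllabic rate stated in this response (the function name is hypothetical):

```python
from fractions import Fraction


def chunk_rate_hz(syllable_rate_hz, chunk_size):
    """Presentation rate of chunks (e.g. words) built from `chunk_size`
    consecutive syllables delivered at `syllable_rate_hz`."""
    return Fraction(syllable_rate_hz) / chunk_size


syllable_rate = 4  # Hz, syllabic rate of the stream
print(chunk_rate_hz(syllable_rate, 2))         # 2   -> duplets at 2 Hz
print(float(chunk_rate_hz(syllable_rate, 3)))  # triplets at about 1.33 Hz
```

      Because background EEG amplitude falls off roughly as 1/f, a duplet response at 2 Hz and a triplet response near 1.33 Hz would sit on different noise floors, which is part of why the two dimensions would not be directly comparable.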

      If such a design were tested, one possibility is that the regularities for the two dimensions are calculated in parallel, in line with the idea that the calculation of statistical regularities is a ubiquitous implicit mechanism (see Benjamin et al., 2024, for a proposed neural mechanism). Yet, similar to our present study, possibly only phonetic features would be used as word candidates. Another possibility is that only one informative feature would be explicitly processed at a time due to the serial nature of perceptual awareness, which may prioritise one feature over the other.

      Note: The reviewer’s summary contains a typo: the syllabic rate is 4 Hz (not 2 Hz) and the word rate is 2 Hz (not 4 Hz).

      Reviewer 2:

      N400: I am skeptical regarding the interpretation of the phoneme-specific ERP effect as a precursor of the N400 and would suggest toning it down. While the authors are correct in that infant ERP components are typically slower and more posterior compared to adult components, and the observed pattern is hence consistent with an adult N400, at the same time, it could also be a lot of other things. On a functional level, I can't follow the authors' argument as to why a violation in phoneme regularity should elicit an N400, since there is no evidence for any semantic processing involved. In sum, I think there is just not enough evidence from the present paradigm to confidently call it an N400.

      The reviewer is correct that we cannot definitively determine the type of processing reflected by the ERP component that appears when neonates hear a triplet after exposure to a stream with phonetic regularities. We interpreted this component as a precursor to the N400, based on prior findings in speech segmentation tasks without semantic content, where a ~400 ms component emerged when adult participants recognised pseudowords (Sander et al., 2002) or during structured streams of syllables (Cunillera et al., 2006, 2009). Additionally, the component we observed had a similar topography and timing to those labelled as N400 in infant studies, where semantic processing was involved (Parise et al., 2010; Friedrich & Friederici, 2011).

      Given our experimental design, the difference we observed must be related to the type of regularity during familiarisation (either phonemes or voices). Thus, we interpreted this component as reflecting lexical search, a process which could be triggered by a linguistic structure but which would not be relevant to a non-linguistic regularity such as voices. However, we are open to alternative interpretations. In any case, this difference between the two streams reveals that computing regularities based on phonemes versus voices does not lead to the same processes. We will revise and tone down the corresponding part of the discussion to clarify that it is just a possible interpretation of the results.

      Female and male voices: Why did the authors choose to include male and female voices? While using both female and male stimuli of course leads to a higher generalizability, it also introduces a second dimension for one feature that is not present for this other (i.e., phoneme for Experiment 1 and voice identity plus gender for Experiment 2). Hence, couldn't it also be that the infants extracted the regularity with which one gender voice followed the other? For instance, in List B, in the words, one gender is always followed by the other (M-F or F-M), while in 2/3 of the part-words, the gender is repeated (F-F and M-M). Wouldn't you expect the same pattern of results if infants learned regularities based on gender rather than identity?

      We used three female and three male voices to maximise acoustic variability. The streams were synthesised using MBROLA, which provides a limited set of artificial voices. Indeed, there were not enough French voices of acceptable quality, so we also used two Italian voices (the phonemes used existed in both Italian and French).

      Voices differ in timbre, and female voices tend to be higher pitched. However, it is sometimes difficult to categorise low-pitched female voices and high-pitched male voices. Given that gender may be an important factor in infants' speech perception (newborns, for instance, prefer female voices at birth), we conducted tests to assess whether this dimension could have influenced our results.  

      We first quantified the transitional probabilities matrices during the structured stream of Experiment 2, considering that there are only two types of voices: Female and Male.  

      For List A, all transition probabilities are equal to 0.5 (P(M|F), P(F|M), P(M|M), P(F|F)), resulting in flat TPs throughout the stream (see Author response image 1, top). Therefore, we would not expect neural entrainment at the word rate (2 Hz), nor would we anticipate ERP differences between the presented duplets in the test phase.

      For List B, P(M|F)=P(F|M)=0.66 while P(M|M)=P(F|F)=0.33. However, this does not produce a regular pattern of TP drops throughout the stream (see Author response image 1, bottom). As a result, strong neural entrainment at 2 Hz was unlikely, although some degree of entrainment might have occasionally occurred because some drops fell at a 2 Hz frequency. Regarding the test phase, all three Words and only one Part-word presented alternating patterns (TP=0.66). Therefore, the difference in the ERPs between Words and Part-words in List B might be attributed to gender alternation.

      However, it seems unlikely that gender alternation alone explains the entire pattern of results, as the effect is inconsistent and appears in only one of the lists. To rule out this possibility, we analysed the effects in each list separately.

      Author response image 1.

      Transition probabilities (TPs) across the structured stream in Experiment 2, considering voices processed by gender (Female or Male). Top: List A. Bottom: List B.
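
      The gender-level transition probabilities quoted above for Lists A and B can be reproduced with a short calculation. The sketch below is illustrative only: the duplet compositions in `list_a` and `list_b` are hypothetical examples chosen to match the reported values, not the actual lists, and the function assumes that every duplet is equally frequent and is followed by any other duplet with equal probability.

```python
from collections import defaultdict
from fractions import Fraction


def gender_tps(words):
    """Exact P(next gender | current gender) for a stream of concatenated duplets.

    Assumptions (not taken from the source): each duplet is equally frequent
    and is followed by any *other* duplet with equal probability.
    """
    counts = defaultdict(Fraction)  # (current, next) -> accumulated weight
    totals = defaultdict(Fraction)  # current -> accumulated weight
    n = len(words)
    for i, (first, second) in enumerate(words):
        # Within-duplet transition is deterministic.
        counts[(first, second)] += 1
        totals[first] += 1
        # Between-duplet transitions are spread evenly over the other duplets.
        for j, (next_first, _) in enumerate(words):
            if j != i:
                counts[(second, next_first)] += Fraction(1, n - 1)
        totals[second] += 1
    return {pair: counts[pair] / totals[pair[0]] for pair in counts}


# Hypothetical duplets (voice genders only), chosen for illustration.
list_a = [("F", "F"), ("M", "M"), ("F", "M")]
list_b = [("F", "M"), ("M", "F"), ("F", "M")]

tps_a = gender_tps(list_a)
tps_b = gender_tps(list_b)
print(tps_a)
print(tps_b)
```

      Under these assumptions, every gender transition in the hypothetical List A has probability 1/2, whereas in the hypothetical List B alternations have probability 2/3 and repetitions 1/3, in line with the rounded values 0.66 and 0.33 given in the response.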

      We computed the mean activation within the time windows and electrodes of interest and compared the effects of word type and list using a two-way ANOVA. For the difference between Words and Part-words over the positive cluster, we observed a main effect of word type (F(1,31) = 5.902, p = 0.021), with no effects of list or interactions (p > 0.1). Over the negative cluster, we again observed a main effect of word type (F(1,31) = 10.916, p = 0.0016), with no effects of list or interactions (p > 0.1). See Author response image 2.  

      Author response image 2.

      Difference in ERP voltage (Words – Part-words) for the two lists (A and B); W=Words; P=Part-words.

      We conducted a similar analysis for neural entrainment during the structured stream on voices. A comparison of entrainment at 2 Hz between participants who completed List A and List B showed no significant differences (t(30) = -0.27, p = 0.79). A test against zero for each list indicated significant entrainment in both cases (List A: t(17) = 4.44, p = 0.00036; List B: t(13) = 3.16, p = 0.0075). See Author response image 3.

      Author response image 3.

      Neural entrainment at 2 Hz during the structured stream of Experiment 2 for Lists A and B.

      Words entrainment over occipital electrodes: Do you have any idea why the duplet entrainment effect occurs over the electrodes it does, in particular over the occipital electrodes (which seems a bit unintuitive given that this is a purely auditory experiment with sleeping neonates).

      Neural entrainment might be considered as a succession of evoked responses induced by the stream. After applying an average reference in high-density EEG recordings, the auditory ERP in neonates typically consists of a central positivity and a posterior negativity, with a source located at the electrical zero in a single-dipole model, i.e. approximately in the superior temporal region (Dehaene-Lambertz & Dehaene, 1994). In adults, because of the average reference (i.e. the sum of voltages is equal to zero at each time point) and because the electrodes cannot capture the negative pole of the auditory response, the negativity is distributed around the head. In infants, however, the brain is higher within the skull, allowing for a more accurate recording of the negative pole of the auditory ERP (see Author response image 4 for the location of electrodes in an infant head model).

      Besides the posterior electrodes, we can see some entrainment on more anterior electrodes that probably corresponds to the positive pole of the auditory ERP.

      Author response image 4.

      International 10–20 sensors' location on the skull of an infant template, with the underlying 3-D reconstruction of the grey-white matter interface and projection of each electrode to the cortex. Computed across 16 infants (from Kabdebon et al., NeuroImage, 2014). The O1, O2, T5, and T6 electrodes project lower than in adults.

      Reviewer 3:

      (1) While it's true that voice is not essential for language (i.e., sign languages are implemented over gestures; the use of voices to produce non-linguistic sounds, like laughter), it is a feature of spoken languages. Thus I'm not sure if we can really consider this study as a comparison between linguistic and non-linguistic dimensions. In turn, I'm not sure that these results show that statistical learning at birth operates on non-linguistic features, being voices a linguistic dimension at least in spoken languages. I'd like to hear the authors' opinions on this.

      On one hand, it has been shown that statistical learning (SL) operates across multiple modalities and domains in human adults and animals. On the other hand, SL is considered essential for infants to begin parsing speech. Therefore, we aimed to investigate whether SL capacities at birth are more effective on linguistic dimensions of speech, potentially as a way to promote language learning.

      We agree with the reviewer that voices play an important role in communication (e.g., for identifying who is speaking); however, they do not contribute to language structure or meaning, and listeners are expected to normalize across voices to accurately perceive phonemes and words. Thus, voices are speech features but not linguistic features. Additionally, in natural speech, there are no abrupt voice changes within a word as in our experiment; instead, voice changes typically occur on a longer timescale and involve only a limited number of voices, such as in a dialogue. Therefore, computing regularities based on voice changes would not be useful in real-life language learning. We considered that contrasting syllables and voices was an elegant way to test SL beyond its linguistic dimension, as the experimental paradigm is identical in both experiments.  

      Along the same line, in the Discussion section, the present results are interpreted within a theoretical framework showing statistical learning in auditory non-linguistic (string of tones, music) and visual domains as well as visual and other animal species. I'm not sure if that theoretical framework is the right fit for the present results.

      (2) I'm not sure whether the fact that we see parallel and independent tracking of statistics in the two dimensions of speech at birth indicates that newborns would be able to do so in all the other dimensions of the speech. If so, what other dimensions are the authors referring to?

      The reviewer is correct that demonstrating the universality of SL requires testing additional modalities and acoustic dimensions. However, we postulate that SL is grounded in a basic mechanism of long-term associative learning, as proposed in Benjamin et al. (2024), which relies on a slow decay in the representation of a given event. This simple mechanism, capable of operating on any representational output, accounts for many types of sequence learning reported in the literature (Benjamin et al., in preparation). We will revise the discussion section to clarify this theoretical framework.

      (3) Lines 341-345: Statistical learning is an evolutionarily ancient learning mechanism, but I do not think that the present results show it. This is a study on human neonates and adults; no other animal species are involved, so I do not see a connection with the evolutionary history of statistical learning. It would be much more interesting to make claims on the ontogeny (rather than phylogeny) of statistical learning, and what regularities newborns are able to detect right after birth. I believe that this is one of the strengths of this work.

      We did not intend to make claims about the phylogeny of SL. Since SL appears to be a learning mechanism shared across species, we use it as a framework to suggest that SL may arise from general operational principles applicable to diverse neural networks. Thus, while it is highly useful for language acquisition, it is not specific to it. We will revise this section to tone down our claims.  

      (4) The description of the stimuli in Lines 110-113 is a bit confusing. In Experiment 1, e.g., "pe" and "tu" are both uttered by the same voice, correct? ("random voice each time" is confusing). Whereas in Experiment 2, e.g., "pe" and "tu" are uttered by different voices, for example, "pe" by yellow voice and "tu" by red voice. If this is correct, then I recommend the authors to rephrase this section to make it more clear.

      To clarify, in Experiment 1, the voices were randomly assigned to each syllable, with the constraint that no voice was repeated consecutively. This means that syllables within the same word were spoken by different voices, and each syllable was heard with various voices throughout the stream. As a result, neonates had to retrieve the words based solely on syllabic patterns, without relying on consistent voice associations or specific voice relationships.

      In Experiment 2, the design was orthogonal: while the syllables were presented in a random order, the voices followed a structured pattern. Similar to Experiment 1, each syllable (e.g., “pe” and “tu”) was spoken by different voices. The key difference is that in Experiment 2, the structured regularities were applied to the voices rather than the syllables. In other words, the “green” voice was always followed by the “red” voice for example but uttered different syllables.

      We will revise the methods section to clarify these important points.

      (5) Line 114: the sentence "they should compute a 36 x 36 TPs matrix relating each acoustic signal, with TPs alternating between 1/6 within words and 1/12 between words" is confusing as it seems like there are different acoustic signals. Can the authors clarify this point?

      Thank you for highlighting this point. To clarify, our suggestion is that neonates might not track regularities between phonemes and voices as separate features. Instead, they may treat each syllable-voice combination as a distinct item: for example, "pe" spoken by the "yellow" voice is one item, while "pe" spoken by the "red" voice is another. Under this scenario, there would be a total of 36 unique items (6 syllables × 6 voices), and infants would need to track regularities between these 36 combinations.
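
      The arithmetic behind the 36 items and the quoted TPs of 1/6 and 1/12 can be sketched as follows. This is an illustration under stated assumptions, not the authors' computation: the syllable and voice labels are placeholders, the syllable-level TPs are assumed to be 1 within a word and 1/2 between words, and the voice of the next item is assumed to be drawn uniformly from all six voices, independently of the syllable.

```python
from fractions import Fraction
from itertools import product

# Placeholder labels; the actual syllables and voices are not listed here.
syllables = [f"syl{i}" for i in range(1, 7)]
voices = [f"voice{i}" for i in range(1, 7)]

# Each syllable-voice combination is treated as one distinct item.
items = list(product(syllables, voices))


def item_tp(syllable_tp, n_voice_options):
    """TP between two syllable-voice items, assuming the next voice is drawn
    uniformly from n_voice_options, independently of the syllable."""
    return syllable_tp * Fraction(1, n_voice_options)


within_word = item_tp(Fraction(1), 6)      # next syllable fully predictable
between_word = item_tp(Fraction(1, 2), 6)  # two equally likely next words (assumed)

print(len(items), within_word, between_word)
```

      The `n_voice_options` parameter is exposed so that other assumptions about how the next voice is drawn can be explored.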

      We will rephrase this sentence in the manuscript to make it clearer.

    1. Author Response

      eLife assessment

      This study presents potentially valuable results on glutamine-rich motifs in relation to protein expression and alternative genetic codes. The author's interpretation of the results is so far only supported by incomplete evidence, due to a lack of acknowledgment of alternative explanations, missing controls and statistical analysis and writing unclear to non experts in the field. These shortcomings could be at least partially overcome by additional experiments, thorough rewriting, or both.

      We thank both the Reviewing Editor and Senior Editor for handling this manuscript and will submit our revised manuscript after the reviewed preprint is published by eLife.  

      Reviewer #1 (Public Review):

      Summary

      This work contains 3 sections. The first section describes how protein domains with SQ motifs can increase the abundance of a lacZ reporter in yeast. The authors call this phenomenon autonomous protein expression-enhancing activity, and this finding is well supported. The authors show evidence that this increase in protein abundance and enzymatic activity is not due to changes in plasmid copy number or mRNA abundance, and that this phenomenon is not affected by mutants in translational quality control. It was not completely clear whether the increased protein abundance is due to increased translation or to increased protein stability.

      In section 2, the authors performed mutagenesis of three N-terminal domains to study how protein sequence changes protein stability and enzymatic activity of the fusions. These data are very interesting, but this section needs more interpretation. It is not clear if the effect is due to the number of S/T/Q/N amino acids or due to the number of phosphorylation sites.

      In section 3, the authors undertake an extensive computational analysis of amino acid runs in 27 species. Many aspects of this section are fascinating to an expert reader. They identify regions with poly-X tracks. These data were not normalized correctly: I think that a null expectation for how often poly-X tracks occur should be built for each species based on the underlying prevalence of amino acids in that species. As a result, I believe that the claim is not well supported by the data.

      Strengths

      This work is about an interesting topic and contains stimulating bioinformatics analysis. The first two sections, where the authors investigate how S/T/Q/N abundance modulates protein expression level, are well supported by the data. The bioinformatics analysis of Q abundance in ciliate proteomes is fascinating. There are some ciliates that have repurposed stop codons to code for Q. The authors find that in these proteomes, Q-runs are greatly expanded. They offer interesting speculations on how this expansion might impact protein function.

      Weakness

      At this time, the manuscript is disorganized and difficult to read. An expert in the field, who will not be distracted by the disorganization, will find some very interesting results included. In particular, the order of the introduction does not match the rest of the paper.

      In the first and second sections, where the authors investigate how S/T/Q/N abundance modulates protein expression levels, it is unclear if the effect is due to the number of phosphorylation sites or the number of S/T/Q/N residues.

      There are three reasons why the number of phosphorylation sites in the Q-rich motifs is not relevant to their autonomous protein expression-enhancing (PEE) activities:

      First, we have reported previously that phosphorylation-defective Rad51-NTD (Rad51-3SA) and wild-type Rad51-NTD exhibit similar autonomous PEE activity. Mec1/Tel1-dependent phosphorylation of Rad51-NTD antagonizes the proteasomal degradation pathway, increasing the half-life of Rad51 from ∼30 min to ≥180 min (Ref 27; Woo, T. T. et al. 2020).

      1. T. T. Woo, C. N. Chuang, M. Higashide, A. Shinohara, T. F. Wang, Dual roles of yeast Rad51 N-terminal domain in repairing DNA double-strand breaks. Nucleic Acids Res 48, 8474-8489 (2020).

      Second, in our preprint manuscript, we have shown that phosphorylation-defective Rad53-SCD1 (Rad53-SCD1-5STA) also exhibits autonomous PEE activity similar to that of wild-type Rad53-SCD1 (Figure 2D, Figure 4A and Figure 4C).

      Third, as revealed by the results of our preprint manuscript (Figure 4), it is the percentages, and not the numbers, of S/T/Q/N residues that are correlated with the PEE activities of Q-rich motifs.

      The authors also do not discuss if the N-end rule for protein stability applies to the lacZ reporter or the fusion proteins.

      The autonomous PEE function of S/T/Q-rich NTDs is unlikely to be relevant to the N-end rule. The N-end rule links the in vivo half-life of a protein to the identity of its N-terminal residues. In S. cerevisiae, the N-end rule operates as part of the ubiquitin system and comprises two pathways. First, the Arg/N-end rule pathway, involving a single N-terminal amidohydrolase Nta1, mediates deamidation of N-terminal asparagine (N) and glutamine (Q) into aspartate (D) and glutamate (E), which in turn are arginylated by a single Ate1 R-transferase, generating the Arg/N degron. N-terminal R and other primary degrons are recognized by a single N-recognin Ubr1 in concert with ubiquitin-conjugating Ubc2/Rad6. Ubr1 can also recognize several other N-terminal residues, including lysine (K), histidine (H), phenylalanine (F), tryptophan (W), leucine (L) and isoleucine (I) (Bachmair, A. et al. 1986; Tasaki, T. et al. 2012; Varshavsky, A. 2019). Second, the Ac/N-end rule pathway targets proteins containing N-terminally acetylated (Ac) residues. Prior to acetylation, the first amino acid methionine (M) is catalytically removed by Met-aminopeptidases (MetAPs), unless the residue at position 2 is non-permissive (too large) for MetAPs. If a retained N-terminal M or otherwise a valine (V), cysteine (C), alanine (A), serine (S) or threonine (T) residue is followed by residues that allow N-terminal acetylation, the proteins containing these Ac/N degrons are targeted for ubiquitylation and proteasome-mediated degradation by the Doa10 E3 ligase (Hwang, C. S. et al. 2010).

      A. Bachmair, D. Finley, A. Varshavsky, In vivo half-life of a protein is a function of its amino-terminal residue. Science 234, 179-186 (1986).

      T. Tasaki, S. M. Sriram, K. S. Park, Y. T. Kwon, The N-end rule pathway. Annu Rev Biochem 81, 261-289 (2012).

      A. Varshavsky, N-degron and C-degron pathways of protein degradation. Proc Natl Acad Sci 116, 358-366 (2019).

      C. S. Hwang, A. Shemorry, D. Auerbach, A. Varshavsky, The N-end rule pathway is mediated by a complex of the RING-type Ubr1 and HECT-type Ufd4 ubiquitin ligases. Nat Cell Biol 12, 1177-1185 (2010).

      The PEE activities of these S/T/Q-rich domains are unlikely to arise from counteracting the N-end rule for two reasons. First, the first two amino acid residues of Rad51-NTD, Hop1-SCD, Rad53-SCD1, Sup35-PND, Rad51-ΔN, and LacZ-NVH are MS, ME, ME, MS, ME, and MI, respectively, where M is methionine, S is serine, E is glutamic acid and I is isoleucine. Second, Sml1-NTD behaves similarly to these N-terminal fusion tags, despite its methionine and glutamine (MQ) amino acid signature at the N-terminus.

      The most interesting part of the paper is an exploration of S/T/Q/N-rich regions and other repetitive AA runs in 27 proteomes, particularly ciliates. However, this analysis is missing a critical control that makes it nearly impossible to evaluate the importance of the findings. The authors find the abundance of different amino acid runs in various proteomes. They also report the background abundance of each amino acid. They do not use this background abundance to normalize the runs of amino acids to create a null expectation from each proteome. For example, it has been clear for some time (Ruff, 2017; Ruff et al., 2016) that Drosophila contains a very high background of Q's in the proteome and it is necessary to control for this background abundance when finding runs of Q's.

      We apologize for not explaining sufficiently well the topic eliciting this reviewer’s concern in our preprint manuscript. In the second paragraph of page 14, we cite six references to highlight that SCDs are overrepresented in yeast and human proteins involved in several biological processes (32, 74), and that polyX prevalence differs among species (43, 75-77).

      1. Cheung HC, San Lucas FA, Hicks S, Chang K, Bertuch AA, Ribes-Zamora A. An S/T-Q cluster domain census unveils new putative targets under Tel1/Mec1 control. BMC Genomics. 2012;13:664.

      2. Mier P, Elena-Real C, Urbanek A, Bernado P, Andrade-Navarro MA. The importance of definitions in the study of polyQ regions: A tale of thresholds, impurities and sequence context. Comput Struct Biotechnol J. 2020;18:306-13.

      3. Cara L, Baitemirova M, Follis J, Larios-Sanz M, Ribes-Zamora A. The ATM- and ATR-related SCD domain is over-represented in proteins involved in nervous system development. Sci Rep. 2016;6:19050.

      4. Kuspa A, Loomis WF. The genome of Dictyostelium discoideum. Methods Mol Biol. 2006;346:15-30.

      5. Davies HM, Nofal SD, McLaughlin EJ, Osborne AR. Repetitive sequences in malaria parasite proteins. FEMS Microbiol Rev. 2017;41(6):923-40.

      6. Mier P, Alanis-Lobato G, Andrade-Navarro MA. Context characterization of amino acid homorepeats using evolution, position, and order. Proteins. 2017;85(4):709-19.

      We will cite the two references by Kiersten M. Ruff in our revised manuscript.

      K. M. Ruff and R. V. Pappu, (2015) Multiscale simulation provides mechanistic insights into the effects of sequence contexts of early-stage polyglutamine-mediated aggregation. Biophysical Journal 108, 495a.

      K. M. Ruff, J. B. Warner, A. Posey and P. S. Tan (2017) Polyglutamine length dependent structural properties and phase behavior of huntingtin exon1. Biophysical Journal 112, 511a.

      The authors could easily address this problem with the data and analysis they have already collected. However, at this time, without this normalization, I am hesitant to trust the lists of proteins with long runs of amino acid and the ensuing GO enrichment analysis.

      Ruff KM. 2017. Washington University in St.

      Ruff KM, Holehouse AS, Richardson MGO, Pappu RV. 2016. Proteomic and Biophysical Analysis of Polar Tracts. Biophys J 110:556a.

      We thank Reviewer #1 for this helpful suggestion and now address this issue by means of a different approach described below.

      Based on a previous study (43; Mier, P. et al. 2020), we applied seven different thresholds to seek both short and long, as well as pure and impure, polyX strings in 20 different representative near-complete proteomes, including 4X (4/4), 5X (4/5-5/5), 6X (4/6-6/6), 7X (4/7-7/7), 8-10X (≥50%X), 11-20X (≥50%X) and ≥21X (≥50%X).

      To normalize the runs of amino acids and create a null expectation from each proteome, we determined the ratios of the overall number of X residues for each of the seven polyX motifs relative to those in the entire proteome of each species, respectively. The results of four different polyX motifs are shown below, i.e., polyQ (Author response image 1), polyN (Author response image 2), polyS (Author response image 3) and polyT (Author response image 4).
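
      The normalization described above, i.e. the number of X residues found inside polyX motifs divided by the total number of X residues in the proteome, can be sketched as follows. This is a simplified illustration and not the authors' pipeline: the sliding-window motif definition, the function name `polyx_fraction` and the toy sequences are all assumptions made for the example.

```python
from fractions import Fraction


def polyx_fraction(proteins, residue, window, min_count):
    """Fraction of all `residue` occurrences that lie inside a polyX motif.

    A motif is defined here (an assumption) as any window of length `window`
    containing at least `min_count` copies of `residue`; e.g. window=5 and
    min_count=4 corresponds to the 4/5-5/5 threshold.
    """
    total = 0
    in_motif = 0
    for seq in proteins:
        total += seq.count(residue)
        covered = [False] * len(seq)
        for start in range(len(seq) - window + 1):
            if seq[start:start + window].count(residue) >= min_count:
                # Mark every position of a qualifying window as part of a motif.
                for pos in range(start, start + window):
                    covered[pos] = True
        in_motif += sum(
            1 for pos, aa in enumerate(seq) if covered[pos] and aa == residue
        )
    return Fraction(in_motif, total) if total else Fraction(0)


# Toy "proteome" of two made-up sequences.
toy_proteome = ["MQQQQAQ", "MAQA"]
pure_4q = polyx_fraction(toy_proteome, "Q", window=4, min_count=4)
impure_5q = polyx_fraction(toy_proteome, "Q", window=5, min_count=4)
print(pure_4q, impure_5q)
```

      Expressing the result as a fraction of the proteome-wide residue count makes species with very different background usage of a residue directly comparable, which is the purpose of the normalization requested by the reviewer.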

      Author response image 1.

      Q contents in 7 different types of polyQ motifs in 20 near-complete proteomes. The five ciliates with reassigned stop codons (TAAQ and TAGQ) are indicated in red. Stentor coeruleus, a ciliate with standard stop codons, is indicated in green.

      Author response image 2.

      N contents in 7 different types of polyN motifs in 20 near-complete proteomes. The five ciliates with reassigned stop codons (TAAQ and TAGQ) are indicated in red. Stentor coeruleus, a ciliate with standard stop codons, is indicated in green.

      Author response image 3.

      S contents in 7 different types of polyS motifs in 20 near-complete proteomes. The five ciliates with reassigned stop codons (TAAQ and TAGQ) are indicated in red. Stentor coeruleus, a ciliate with standard stop codons, is indicated in green.

      Author response image 4.

      T contents in 7 different types of polyT motifs in 20 near-complete proteomes. The five ciliates with reassigned stop codons (TAAQ and TAGQ) are indicated in red. Stentor coeruleus, a ciliate with standard stop codons, is indicated in green.

      The results summarized in these four new figures support that polyX prevalence differs among species and that the overall X contents of polyX motifs often but not always correlate with the X usage frequency in entire proteomes (43; Mier, P. et al. 2020).

      Most importantly, our results reveal that, compared to Stentor coeruleus or several non-ciliate eukaryotic organisms (e.g., Plasmodium falciparum, Caenorhabditis elegans, Danio rerio, Mus musculus and Homo sapiens), the five ciliates with reassigned TAAQ and TAGQ codons not only have higher Q usage frequencies, but also more polyQ motifs in their proteomes (Figure 1). In contrast, polyQ motifs prevail in Candida albicans, Candida tropicalis, Dictyostelium discoideum, Chlamydomonas reinhardtii, Drosophila melanogaster and Aedes aegypti, though the Q usage frequencies in their entire proteomes are not significantly higher than those of other eukaryotes (Figure 1). Due to their higher N usage frequencies, Dictyostelium discoideum, Plasmodium falciparum and Pseudocohnilembus persalinus have more polyN motifs than the other 23 eukaryotes we examined here (Figure 2). Generally speaking, all 26 eukaryotes we assessed have similar S usage frequencies and percentages of S contents in polyS motifs (Figure 3). Among these 26 eukaryotes, Dictyostelium discoideum possesses many more polyT motifs, though its T usage frequency is similar to that of the other 25 eukaryotes (Figure 4).

      In conclusion, these new normalized results confirm that the reassignment of stop codons to Q indeed results in both higher Q usage frequencies and more polyQ motifs in ciliates.  

      Reviewer #2 (Public Review):

      Summary:

      This study seeks to understand the connection between protein sequence and function in disordered regions enriched in polar amino acids (specifically Q, N, S and T). While the authors suggest that specific motifs facilitate protein-enhancing activities, their findings are correlative, and the evidence is incomplete. Similarly, the authors propose that the re-assignment of stop codons to glutamine-encoding codons underlies the greater use of glutamine in a subset of ciliates, but again, the conclusions here are, at best, correlative. The authors perform extensive bioinformatic analysis, with detailed (albeit somewhat ad hoc) discussion on a number of proteins. Overall, the results presented here are interesting, but are unable to exclude competing hypotheses.

      Strengths:

      Following up on previous work, the authors wish to uncover a mechanism associated with poly-Q and SCD motifs explaining proposed protein expression-enhancing activities. They note that these motifs often occur in IDRs and hypothesize that structural plasticity could be capitalized upon as a mechanism of diversification in evolution. To investigate this further, they employ bioinformatics to investigate the sequence features of proteomes of 27 eukaryotes. They deepen their sequence space exploration, uncovering sub-phylum-specific features associated with species in which a stop-codon substitution has occurred. The authors propose this stop-codon substitution underlies an expansion of poly-Q repeats and increased glutamine distribution.

      Weaknesses:

      The preprint provides extensive, detailed, and entirely unnecessary background information throughout, hampering reading and making it difficult to understand the ideas being proposed. The introduction provides a large amount of detailed background that appears entirely irrelevant for the paper. In many places, detailed discussions of specific proteins that are likely of interest to the authors occur, yet without context, this does not enhance the paper for the reader.

      The paper uses many unnecessary, new, or redefined acronyms which makes reading difficult. As examples:

      (1) Prion forming domains (PFDs). Do the authors mean prion-like domains (PLDs), an established term with an empirical definition from the PLAAC algorithm? If yes, they should say this. If not, they must define what a prion-forming domain is formally.

      The N-terminal domain (1-123 amino acids) of S. cerevisiae Sup35 was already referred to as a “prion forming domain (PFD)” in 2006 (Tuite, M. F. 2006). Since then, PFD has also been employed as an acronym in other yeast prion papers (Cox, B. S. et al. 2007; Toombs, J. A. et al. 2011).

      M. F. Tuite, Yeast prions and their prion forming domain. Cell 27, 397-407 (2005).

      B. S. Cox, L. Byrne, M. F. Tuite, Protein Stability. Prion 1, 170-178 (2007).

      J. A. Toombs, N. M. Liss, K. R. Cobble, Z. Ben-Musa, E. D. Ross, [PSI+] maintenance is dependent on the composition, not primary sequence, of the oligopeptide repeat domain. PLoS One 6, e21953 (2011).

      (2) SCD is already an acronym in the IDP field (meaning sequence charge decoration) - the authors should avoid this as their chosen acronym for Serine(S) / threonine (T)-glutamine (Q) cluster domains. Moreover, do we really need another acronym here (we do not).

      SCD was first used in 2005 as an acronym for the Serine (S)/threonine (T)-glutamine (Q) cluster domain in the DNA damage checkpoint field (Traven, A. and Heierhorst, J. 2005). Almost a decade later, SCD became an acronym for “sequence charge decoration” (Sawle, L. et al. 2015; Firman, T. et al. 2018).

      A. Traven and J. Heierhorst, SQ/TQ cluster domains: concentrated ATM/ATR kinase phosphorylation site regions in DNA-damage-response proteins. Bioessays 27, 397-407 (2005).

      L. Sawle and K. Ghosh, A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins. J. Chem Phys. 143, 085101 (2015).

      T. Firman and K. Ghosh, Sequence charge decoration dictates coil-globule transition in intrinsically disordered proteins. J. Chem Phys. 148, 123305 (2018).

      (3) Protein expression-enhancing (PEE) - just say expression-enhancing, there is no need for an acronym here.

      Thank you. Since we have shown that addition of Q-rich motifs to LacZ affects protein expression rather than transcription, we think it is better to use the “PEE” acronym.

      The results suggest autonomous protein expression-enhancing activities of regions of multiple proteins containing Q-rich and SCD motifs. Their definition of expression-enhancing activities is vague and the evidence they provide to support the claim is weak. While their previous work may support their claim with more evidence, it should be explained in more detail. The assay they choose is a fusion reporter measuring beta-galactosidase activity and tracking expression levels. Given the presented data they have shown that they can drive the expression of their reporters and that beta gal remains active, in addition to the increase in expression of fusion reporter during the stress response. They have not detailed what their control and mock treatment is, which makes complete understanding of their experimental approach difficult. Furthermore, their nuclear localization signal on the tag could be influencing the degradation kinetics or sequestering the reporter, leading to its accumulation and the appearance of enhanced expression. Their evidence refuting ubiquitin-mediated degradation does not have a convincing control.

      Based on the experimental results, the authors then go on to perform bioinformatic analysis of SCD proteins and polyX proteins. Unfortunately, there is no clear hypothesis for what is being tested; there is a vague sense of investigating polyX/SCD regions, but I did not find the connection between the first and second sections compelling (especially given polar-rich regions have been shown to engage in many different functions). As such, this bioinformatic analysis largely presents as many lists of percentages without any meaningful interpretation. The bioinformatics analysis lacks any kind of rigorous statistical tests, making it difficult to evaluate the conclusions drawn. The methods section is severely lacking. Specifically, many of the methods require the reader to read many other papers. While referencing prior work is, of course, important, the authors should ensure the methods in this paper provide the details needed to allow a reader to evaluate the work being presented. As it stands, this is not the case.

      Thank you. As described in detail below, we have now performed rigorous statistical testing using the GOfuncR package.

      Overall, my major concern with this work is that the authors make two central claims in this paper (as per the Discussion). The authors claim that Q-rich motifs enhance protein expression. The implication here is that Q-rich motif IDRs are special, but this is not tested. As such, they cannot exclude the competing hypothesis ("N-terminal disordered regions enhance expression").

      In fact, “N-terminal disordered regions enhance expression” exactly summarizes our hypothesis.

      On pages 12-13 and Figure 4 of our preprint manuscript, we explained our hypothesis in the paragraph entitled “The relationship between PEE function, amino acid contents, and structural flexibility”.

      The authors also do not explore the possibility that this effect is in part/entirely driven by mRNA-level effects (see Verma Nat Comms 2019).

      As pointed out by the first reviewer, we show evidence that the increase in protein abundance and enzymatic activity is not due to changes in plasmid copy number or mRNA abundance (Figure 2), and that this phenomenon is not affected by translational quality control mutants (Figure 3).

      As such, while these observations are interesting, they feel preliminary and, in my opinion, cannot be used to draw hard conclusions on how N-terminal IDR sequence features influence protein expression. This does not mean the authors are necessarily wrong, but from the data presented here, I do not believe strong conclusions can be drawn. That re-assignment of stop codons to Q increases proteome-wide Q usage. I was unable to understand what result led the authors to this conclusion.

      My reading of the results is that a subset of ciliates has re-assigned UAA and UAG from the stop codon to Q. Those ciliates have more polyQ-containing proteins. However, they also have more polyN-containing proteins and proteins enriched in S/T-Q clusters. Surely if this were a stop-codon-dependent effect, we'd ONLY see an enhancement in Q-richness, not a corresponding enhancement in all polar-rich IDR frequencies? It seems the better working hypothesis is that free-floating ciliate proteomes are enriched in polar amino acids compared to sessile ciliates.

      Thank you. These comments are not supported by the results in Figure 1.

      Regardless, the absence of any kind of statistical analysis makes it hard to draw strong conclusions here.

      We apologize for not explaining more clearly the results of Tables 5-7 in our preprint manuscript.

      To address the concerns raised by both reviewers about our GO enrichment analysis, we have now performed rigorous statistical testing for SCD and polyQ protein overrepresentation using the GOfuncR package (https://bioconductor.org/packages/release/bioc/html/GOfuncR.html). GOfuncR is an R package that conducts standard candidate vs. background enrichment analysis by means of the hypergeometric test. We then adjusted the raw p-values according to the family-wise error rate (FWER). The same method has been applied to GO enrichment analysis of the human genome (Huttenhower, C., et al. 2009).

      Huttenhower, C., Haley, E. M., Hibbs, M. A., Dumeaux, V., Barrett, D. R., Coller, H. A., and Troyanskaya, O. G. Exploring the human genome with functional maps. Genome Research 19, 1093-1106 (2009).
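      As an illustration only, the candidate vs. background test described in the response above can be sketched in a few lines of Python. This is not the GOfuncR implementation: the function names and all counts are hypothetical, and a Bonferroni correction is swapped in as one simple, conservative way of controlling the FWER, which may differ from the procedure GOfuncR applies.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_category, n_candidates, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    X is the number of category members among n_candidates genes drawn
    without replacement from n_background genes, n_category of which
    belong to the GO category.
    """
    total = comb(n_background, n_candidates)
    upper = min(n_category, n_candidates)
    tail = sum(
        comb(n_category, k) * comb(n_background - n_category, n_candidates - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def bonferroni(p_values):
    """Bonferroni adjustment (assumed here as a stand-in FWER method)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts for three GO categories (not taken from the manuscript):
# GO id -> (genes in category, candidate genes in category)
categories = {"GO:A": (50, 8), "GO:B": (300, 12), "GO:C": (20, 0)}
n_background, n_candidates = 6000, 200

raw = {
    go: hypergeom_enrichment_p(n_background, n_cat, n_candidates, n_hit)
    for go, (n_cat, n_hit) in categories.items()
}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
for go in categories:
    print(go, f"raw p = {raw[go]:.3g}", f"FWER-adjusted p = {adjusted[go]:.3g}")
```

      Exact integer binomials are used so that the sketch needs no external library; for genome-scale category counts a statistics package would normally be preferable.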

      The results presented in Author response image 5 and Author response image 6 support our hypothesis that Q-rich motifs prevail in proteins involved in specialized biological processes, including Saccharomyces cerevisiae RNA-mediated transposition, Candida albicans filamentous growth, peptidyl-glutamic acid modification in ciliates with reassigned stop codons (TAAQ and TAGQ), Tetrahymena thermophila xylan catabolism, Dictyostelium discoideum sexual reproduction, Plasmodium falciparum infection, as well as the nervous systems of Drosophila melanogaster, Mus musculus, and Homo sapiens (74). In contrast, peptidyl-glutamic acid modification and microtubule-based movement are not overrepresented with Q-rich proteins in Stentor coeruleus, a ciliate with standard stop codons.

      1. Cara L, Baitemirova M, Follis J, Larios-Sanz M, Ribes-Zamora A. The ATM- and ATR-related SCD domain is over-represented in proteins involved in nervous system development. Sci Rep. 2016;6:19050.

      Author response image 5.

      Selection of biological processes with overrepresented SCD-containing proteins in different eukaryotes. The percentages and number of SCD-containing proteins in our search that belong to each indicated Gene Ontology (GO) group are shown. GOfuncR (Huttenhower, C., et al. 2009) was applied for GO enrichment and statistical analysis. The p values adjusted according to the Family-wise error rate (FWER) are shown. The five ciliates with reassigned stop codons (TAAQ and TAGQ) are indicated in red. Stentor coeruleus, a ciliate with standard stop codons, is indicated in green.

      Author response image 6.

      Selection of biological processes with overrepresented polyQ-containing proteins in different eukaryotes. The percentages and numbers of polyQ-containing proteins in our search that belong to each indicated Gene Ontology (GO) group are shown. GOfuncR (Huttenhower, C., et al. 2009) was applied for GO enrichment and statistical analysis. The p values adjusted according to the Family-wise error rate (FWER) are shown. The five ciliates with reassigned stop codons (TAAQ and TAGQ) are indicated in red. Stentor coeruleus, a ciliate with standard stop codons, is indicated in green.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      Jocher, Janssen, et al examine the robustness of comparative functional genomics studies in primates that make use of induced pluripotent stem cell-derived cells. Comparative studies in primates, especially amongst the great apes, are generally hindered by the very limited availability of samples, and iPSCs, which can be maintained in the laboratory indefinitely and differentiated into other cell types, have emerged as promising model systems because they allow the generation of data from tissues and cells that would otherwise be unobservable.

      Undirected differentiation of iPSCs into many cell types at once, using a method known as embryoid body differentiation, requires researchers to manually assign all cell types in the dataset so they can be correctly analysed. Typically, this is done using marker genes associated with a specific cell type. These are defined a priori, and have historically tended to be characterised in mice and humans and then employed to annotate other species. Jocher, Janssen, et al ask if the marker genes and features used to define a given cell type in one species are suitable for use in a second species, and then quantify the degree of usefulness of these markers. They find that genes that are informative and cell type specific in a given species are less valuable for cell type identification in other species, and that this value, or transferability, drops off as the evolutionary distance between species increases.

      This paper will help guide future comparative studies of gene expression in primates (and more broadly) as well as add to the growing literature on the broader challenges of selecting powerful and reliable marker genes for use in single-cell transcriptomics.

      Strengths:

      Marker gene selection and cell type annotation is a challenging problem in scRNA studies, and successful classification of cells often requires manual expert input. This can be hard to reproduce across studies, as, despite general agreement on the identity of many cell types, different methods for identifying marker genes will return different sets of genes. The rise of comparative functional genomics complicates this even further, as a robust marker gene in one species need not always be as useful in a different taxon. The finding that so many marker genes have poor transferability is striking, and by interrogating the assumption of transferability in a thorough and systematic fashion, this paper reminds us of the importance of systematically validating analytical choices. The focus on identifying how transferability varies across different types of marker genes (especially when comparing TFs to lncRNAs), and on exploring different methods to identify marker genes, also suggests additional criteria by which future researchers could select robust marker genes in their own data.

      The paper is built on a substantial amount of clearly reported and thoroughly considered data, including EBs and cells from four different primate species - humans, orangutans, and two macaque species. The authors go to great lengths to ensure the EBs are as comparable as possible across species, and take similar care with their computational analyses, always erring on the side of drawing conservative conclusions that are robustly supported by their data over more tenuously supported ones that could be impacted by data processing artefacts such as differences in mappability, etc. For example, I like the approach of using liftoff to robustly identify genes in non-human species that can be mapped to and compared across species confidently, rather than relying on the likely incomplete annotation of the non-human primate genomes. The authors also provide an interactive data visualisation website that allows users to explore the dataset in depth, examine expression patterns of their own favourite marker genes and perform the same kinds of analyses on their own data if desired, facilitating consistency between comparative primate studies.

      We thank the Reviewer for their kind assessment of our work.

      Weaknesses and recommendations:

      (1) Embryoid body generation is known to be highly variable from one replicate to the next for both technical and biological reasons, and the authors do their best to account for this, both by their testing of different ways of generating EBs, and by including multiple technical replicates/clones per species. However, there is still some variability that could be worth exploring in more depth. For example, the orangutan seems to have differentiated preferentially towards cardiac mesoderm whereas the other species seemed to prefer ectoderm fates, as shown in Figure 2C. Likewise, Supplementary Figure 2C suggests a significant unbalance in the contributions across replicates within a species, which is not surprising given the nature of EBs, while Supplementary Figure 6 suggests that despite including three different clones from a single rhesus macaque, most of the data came from a single clone. The manuscript would be strengthened by a more thorough exploration of the intra-species patterns of variability, especially for the taxa with multiple biological replicates, and how they impact the number of cell types detected across taxa, etc.

      You are absolutely correct in pointing out that the large clonal variability in cell type composition is a challenge for our analysis. We also noted the odd behavior of the orangutan EBs, and their underrepresentation of ectoderm. There are many possible sources for these variable differentiation propensities: clone, sample origin (in this case urine) and individual. However, unfortunately for the orangutan, we have only one individual and one sample origin and thus cannot say whether this germ layer preference says something about the species or is due to our specific sample.

      Because of this high variability from multiple sources, getting enough cell types with an appreciable overlap between species was limiting to analyses. In order to be able to derive meaningful conclusions from intra-species analyses and the impact of different sources of variation on cell type propensity, we would need to sequence many more EBs with an experimental design that balances possible sources of variation. This would go beyond the scope of this study.

      Instead, here we control for intra-species variation in our analyses as much as possible: For the analysis of cell type specificity and conservation the comparison is relative for the different specificity degrees (Figure 3C).  For the analysis of marker gene conservation, we explicitly take intra-species variation into account (Figure 4D).

      The same holds for the temporal aspect of the data, which is not really discussed in depth despite being a strength of the design. Instead, days 8 and 16 are analysed jointly, without much attention being paid to the possible differences between them.

      Concerning the temporal aspect, indeed we knowingly omitted to include an explicit comparison of day 8 and day 16 EBs, because we felt that it was not directly relevant to our main message. Our pseudotime analysis showed that the differences of the two time points were indeed a matter of degree and not so much of quality. All major lineages were already present at day 8 and even though day 8 cells had on average earlier pseudotimes, there was a large overlap in the pseudotime distributions between the two sampling time points (Author response image 1). That is why we decided to analyse the data together.

      Are EBs at day 16 more variable between species than at day 8? Is day 8 too soon to do these kinds of analyses?

      When we started the experiment, we simply did not know what to expect. We were worried that cell types at day 8 might be too transient, but longer culture can also introduce biases. That is why we wanted to look at two time points; however, as mentioned above, the differences are a matter of degree.

      Concerning the cell type composition: yes, day 16 EBs are more heterogeneous than day 8 EBs. Firstly, older EBs have more distinguishable cell types and hence even if all EBs had identical composition, the sampling variance would be higher given that we sampled a similar number of cells from both time points. Secondly, in order to grow EBs for a longer time, we moved them from floating to attached culture on day 8 and it is unclear how much variance is added by this extra handling step.

      Are markers for earlier developmental progenitors better/more transferable than those for more derived cell types?

      We did not see any differences in the marker conservation between early and late cell types, but we have too little data to say whether this carries biological meaning.

      Author response image 1.

      Pseudotime analysis for a differentiation trajectory towards neurons. Single cells were first aggregated into metacells per species using SEACells (Persad et al. 2023). Pluripotent and ectoderm metacells were then integrated across all four species using Harmony and a combined pseudotime was inferred with Slingshot (Street et al. 2018), specifying iPSCs as the starting cluster. Here, lineage 3 is shown, illustrating a differentiation towards neurons. (A) PHATE embedding colored by pseudotime (Moon et al. 2019). (B) PHATE embedding colored by celltype. (C) Pseudotime distribution across the sampling timepoints (day 8 and day 16) in different species.

      (2) Closely tied to the point above, by necessity the authors collapse their data into seven fairly coarse cell types and then examine the performance of canonical marker genes (as well as those discovered de novo) across the species. However some of the clusters they use are somewhat broad, and so it is worth asking whether the lack of specificity exhibited by some marker genes and driving their conclusions is driven by inter-species heterogeneity within a given cluster.

      Author response image 2.

      UMAP visualization for the Harmony-integrated dataset across all four species for the seven shared cell types, colored by cell type identity (A) and species (B).

      Good point. If we understand correctly, the concern is that in our relatively broadly defined cell types, species are not well mixed and that this in turn is partly responsible for marker gene divergence. This problem is indeed difficult to address, because most approaches to evaluate this require integration across species, which might lead to questionable results (see our Discussion).

      Nevertheless, we attempted an integration across all four species. To this end, we subset the cells for the 7 cell types that we found in all four species and visualized cell types and species in the UMAPs above (Author response image 2).

      We see that cardiac fibroblasts appear poorly integrated in the UMAP, but they still have very transferable marker genes across species. We quantified integration quality using the cell-specific mixing score (cms) (Lütge et al. 2021) and indeed found that the proportion of well-integrated cells is lowest for cardiac fibroblasts (Author response image 3A). On the other end of the cms spectrum, neural crest cells appear to have the best integration across species, but their marker transferability between species is rather worse than for cardiac fibroblasts (Supplementary Figure 9). The rank-biased overlap scores calculated per cell type, which we use to assess marker gene conservation, show the same trends (Author response image 3B) as the F1 scores for marker gene transferability. Hence, given our current dataset, we do not see any indication that the low marker gene conservation is a result of too broadly defined cell types.

      Author response image 3.

      (A) Evaluation of species mixing per cell type in the Harmony-integrated dataset, quantified by the fraction of cells with an adjusted cell-specific mixing score (cms) above 0.05. (B) Summary of rank-biased overlap (RBO) scores per cell type to assess concordance of marker gene rankings for all species pairs.
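      For readers unfamiliar with rank-biased overlap, the sketch below shows one common way to compute it for two ranked marker gene lists. It is a minimal illustration, not the code behind Author response image 3B: it uses the truncated form of RBO evaluated to the depth of the shorter list, the persistence parameter p = 0.9 is an assumed default, and the gene labels are placeholders.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two rankings without duplicates.

    At every depth d the agreement is the fraction of items shared by the
    two top-d prefixes; agreements are weighted by p**(d - 1), so the top
    ranks dominate the score. Because the sum stops at a finite depth, two
    identical lists of length k score 1 - p**k rather than exactly 1.
    """
    depth = min(len(ranking_a), len(ranking_b))
    score = 0.0
    for d in range(1, depth + 1):
        shared = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        score += p ** (d - 1) * shared / d
    return (1 - p) * score


# Placeholder marker rankings for one cell type in two species (hypothetical)
species_1 = ["geneA", "geneB", "geneC", "geneD", "geneE"]
species_2 = ["geneB", "geneA", "geneC", "geneF", "geneD"]

print(f"RBO (species 1 vs species 2): {rbo_truncated(species_1, species_2):.3f}")
print(f"RBO (species 1 vs itself):    {rbo_truncated(species_1, species_1):.3f}")
```

      The weighting means that disagreement among the top-ranked markers lowers the score more than disagreement further down the lists, which is the property that makes the measure attractive for comparing marker rankings.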

      Reviewer #2 (Public review):

      Summary:

      The authors present an important study on identifying and comparing orthologous cell types across multiple species. This manuscript focuses on characterizing cell types in embryoid bodies (EBs) derived from induced pluripotent stem cells (iPSCs) of four primate species, humans, orangutans, cynomolgus macaques, and rhesus macaques, providing valuable insights into cross-species comparisons.

      Strengths:

      To achieve this, the authors developed a semi-automated computational pipeline that integrates classification and marker-based cluster annotation to identify orthologous cell types across primates. This study makes a significant contribution to the field by advancing cross-species cell type identification.

      We thank the reviewer for their positive and thoughtful feedback.

      Weaknesses:

      However, several critical points need to be addressed.

      (1) Use of Liftoff for GTF Annotation

      The authors used Liftoff to generate GTF files for Pongo abelii, Macaca fascicularis, and Macaca mulatta by transferring the hg38 annotation to the corresponding primate genomes. However, it is unclear why they did not use species-specific GTF files, as all these genomes have existing annotations. Why did the authors choose not to follow this approach?

      As Reviewer 1 also points out, we have observed that the annotations of non-human primate genomes often have truncated 3’UTRs. This is especially problematic for 3’ UMI transcriptome data such as the 10x dataset that we present here. To illustrate this, we compared the Liftoff annotation derived from Gencode v32, which we used throughout our manuscript, to the Ensembl gene annotation Macaca_fascicularis_6.0.111. We used transcriptomes from human and cynomolgus iPSC bulk RNA-seq (Kliesmete et al. 2024) generated with the Prime-seq protocol (Janjic et al. 2022), which is very similar to 10x in that it also uses 3’ UMIs. On average, using Liftoff produces higher counts than the Ensembl annotation (Author response image 4A). Moreover, when comparing across species, using Ensembl for the macaque leads to an asymmetry in differentially expressed genes, with apparently many more up-regulated genes in humans. In contrast, when we use the Liftoff annotation, we detect fewer DE genes, and a similar number of genes is up-regulated in macaques as in humans (Author response image 4B). We think that the many additional DE genes are artifacts due to mismatched annotations in humans and cynomolgus macaques. We illustrate this for the transcription factor SALL4 in Author response image 4C,D. The Ensembl annotation reports 2 transcripts, while Liftoff from Gencode v32 suggests 5 transcripts, one of which has a longer 3’UTR. This longer transcript is also supported by Nanopore data from macaque iPSCs. The truncation of the 3’UTR in this case leads to an underestimation of SALL4 expression in macaques, and hence SALL4 is detected as up-regulated in humans (DESeq2: LFC=1.34, p-adj<2e-9). In contrast, when using the Liftoff annotation, SALL4 does not appear to be DE between humans and macaques (LFC=0.33, p-adj=0.20).

      Author response image 4. 

      (A) UMI counts per gene for the same cynomolgus macaque iPSC samples. On the x-axis, the GTF file from Ensembl Macaca_fascicularis_6.0.111 was used for counting; on the y-axis, we used our filtered Liftoff annotation, which transfers the human gene models from Gencode v32. (B) The number of DE genes between human and cynomolgus iPSCs detected with DESeq2. For Liftoff, we counted human samples using Gencode v32 and compared them to counts based on the Liftoff annotation of the same human gene models transferred to macFas6. For Ensembl, we used Gencode v32 for the human and Ensembl Macaca_fascicularis_6.0.111 for the macaque. For both comparisons, we subset the genes to one-to-one orthologues as annotated in biomart. Up- and down-regulation is relative to human expression. (C) Read counts for one example gene, SALL4. Here, in addition to the Liftoff and Ensembl annotations, we also used transcripts derived from Nanopore cDNA sequencing of cynomolgus iPSCs. (D) Gene models for SALL4 in macFas6 coordinates and coverage from iPSC Prime-seq bulk RNA-sequencing.
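      The effect of a truncated 3’UTR on 3’-tagged counts, as described in the response above, can be made concrete with a toy example. The sketch below is purely illustrative: the coordinates, read positions and function name are invented, a single plus-strand gene is assumed, and real quantification pipelines assign reads in a more involved way.

```python
def count_reads_in_gene(read_positions, gene_start, gene_end):
    """Count reads whose position falls inside the annotated gene interval."""
    return sum(gene_start <= pos <= gene_end for pos in read_positions)


# Toy 3'-tagged reads: most pile up close to the true transcript end
# (all coordinates are hypothetical)
read_positions = [980, 1010, 1150, 1190, 1230]

# Same gene, two annotations that differ only in where the 3'UTR ends
truncated_annotation = (100, 1000)  # 3'UTR annotated too short
extended_annotation = (100, 1250)   # 3'UTR reaches the true transcript end

truncated_count = count_reads_in_gene(read_positions, *truncated_annotation)
extended_count = count_reads_in_gene(read_positions, *extended_annotation)

print("counts with truncated 3'UTR:", truncated_count)
print("counts with extended 3'UTR: ", extended_count)
```

      In this toy case only one of the five reads is assigned under the truncated model, so the gene would look lowly expressed in the species with the shorter annotation even though the underlying data are identical.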

      (2) Transcript Filtering and Potential Biases

      The authors excluded transcripts with partial mapping (<50%), low sequence identity (<50%), or excessive length differences (>100 bp and >2× length ratio). Such filtering may introduce biases in read alignment. Did the authors evaluate the impact of these filtering choices on alignment rates?

      We excluded those transcripts from analysis in both species, because they present a convolution of sequence-annotation differences and expression. The focus of our study is on regulatory evolution, and we knowingly omit marker differences that are due to a marker being mutated away. We will make this clearer in the text of a revised version.
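      The exclusion criteria summarised in the reviewer's comment above can be written as a small predicate. This is a hedged sketch rather than the filtering code used in the study: the record type and field names are hypothetical, and the length criterion is read as requiring both conditions at once (more than 100 bp difference and more than a 2-fold length ratio), which is an assumption about the intended meaning.

```python
from dataclasses import dataclass


@dataclass
class LiftedTranscript:
    transcript_id: str
    coverage: float      # fraction of the human transcript that mapped (0 to 1)
    identity: float      # sequence identity of the mapped part (0 to 1)
    human_length: int    # length in bp of the human transcript
    lifted_length: int   # length in bp of the transferred transcript


def passes_filter(t: LiftedTranscript) -> bool:
    """Return True if the transferred transcript is kept for analysis."""
    if t.coverage < 0.5:   # partial mapping
        return False
    if t.identity < 0.5:   # low sequence identity
        return False
    shorter = min(t.human_length, t.lifted_length)
    longer = max(t.human_length, t.lifted_length)
    if shorter == 0:       # nothing usable was transferred
        return False
    # Excessive length difference: both conditions must hold (assumption)
    if longer - shorter > 100 and longer / shorter > 2:
        return False
    return True


# Hypothetical records, for illustration only
transcripts = [
    LiftedTranscript("tx1", coverage=0.98, identity=0.96, human_length=2400, lifted_length=2350),
    LiftedTranscript("tx2", coverage=0.40, identity=0.95, human_length=1800, lifted_length=700),
    LiftedTranscript("tx3", coverage=0.90, identity=0.93, human_length=3000, lifted_length=1200),
]
kept = [t.transcript_id for t in transcripts if passes_filter(t)]
print("kept transcripts:", kept)
```

      Applying the same predicate to both species, as the response describes, keeps the set of compared transcripts identical on either side of the comparison.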

      (3) Data Integration with Harmony

      The methods section does not specify the parameters used for data integration with Harmony. Including these details would clarify how cross-species integration was performed.

      We want to stress that none of our conservation and marker gene analyses relies on cross-species integration. We only used the Harmony-integrated data for visualisation in Figure 1 and the rough germ-layer check in Supplementary Figure S3. We will add a better description in the revised version.

      References

      Janjic, Aleksandar, Lucas E. Wange, Johannes W. Bagnoli, Johanna Geuder, Phong Nguyen, Daniel Richter, Beate Vieth, et al. 2022. “Prime-Seq, Efficient and Powerful Bulk RNA Sequencing.” Genome Biology 23 (1): 88.

      Kliesmete, Zane, Peter Orchard, Victor Yan Kin Lee, Johanna Geuder, Simon M. Krauß, Mari Ohnuki, Jessica Jocher, Beate Vieth, Wolfgang Enard, and Ines Hellmann. 2024. “Evidence for Compensatory Evolution within Pleiotropic Regulatory Elements.” Genome Research 34 (10): 1528–39.

      Lütge, Almut, Joanna Zyprych-Walczak, Urszula Brykczynska Kunzmann, Helena L. Crowell, Daniela Calini, Dheeraj Malhotra, Charlotte Soneson, and Mark D. Robinson. 2021. “CellMixS: Quantifying and Visualizing Batch Effects in Single-Cell RNA-Seq Data.” Life Science Alliance 4 (6): e202001004.

      Moon, Kevin R., David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, et al. 2019. “Visualizing Structure and Transitions in High-Dimensional Biological Data.” Nature Biotechnology 37 (12): 1482–92.

      Persad, Sitara, Zi-Ning Choo, Christine Dien, Noor Sohail, Ignas Masilionis, Ronan Chaligné, Tal Nawy, et al. 2023. “SEACells Infers Transcriptional and Epigenomic Cellular States from Single-Cell Genomics Data.” Nature Biotechnology 41 (12): 1746–57.

      Street, Kelly, Davide Risso, Russell B. Fletcher, Diya Das, John Ngai, Nir Yosef, Elizabeth Purdom, and Sandrine Dudoit. 2018. “Slingshot: Cell Lineage and Pseudotime Inference for Single-Cell Transcriptomics.” BMC Genomics 19 (1): 477.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      We thank the reviewer for his valuable input and careful assessment, which have significantly improved the clarity and rigor of our manuscript.

      Summary:

      Mazer & Yovel 2025 dissect the inverse problem of how echolocators in groups manage to navigate their surroundings despite intense jamming using computational simulations.

      The authors show that despite the 'noisy' sensory environments that echolocating groups present, agents can still access some amount of echo-related information and use it to navigate their local environment. It is known that echolocating bats have strong small and large-scale spatial memory that plays an important role for individuals. The results from this paper also point to the potential importance of an even lower-level, short-term role of memory in the form of echo 'integration' across multiple calls, despite the unpredictability of echo detection in groups. The paper generates a useful basis to think about the mechanisms in echolocating groups for experimental investigations too.

      Strengths:

      (1) The paper builds on biologically well-motivated and parametrised 2D acoustics and sensory simulation setup to investigate the various key parameters of interest

      (2) The 'null-model' of echolocators not being able to tell apart objects & conspecifics while echolocating still shows agents successfully emerge from groups - even though the probability of emergence drops severely in comparison to cognitively more 'capable' agents. This is nonetheless an important result showing the direction-of-arrival of a sound itself is the 'minimum' set of ingredients needed for echolocators navigating their environment.

      (3) The results generate an important basis in unraveling how agents may navigate in sensorially noisy environments with a lot of irrelevant and very few relevant cues.

      (4) The 2D simulation framework is simple and computationally tractable enough to perform multiple runs to investigate many variables - while also remaining true to the aim of the investigation.

      Weaknesses:

      There are a few places in the paper that can be misunderstood or don't provide complete details. Here is a selection:

      (1) Line 61: '... studies have focused on movement algorithms while overlooking the sensory challenges involved' : This statement does not match the recent state of the literature. While the previous models may have had the assumption that all neighbours can be detected, there are models that specifically study the role of limited interaction arising from a potential inability to track all neighbours due to occlusion, and the effect of responding to only one/few neighbours at a time e.g. Bode et al. 2011 R. Soc. Interface, Rosenthal et al. 2015 PNAS, Jhawar et al. 2020 Nature Physics.

      We appreciate the reviewer's comment and the relevant references. We have revised the manuscript accordingly to clarify the distinction between studies that incorporate limited interactions and those that explicitly analyze sensory constraints and interference. We have refined our statement to acknowledge these contributions while maintaining our focus on sensory challenges beyond limited neighbor detection, such as signal degradation, occlusion effects, and multimodal sensory integration (see lines 61-64):

      While collective movement has been extensively studied in various species, including insect swarming, fish schooling, and bird murmuration (Pitcher, Partridge and Wardle, 1976; Partridge, 1982; Strandburg-Peshkin et al., 2013; Pearce et al., 2014; Rosenthal et al., 2015; Bastien and Romanczuk, 2020; Davidson et al., 2021; Aidan, Bleichman and Ayali, 2024), as well as in swarm robotics agents performing tasks such as coordinated navigation and maze-solving (Faria Dias et al., 2021; Youssefi and Rouhani, 2021; Cheraghi, Shahzad and Graffi, 2022), most studies have focused on movement algorithms, often assuming full detection of neighbors (Parrish and Edelstein-Keshet, 1999; Couzin et al., 2002, 2005; Sumpter et al., 2008; Nagy et al., 2010; Bialek et al., 2012; Gautrais et al., 2012; Attanasi et al., 2014). Some models have incorporated limited interaction rules where individuals respond to one or a few neighbors due to sensory constraints (Bode, Franks and Wood, 2011; Jhawar et al., 2020). However, fewer studies explicitly examine how sensory interference, occlusion, and noise shape decision-making in collective systems (Rosenthal et al., 2015).

      (2) The word 'interference' is used loosely in places (Line 89: '...took all interference signals...', Line 319: 'spatial interference') - this is confusing as it is not clear whether the authors refer to interference in the physics/acoustics sense, or broadly speaking as a synonym for reflections and/or jamming.

      To improve clarity, we have revised the manuscript to distinguish between different types of interference:

      · Acoustic interference (jamming): Overlapping calls that completely obscure echo detection, preventing bats from perceiving necessary environmental cues.

      · Acoustic interference (masking): Partial reduction in signal clarity due to competing calls.

      · Spatial interference: Physical obstruction by conspecifics affecting movement and navigation.

      We have updated the manuscript to use these terms consistently and explicitly define them in relevant sections (see lines 87-94 and 329-330). This distinction ensures that the reader can differentiate between interference as an acoustic phenomenon and its broader implications in navigation.

      (3) The paper discusses original results without reference to how they were obtained or what was done. The lack of detail here must be considered while interpreting the Discussion e.g. Line 302 ('our model suggests...increasing the call-rate..' - no clear mention of how/where call-rate was varied) & Line 323 '..no benefit beyond a certain level..' - also no clear mention of how/where call-level was manipulated in the simulations.

      All tested parameters, including call rate dynamics and call intensity variations, are detailed in the Methods section and Tables 1 and 2. Specifically:

      · Call Rate Variation: The Inter-Pulse Interval (IPI) was modeled based on documented echolocation behavior, decreasing from 100 msec during the search phase to 35 msec (~28 calls per second) at the end of the approach phase, and to 5 msec (200 calls per second) during the final buzz (see Table 2). This natural variation in call rate was not manually manipulated in the model but emerged from the simulated bat behavior.

      · Call Intensity Variation: The tested call intensity levels (100, 110, 120, 130 dB SPL) are presented in Table 1 under the “Call Level” parameter. The effect of increasing call intensity was analyzed in relation to exit probability, jamming probability, and collision rate. This is now explicitly referenced in the Discussion.
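      The relation between the inter-pulse intervals quoted above and the corresponding call rates is simple arithmetic (rate = 1000 / IPI in msec). The following is a minimal sketch of that conversion only, not the simulation code; the function name `calls_per_second` is a hypothetical helper introduced for illustration.

```python
def calls_per_second(ipi_ms: float) -> float:
    """Convert an inter-pulse interval (milliseconds) to a call rate (calls/s)."""
    if ipi_ms <= 0:
        raise ValueError("IPI must be positive")
    return 1000.0 / ipi_ms


# IPI values quoted in the response: search, end of approach, final buzz.
for phase, ipi_ms in [("search", 100.0), ("approach end", 35.0), ("buzz", 5.0)]:
    print(f"{phase}: IPI {ipi_ms:g} ms -> {calls_per_second(ipi_ms):.1f} calls/s")
```

      For the three quoted intervals this gives 10, about 28.6, and 200 calls per second, matching the figures stated above.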

      We have revised the manuscript to explicitly reference these aspects in the Results and Discussion sections.

      Reviewer #2 (Public review):

      We are grateful for the reviewer’s insightful feedback, which has helped us clarify key aspects of our research and strengthen our conclusions.

      This manuscript describes a detailed model of bats flying together through a fixed geometry. The model considers elements that are faithful to both bat biosonar production and reception and the acoustics governing how sound moves in the air and interacts with obstacles. The model also incorporates behavioral patterns observed in bats, like one-dimensional feature following and temporal integration of cognitive maps. From a simulation study of the model and comparison of the results with the literature, the authors gain insight into how often bats may experience destructive interference of their acoustic signals and those of their peers, and how much such interference may actually negatively affect the groups' ability to navigate effectively. The authors use generalized linear models to test the significance of the effects they observe.

      In terms of its strengths, the work relies on a thoughtful and detailed model that faithfully incorporates salient features, such as acoustic elements like the filter for a biological receiver and temporal aggregation as a kind of memory in the system. At the same time, the authors abstract away features that would complicate the model without being expected to give additional insights, as can be seen in the choice of a two-dimensional rather than three-dimensional system. I thought that the level of abstraction in the model was perfect, enough to demonstrate their results without needless details. The results are compelling and interesting, and the authors do a great job discussing them in the context of the biological literature.

      The most notable weakness I found in this work was that some aspects of the model were not entirely clear to me.

      For example, the directionality of the bat's sonar call in relation to its velocity. Are these the same?

      For simplicity, in our model, the head is aligned with the body, therefore the direction of the echolocation beam is the same as the direction of the flight.

      Moreover, call directionality (directivity) is not directly influenced by velocity. Instead, directionality is estimated using the piston model, as described in the Methods section. The directionality is based on the emission frequency and is thus primarily linked to the behavioral phases of the bat, with frequency shifts occurring as the bat transitions from search to approach to buzz phases. During the approach phase, the bat emits calls with higher frequencies, resulting in increased directionality. This is supported by the literature (Jakobsen and Surlykke, 2010; Jakobsen, Brinkløv and Surlykke, 2013). This phase is also associated with a natural reduction in flight speed, which is a well-documented behavioral adaptation in echolocating bats (Jakobsen et al., 2024).

      To clarify this in the manuscript, we have updated the text to explicitly state that directionality follows phase-dependent frequency changes rather than being a direct function of velocity, see lines 460-465.
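      For readers unfamiliar with the piston model, its standard closed form is D(θ) = |2·J1(ka·sin θ) / (ka·sin θ)|, with wavenumber k = 2πf/c and piston radius a. The sketch below evaluates this textbook expression to show why a higher emission frequency yields a narrower beam. It is an illustration under stated assumptions, not the authors' implementation: the piston radius (4 mm), the two frequencies, and the speed of sound are assumed values, and `bessel_j1` and `piston_directivity` are hypothetical helper names.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 °C


def bessel_j1(x: float, steps: int = 400) -> float:
    """First-order Bessel function J1(x) from Bessel's integral,
    J1(x) = (1/pi) * integral_0^pi cos(tau - x*sin(tau)) dtau,
    evaluated with the trapezoid rule."""
    h = math.pi / steps
    total = 0.0
    for i in range(steps + 1):
        tau = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.cos(tau - x * math.sin(tau))
    return total * h / math.pi


def piston_directivity(theta_rad: float, freq_hz: float,
                       radius_m: float = 0.004) -> float:
    """Relative on-axis-normalised amplitude of a circular piston at angle theta."""
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    x = k * radius_m * math.sin(theta_rad)
    if abs(x) < 1e-12:
        return 1.0  # on-axis limit of 2*J1(x)/x
    return abs(2.0 * bessel_j1(x) / x)


# Assumed example: the same 30-degree off-axis angle at two frequencies.
for f in (30e3, 60e3):
    d = piston_directivity(math.radians(30.0), f)
    print(f"{f / 1e3:.0f} kHz: relative amplitude at 30 deg = {d:.2f}")
```

      With these assumed values the off-axis amplitude is lower at the higher frequency, which is the sense in which higher-frequency calls are more directional.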

      If so, what is the difference between phi_target and phi_tx in the model equations?

      represents the angle between the bat and the reflected object (target).

      the angle [rad], between the masking bat and target (from the transmitter’s perspective)

      refers to the angle between the transmitting conspecific and the receiving focal bat, from the transmitter’s point of view.

      represents the angle between the receiving bat and the transmitting bat, from the receiver’s point of view.

      These definitions have been explicitly stated in the revised manuscript to prevent any ambiguity (lines 467-468). Additionally, a Supplementary figure demonstrating the geometrical relations has been added to the manuscript.

      Author response image 1.

      What is a bat's response to colliding with a conspecific (rather than a wall)?

      In nature, minor collisions between bats are common and typically do not result in significant disruptions to flight (Boerma et al., 2019; Roy et al., 2019; Goldstein et al., 2024). Given this, our model does not explicitly simulate the physical impact of a collision event. Instead, during the collision event the bat keeps decreasing its velocity and changing its flight direction until the distance between bats is above the threshold (0.4 m). We assume that the primary cost of such interactions arises from the effort required to avoid collisions, rather than from the collision itself. This assumption aligns with observations of bat behavior in dense flight environments, where individuals prioritize collision avoidance; we therefore do not model post-collision dynamics.

      From the statistical side, it was not clear if replicate simulations were performed. If they were, which I believe is the right way due to stochasticity in the model, how many replicates were used, and are the standard errors referred to throughout the paper between individuals in the same simulation or between independent simulations, or both?

      The number of repetitions for each scenario is detailed in Table 1, but we included it in a more prominent location in the text for clarity. Specifically, we now state (Lines 274-275):

      "The number of repetitions for each scenario was as follows: 1 bat: 240; 2 bats: 120; 5 bats: 48; 10 bats: 24; 20 bats: 12; 40 bats: 12; 100 bats: 6."

      Regarding the reported standard errors, they are calculated across all individuals within each scenario, without distinguishing between different simulation trials.

      We clarified this in the revised text (Lines 534-535 in Statistical Analysis).

      Overall, I found these weaknesses to be superficial and easily remedied by the authors. The authors presented well-reasoned arguments that were supported by their results, and which were used to demonstrate how call interference impacts the collective's roost exit as measured by several variables. As the authors highlight, I think this work is valuable to individuals interested in bat biology and behavior, as well as to applications in engineered multi-agent systems like robotic swarms.

      Reviewer #3 (Public review):

      We sincerely appreciate the reviewer’s thoughtful comments and the time invested in evaluating our work, which have greatly contributed to refining our study.

      We would like to note that, in general, our model often simplifies some of the bats’ abilities, under the assumption that if the simulated bats manage to perform this difficult task with simpler mechanisms, real, better-adapted bats will probably perform even better. This line of reasoning recurs in several of the answers below.

      Summary:

      The authors describe a model to mimic bat echolocation behavior and flight under high-density conditions and conclude that the problem of acoustic jamming is less severe than previously thought, conflating the success of their simulations (as described in the manuscript) with hard evidence for what real bats are actually doing. The authors base their model on two species of bats that fly at "high densities" (defined by the authors as colony sizes from tens to tens of thousands of individuals and densities of up to 33.3 bats/m²), Pipistrellus kuhlii and Rhinopoma microphyllum. This work fits into the broader discussion of bat sensorimotor strategies during collective flight, and simulations are important to try to understand bat behavior, especially given a lack of empirical data. However, I have major concerns about the assumptions of the parameters used for the simulation, which significantly impact both the results of the simulation and the conclusions that can be made from the data. These details are elaborated upon below, along with key recommendations the authors should consider to guide the refinement of the model.

      Strengths:

      This paper carries out a simulation of bat behavior in dense swarms as a way to explain how jamming does not pose a problem in dense groups. Simulations are important when we lack empirical data. The simulation aims to model two different species with different echolocation signals, which is very important when trying to model echolocation behavior. The analyses are fairly systematic in testing all ranges of parameters used and discussing the differential results.

      Weaknesses:

      The justification for how the different foraging phase call types were chosen for different object detection distances in the simulation is unclear. Do these distances match those recorded from empirical studies, and if so, are they identical for both species used in the simulation?

      The distances at which bats transition between echolocation phases are identical for both species in our model (see Table 2). These distances are based on well-documented empirical studies of bat hunting and obstacle avoidance behavior (Griffin, Webster and Michael, 1958; Simmons and Kick, 1983; Schnitzler et al., 1987; Kalko, 1995; Hiryu et al., 2008; Vanderelst and Peremans, 2018). These references provide extensive evidence that insectivorous bats systematically adjust their echolocation calls in response to object proximity, following the characteristic phases of search, approach, and buzz.

      To improve clarity, we have updated the text to explicitly state that the phase transition distances are empirically grounded and apply equally to both modeled species (lines 430-447).

      What reasoning do the authors have for a bat using the same call characteristics to detect a cave wall as they would for detecting a small insect?

      In echolocating bats, call parameters are primarily shaped by the target distance and echo strength. Accordingly, there is little difference in call structure between prey capture and obstacle-related maneuvers, aside from intensity adjustments based on target strength (Hagino et al., 2007; Hiryu et al., 2008; Surlykke, Ghose and Moss, 2009; Kothari et al., 2014). In our study, due to the dense cave environment, the bats are found to operate in the approach phase nearly all the time, which is consistent with natural cave emergence, where they are navigating through a cluttered environment rather than engaging in open-space search. For one of the species (Rhinopoma microphyllum), we also have empirical recordings of individuals flying under similar conditions (Goldstein et al., 2024). Our model was designed to remain as simple as possible while relying on conservative assumptions that may underestimate bat performance. If, in reality, bats fine-tune their echolocation calls even earlier or more precisely during navigation than assumed, our model would still conservatively reflect their actual capabilities.

      We actually used logarithmically frequency-modulated (FM) chirps, generated using the MATLAB built-in function chirp(t, f0, t1, f1, 'logarithmic'). This method aligns with the nonlinear FM characteristics of Pipistrellus kuhlii (PK) and Rhinopoma microphyllum (RM) and provides a realistic approximation of their echolocation signals. We acknowledge that this was not sufficiently emphasized in the original text, and we have now explicitly highlighted this in the revised version to ensure clarity (see Lines 447-449 in Methods).
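      To make the waveform concrete, a logarithmic sweep has instantaneous frequency f(t) = f0·(f1/f0)^(t/t1), and its phase is the integral of that frequency. The sketch below is a standard-library Python rendering of this law, offered as an approximation of what a logarithmic chirp looks like rather than as the code used in the study. The sweep limits (80 to 40 kHz), the 3 msec duration, and the 500 kHz sample rate are assumed illustrative values, and `log_chirp` and `instantaneous_frequency` are hypothetical names.

```python
import math


def instantaneous_frequency(f0_hz: float, f1_hz: float,
                            duration_s: float, t_s: float) -> float:
    """Frequency of a logarithmic sweep at time t: f0 * (f1/f0)**(t/duration)."""
    return f0_hz * (f1_hz / f0_hz) ** (t_s / duration_s)


def log_chirp(f0_hz: float, f1_hz: float, duration_s: float,
              sample_rate_hz: float) -> list[float]:
    """Cosine samples of a logarithmic FM sweep from f0 to f1."""
    if f0_hz <= 0 or f1_hz <= 0 or f0_hz == f1_hz:
        raise ValueError("f0 and f1 must be positive and different")
    ratio = f1_hz / f0_hz
    n_samples = int(round(duration_s * sample_rate_hz))
    samples = []
    for i in range(n_samples):
        t = i / sample_rate_hz
        # Phase = 2*pi * integral of the instantaneous frequency from 0 to t.
        phase = (2.0 * math.pi * f0_hz * duration_s / math.log(ratio)
                 * (ratio ** (t / duration_s) - 1.0))
        samples.append(math.cos(phase))
    return samples


# Assumed example: a 3 ms downward sweep from 80 kHz to 40 kHz.
call = log_chirp(80e3, 40e3, 0.003, 500e3)
mid = instantaneous_frequency(80e3, 40e3, 0.003, 0.0015)
print(len(call), "samples; mid-call frequency", round(mid), "Hz")
```

      In a logarithmic sweep the mid-call frequency is the geometric mean of the start and end frequencies, so the sweep spends more time near the lower frequency than a linear sweep would.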

      The two species modeled have different calls. In particular, the bandwidth varies by a factor of 10, meaning the species' sonars will have different spatial resolutions. Range resolution is about 10x better for PK compared to RM, but the authors appear to use the same thresholds for "correct detection" for both, which doesn't seem appropriate.

      The detection process in our model is based on Saillant’s method using a filter bank, as detailed in the paper (Saillant et al., 1993; Neretti et al., 2003; Sanderson et al., 2003). This approach inherently incorporates the advantages of a wider bandwidth, meaning that the differences in range resolution between the species are already accounted for within the signal-processing framework. Thus, there is no need to explicitly adjust the model parameters for bandwidth variations, as these effects emerge from the applied method.

      Also, the authors did not mention incorporating/correcting for/exploiting Doppler, which leads me to assume they did not model it.

      The reviewer is correct. To maintain model simplicity, we did not incorporate the Doppler effect or its impact on echolocation. The exclusion of Doppler effects was based on the assumption that while Doppler shifts can influence frequency perception, their impact on jamming and overall navigation performance is minor within the modelled context.

      The maximal Doppler shifts expected for the bats in this scenario are ~1 kHz. These shifts would be applied variably across signals due to the semi-random relative velocities between bats, leading to a mixed effect on frequency changes. This variability would likely result in an overall reduction in jamming rather than exacerbating it, aligning with our previous statement that our model may overestimate the severity of acoustic interference. Such Doppler shifts would result in errors of 2-4 cm in localization (i.e., 200-400 microseconds) (Boonman, Parsons and Jones, 2003).

      We have now explicitly highlighted this in the revised version (see Lines 468-470).
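      The order of magnitude of such shifts can be checked with the classical Doppler formula for a moving source and a moving receiver, f' = f·(c + v_receiver) / (c − v_source), with speeds taken as positive when the two approach each other. The sketch below is a back-of-the-envelope illustration only; the 30 kHz call frequency, the 5 m/s flight speeds, and the speed of sound are assumed values that do not come from the manuscript, and `doppler_shift_hz` is a hypothetical helper.

```python
SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 °C


def doppler_shift_hz(freq_hz: float, source_speed_ms: float,
                     receiver_speed_ms: float,
                     c: float = SPEED_OF_SOUND) -> float:
    """Frequency shift heard by the receiver.

    Speeds are the components along the line joining the two animals,
    positive when moving toward each other.
    """
    if source_speed_ms >= c:
        raise ValueError("source speed must be below the speed of sound")
    received = freq_hz * (c + receiver_speed_ms) / (c - source_speed_ms)
    return received - freq_hz


# Assumed example: two bats flying head-on, each at 5 m/s, 30 kHz call.
shift = doppler_shift_hz(30e3, 5.0, 5.0)
print(f"shift = {shift:.0f} Hz")
```

      Under these assumptions the shift is roughly 0.9 kHz, the same order as the ~1 kHz quoted above; bats flying apart or at oblique angles would experience smaller or negative shifts.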

      The success of the simulation may very well be due to variation in the calls of the bats, which ironically enough demonstrates the importance of a jamming avoidance response in dense flight. This explains why the performance of the simulation falls when bats are not able to distinguish their own echoes from other signals. For example, in Figure C2, there are calls that are labeled as conspecific calls and have markedly shorter durations and wider bandwidths than others. These three phases for call types used by the authors may be responsible for some (or most) of the performance of the model since the correlation between different call types is unlikely to exceed the detection threshold. But it turns out this variation in and of itself is what a jamming avoidance response may consist of. So, in essence, the authors are incorporating a jamming avoidance response into their simulation.

      We fully agree that the natural variations in call design between the phases contribute significantly to interference reduction (see our discussion in a previous paper in Mazar & Yovel, 2020). However, we emphasize that this cannot be classified as a Jamming Avoidance Response (JAR). In our model, bats respond only to the physical presence of objects and not to the acoustic environment or interference itself. There is no active or adaptive adjustment of call design to minimize jamming beyond the natural phase-dependent variations in call structure. Therefore, while variation in call types does inherently reduce interference, this effect emerges passively from the modeled behavior rather than as an intentional strategy to avoid jamming.

      The authors claim that integration over multiple pings (though I was not able to determine the specifics of this integration algorithm) reduces the masking problem. Indeed, it should: if you have two chances at detection, you've effectively increased your SNR by 3dB.

      The reviewer is correct. Indeed, integration over multiple calls improves signal-to-noise ratio (SNR), effectively increasing it by approximately 3 dB per doubling of observations. The specifics of the integration algorithm are detailed in the Methods section, where we describe how sensory information is aggregated across multiple time steps to enhance detection reliability.
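      The "3 dB per doubling" figure follows from the idealized integration gain of 10·log10(N) dB for N observations. The sketch below simply evaluates that expression; it assumes the echo is consistent across calls while the noise is independent from call to call, which is an idealization and not a description of the model's integration algorithm. The name `integration_gain_db` is hypothetical.

```python
import math


def integration_gain_db(n_calls: int) -> float:
    """Idealized SNR gain (dB) from integrating n_calls independent observations."""
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    return 10.0 * math.log10(n_calls)


for n in (1, 2, 5, 10):
    print(f"{n:2d} calls -> {integration_gain_db(n):.2f} dB")
```

      Two calls give about 3 dB and ten calls give 10 dB, so under this idealization most of the benefit is obtained within the first few calls.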

      They also claim - although it is almost an afterthought - that integration dramatically reduces the degradation caused by false echoes. This also makes sense: from one ping to the next, the bat's own echo delays will correlate extremely well with the bat's flight path. Echo delays due to conspecifics will jump around kind of randomly. However, the main concern is regarding the time interval and number of pings of the integration, especially in the context of the bat's flight speed. The authors say that a 1s integration interval (5-10 pings) dramatically reduces jamming probability and echo confusion. This number of pings isn't very high, and it occurs over a time interval during which the bat has moved 5-10m. This distance is large compared to the 0.4m distance-to-obstacle that triggers an evasive maneuver from the bat, so integration should produce a latency in navigation that significantly hinders the ability to avoid obstacles. Can the authors provide statistics that describe this latency, and discussion about why it doesn't seem to be a problem?

      As described in the Methods section, the bat’s collision avoidance response does not solely rely on the integration process. Instead, the model incorporates real-time echoes from the last calls, which are used independently of the integration process for immediate obstacle avoidance maneuvers. This ensures that bats can react to nearby obstacles without being hindered by the integration latency. The slower integration, on the other hand, is used for clustering, outlier removal, and estimation of wall directions to support the pathfinding process, as illustrated in Supplementary Figure 1.

      Additionally, our model assumes that bats store the physical positions of echoes in an allocentric coordinate system (x-y). The integration occurs after transforming these detections from a local relative reference frame to a global spatial representation. This allows for stable environmental mapping while maintaining responsiveness to immediate changes in the bat’s surroundings.

      See lines 518-523 in the revised version.
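      The transformation from a bat-centred detection to an allocentric (x-y) position can be sketched as a rotation by the bat's heading followed by a translation to its position. This is a generic geometric illustration under assumed conventions (bearing measured counter-clockwise from the flight direction, heading measured from the +x axis), not the authors' code; `to_allocentric` is a hypothetical name.

```python
import math


def to_allocentric(bat_x: float, bat_y: float, heading_rad: float,
                   range_m: float, bearing_rad: float) -> tuple[float, float]:
    """Map an echo at (range, bearing) relative to the bat to global x-y."""
    angle = heading_rad + bearing_rad  # direction of the echo in the global frame
    return (bat_x + range_m * math.cos(angle),
            bat_y + range_m * math.sin(angle))


# Assumed example: bat at (1, 2) flying along +y, echo 3 m straight ahead.
print(to_allocentric(1.0, 2.0, math.pi / 2, 3.0, 0.0))
```

      Because detections from successive calls are expressed in the same global frame, echoes from a static wall accumulate at consistent positions even as the bat moves, which is what makes integration across calls meaningful.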

      The authors are using a 2D simulation, but this very much simplifies the challenge of a 3D navigation task, and there is no explanation as to why this is appropriate. Bat densities and bat behavior are discussed per unit area when realistically it should be per unit volume. In fact, the authors reference studies to justify the densities used in the simulation, but these studies were done in a 3D world. If the authors have justification for why it is realistic to model a 3D world in a 2D simulation, I encourage them to provide references justifying this approach.

      We acknowledge that this is a simplification; however, from an echolocation perspective, a 2D framework represents a worst-case scenario in terms of bat densities and maneuverability:

      · Higher Effective Density: A 2D model forces all bats into a single plane rather than distributing them through a 3D volume, increasing the likelihood of overlap in calls and echoes and making jamming more severe. As described in the text, the average distance to the nearest bat in our simulation is 0.27 m (with 100 bats), whereas reported distances in very dense colonies are 0.5 m, as observed in Myotis grisescens and Tadarida brasiliensis (Sabol and Hudson, 1995; Betke et al., 2008; Gillam et al., 2010; Fujioka et al., 2021).

      · Reduced Maneuverability: In 3D space, bats can use vertical movement to avoid obstacles and conspecifics. A 2D constraint eliminates this degree of freedom, increasing collision risk and limiting escape options.

      Thus, our 2D model provides a conservative, difficult test case, ensuring that our findings are valid under conditions where jamming and collision risks are maximized. Additionally, the 2D framework is computationally efficient, allowing us to perform multiple simulation runs to explore a broad parameter space and systematically test the impact of different variables.

      To address the reviewer’s concern, we have clarified this justification in the revised text and will provide supporting references where applicable (see Methods, lines 407-412).

      The focus on "masking" (which appears to be just in-band noise), especially relative to the problem of misassigned echoes, is concerning. If the bat calls are all the same waveform (downsweep linear FM of some duration, I assume - it's not clear from the text), false echoes would be a major problem. Masking, as the authors define it, just reduces SNR. This reduction is something like sqrt(N), where N is the number of conspecifics whose echoes are audible to the bat, so this allows the detection threshold to be set lower, increasing the probability that a bat's echo will exceed a detection threshold. False echoes present a very different problem. They do not reduce SNR per se, but rather they cause spurious threshold excursions (N of them!) that the bat cannot help but interpret as obstacle detection. I would argue that in dense groups the mis-assignment problem is much more important than the SNR problem.

      There is substantial literature supporting the assumption that bats can recognize their own echoes and distinguish them from conspecific signals (Schnitzler and Bioscience, 2001; Kazial, Burnett and Masters, 2001; Burnett and Masters, 2002; Kazial, Kenny and Burnett, 2008; Chili, Xian and Moss, 2009; Yovel et al., 2009; Beetz and Hechavarría, 2022). However, we acknowledge that false echoes may present a major challenge in dense groups. To address this, we explicitly tested the impact of the self-echo identification assumption in our study (see Results, Figure 4: The impact of confusion on performance, and lines 345-355 in the Discussion).

      Furthermore, we examined a full confusion scenario, where all reflected echoes from conspecifics were misinterpreted as obstacle reflections (i.e., 100% confusion). Our results show that this significantly degrades navigation performance, supporting the argument that echo misassignment is a critical issue. However, we also explored a simple mitigation strategy based on temporal integration with outlier rejection, which provided some improvement in performance. This suggests that real bats may possess additional mechanisms to enhance self-echo identification and reduce false detections. See lines XX in the manuscript for further discussion.
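      The general idea of temporal integration with outlier rejection can be illustrated as follows: detections accumulated over several calls are kept only if enough other detections fall nearby, so that isolated, inconsistent points (such as spurious echoes attributed to conspecifics) are discarded. This is a generic density-based sketch, not the algorithm used in the model; the function name `reject_outliers`, the 0.3 m radius, and the two-neighbour requirement are all assumptions chosen for illustration.

```python
import math


def reject_outliers(points: list[tuple[float, float]],
                    radius_m: float = 0.3,
                    min_neighbors: int = 2) -> list[tuple[float, float]]:
    """Keep detections that have at least min_neighbors other detections
    within radius_m; isolated detections are treated as outliers."""
    kept = []
    for i, (xi, yi) in enumerate(points):
        neighbors = sum(
            1
            for j, (xj, yj) in enumerate(points)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius_m
        )
        if neighbors >= min_neighbors:
            kept.append((xi, yi))
    return kept


# Assumed example: three consistent wall detections and one isolated point.
detections = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
print(reject_outliers(detections))
```

      In this toy example the three clustered detections survive and the isolated one is removed; with a single call in the window no detection would have neighbours, which is why such a filter only becomes useful once several calls are integrated.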

      The criteria set for flight behavior (lines 393-406) are not justified with any empirical evidence of the flight behavior of wild bats in collective flight. How did the authors determine the avoidance distances? Also, what is the justification for the time limit of 15 seconds to emerge from the opening? Instead of an exit probability, why not instead use a time criterion, similar to "How long does it take X% of bats to exit?"

      While we acknowledge that wild bats may employ more complex behaviors for collision avoidance, we chose to implement a simplified decision-making rule in our model to maintain computational tractability.

      The avoidance distances (1.5 m from walls and 0.4 m from other bats) were selected as internal parameters to ensure coherent flight trajectories while maintaining a reasonable collision rate. These distances provide a balance between maneuverability and stability, preventing erratic flight patterns while still enabling effective obstacle avoidance. In the revised paper, we have added supplementary figures illustrating the effect of model parameters on performance, specifically focusing on the avoidance distance.

      The 15-second exit limit was determined as described in the text (Lines 403-404): “A 15-second window was chosen because it is approximately twice the average exit time for 40 bats and allows for a second corrective maneuver if needed.” In other words, it allowed each bat to circle the ‘cave’ twice to exit even in the most crowded environment. This threshold was set to keep simulation time reasonable while allowing sufficient time for most bats to exit successfully.

      We acknowledge that the alternative approach suggested by the reviewer, measuring the time taken for a certain percentage of bats to exit, is also valid. However, in our model, some outlier bats fail to exit and continue flying for many minutes. Such runs would lead to excessive simulation times, making it difficult to generate repetitions, while teaching us little: they usually resulted from the bat slightly missing the opening (see Video S1). Our chosen approach ensures practical runtime constraints while still capturing relevant performance metrics.

      What is the empirical justification for the 1-10 calls used for integration?

      The "average exit time for 40 bats" is also confusing and not well explained. Was this determined empirically? From the simulation? If the latter, what are the conditions? Does it include masking, no masking, or which species?

      Previous studies have demonstrated that bats integrate acoustic information received sequentially over several echolocation calls (2-15), effectively constructing an auditory scene in complex environments (Ulanovsky and Moss, 2008; Chili, Xian and Moss, 2009; Moss and Surlykke, 2010; Yovel and Ulanovsky, 2017; Salles, Diebold and Moss, 2020). Additionally, bats are known to produce echolocation sound groups when spatiotemporal localization demands are high (Kothari et al., 2014). Studies have documented call sequences ranging from 2 to 15 grouped calls (Moss et al., 2010), and it has been hypothesized that grouping facilitates echo segregation.

      We did not use a single integration window; we tested integration sizes between 1 and 10 calls and presented the results in Figure 3A. This range was chosen based on prior empirical findings and to explore how different levels of temporal aggregation impact navigation performance. Indeed, the results showed that performance levels off with an integration window of 5-10 calls (Figure 3A).

      Regarding the average exit time for 40 bats, this value was determined from our simulations, where it represents the mean time for successful exits under standard conditions with masking.

      We have revised the text to clarify these details (see line 466).

      References:

      Aidan, Y., Bleichman, I. and Ayali, A. (2024) ‘Pausing to swarm: locust intermittent motion is instrumental for swarming-related visual processing’, Biology letters, 20(2), p. 20230468. Available at: https://doi.org/10.1098/rsbl.2023.0468.

      Attanasi, A. et al. (2014) ‘Collective Behaviour without Collective Order in Wild Swarms of Midges’, PLOS Computational Biology. Edited by T. Vicsek, 10(7). Available at: https://doi.org/10.1371/journal.pcbi.1003697.

      Bastien, R. and Romanczuk, P. (2020) ‘A model of collective behavior based purely on vision’, Science Advances, 6(6). Available at: https://doi.org/10.1126/sciadv.aay0792.

      Beetz, M.J. and Hechavarría, J.C. (2022) ‘Neural Processing of Naturalistic Echolocation Signals in Bats’, Frontiers in Neural Circuits, 16, p. 899370. Available at: https://doi.org/10.3389/FNCIR.2022.899370.

      Betke, M. et al. (2008) ‘Thermal Imaging Reveals Significantly Smaller Brazilian Free-Tailed Bat Colonies Than Previously Estimated’, Journal of Mammalogy, 89(1), pp. 18–24. Available at: https://doi.org/10.1644/07-MAMM-A-011.1.

      Bialek, W. et al. (2012) ‘Statistical mechanics for natural flocks of birds’, Proceedings of the National Academy of Sciences, 109(13), pp. 4786–4791. Available at: https://doi.org/10.1073/PNAS.1118633109.

      Bode, N.W.F., Franks, D.W. and Wood, A.J. (2011) ‘Limited interactions in flocks: Relating model simulations to empirical data’, Journal of the Royal Society Interface, 8(55), pp. 301–304. Available at: https://doi.org/10.1098/RSIF.2010.0397.

      Boerma, D.B. et al. (2019) ‘Wings as inertial appendages: How bats recover from aerial stumbles’, Journal of Experimental Biology, 222(20). Available at: https://doi.org/10.1242/JEB.204255.

      Boonman, A.M., Parsons, S. and Jones, G. (2003) ‘The influence of flight speed on the ranging performance of bats using frequency modulated echolocation pulses’, The Journal of the Acoustical Society of America, 113(1), p. 617. Available at: https://doi.org/10.1121/1.1528175.

      Burnett, S.C. and Masters, W.M. (2002) ‘Identifying Bats Using Computerized Analysis and Artificial Neural Networks’, North American Symposium on Bat Research, 9.

      Cheraghi, A.R., Shahzad, S. and Graffi, K. (2022) ‘Past, Present, and Future of Swarm Robotics’, in Lecture Notes in Networks and Systems. Available at: https://doi.org/10.1007/978-3-030-82199-9_13.

      Chili, C., Xian, W. and Moss, C.F. (2009) ‘Adaptive echolocation behavior in bats for the analysis of auditory scenes’, Journal of Experimental Biology, 212(9), pp. 1392–1404. Available at: https://doi.org/10.1242/jeb.027045.

      Couzin, I.D. et al. (2002) ‘Collective Memory and Spatial Sorting in Animal Groups’, Journal of Theoretical Biology, 218(1), pp. 1–11. Available at: https://doi.org/10.1006/jtbi.2002.3065.

      Couzin, I.D. et al. (2005) ‘Effective leadership and decision-making in animal groups on the move’, Nature, 433(7025), pp. 513–516. Available at: https://doi.org/10.1038/nature03236.

      Davidson, J.D. et al. (2021) ‘Collective detection based on visual information in animal groups’, Journal of the Royal Society Interface, 18(180), p. 20210142. Available at: https://doi.org/10.1098/rsif.2021.0142.

      Faria Dias, P.G. et al. (2021) ‘Swarm robotics: A perspective on the latest reviewed concepts and applications’, Sensors. Available at: https://doi.org/10.3390/s21062062.

      Fujioka, E. et al. (2021) ‘Three-Dimensional Trajectory Construction and Observation of Group Behavior of Wild Bats During Cave Emergence’, Journal of Robotics and Mechatronics, 33(3), pp. 556–563. Available at: https://doi.org/10.20965/jrm.2021.p0556.

      Gautrais, J. et al. (2012) ‘Deciphering Interactions in Moving Animal Groups’, PLOS Computational Biology, 8(9), p. e1002678. Available at: https://doi.org/10.1371/JOURNAL.PCBI.1002678.

      Gillam, E.H. et al. (2010) ‘Echolocation behavior of Brazilian free-tailed bats during dense emergence flights’, Journal of Mammalogy, 91(4), pp. 967–975. Available at: https://doi.org/10.1644/09-MAMM-A-302.1.

      Goldstein, A. et al. (2024) ‘Collective Sensing – On-Board Recordings Reveal How Bats Maneuver Under Severe Acoustic Interference’, Under Review, pp. 1–25.

      Griffin, D.R., Webster, F.A. and Michael, C.R. (1958) ‘The echolocation of flying insects by bats’, Animal Behaviour, 8(3–4).

      Hagino, T. et al. (2007) ‘Adaptive SONAR sounds by echolocating bats’, International Symposium on Underwater Technology, UT 2007 - International Workshop on Scientific Use of Submarine Cables and Related Technologies 2007, pp. 647–651. Available at: https://doi.org/10.1109/UT.2007.370829.

      Hiryu, S. et al. (2008) ‘Adaptive echolocation sounds of insectivorous bats, Pipistrellus abramus, during foraging flights in the field’, The Journal of the Acoustical Society of America, 124(2), pp. EL51–EL56. Available at: https://doi.org/10.1121/1.2947629.

      Jakobsen, L. et al. (2024) ‘Velocity as an overlooked driver in the echolocation behavior of aerial hawking vespertilionid bats’. Available at: https://doi.org/10.1016/j.cub.2024.12.042.

      Jakobsen, L., Brinkløv, S. and Surlykke, A. (2013) ‘Intensity and directionality of bat echolocation signals’, Frontiers in Physiology, 4 APR(April), pp. 1–9. Available at: https://doi.org/10.3389/fphys.2013.00089.

      Jakobsen, L. and Surlykke, A. (2010) ‘Vespertilionid bats control the width of their biosonar sound beam dynamically during prey pursuit’, 107(31). Available at: https://doi.org/10.1073/pnas.1006630107.

      Jhawar, J. et al. (2020) ‘Noise-induced schooling of fish’, Nature Physics 2020 16:4, 16(4), pp. 488–493. Available at: https://doi.org/10.1038/s41567-020-0787-y.

      Kalko, E.K. V. (1995) ‘Insect pursuit, prey capture and echolocation in pipistrelle bats (Microchirptera)’, Animal Behaviour, 50(4), pp. 861–880.

      Kazial, K.A., Burnett, S.C. and Masters, W.M. (2001) ‘ Individual and Group Variation in Echolocation Calls of Big Brown Bats, Eptesicus Fuscus (Chiroptera: Vespertilionidae) ’, Journal of Mammalogy, 82(2), pp. 339–351. Available at: https://doi.org/10.1644/1545-1542(2001)082<0339:iagvie>2.0.co;2.

      Kazial, K.A., Kenny, T.L. and Burnett, S.C. (2008) ‘Little brown bats (Myotis lucifugus) recognize individual identity of conspecifics using sonar calls’, Ethology, 114(5), pp. 469–478. Available at: https://doi.org/10.1111/j.1439-0310.2008.01483.x.

      Kothari, N.B. et al. (2014) ‘Timing matters: Sonar call groups facilitate target localization in bats’, Frontiers in Physiology, 5 MAY. Available at: https://doi.org/10.3389/fphys.2014.00168.

      Moss, C.F. and Surlykke, A. (2010) ‘Probing the natural scene by echolocation in bats’, Frontiers in Behavioral Neuroscience. Available at: https://doi.org/10.3389/fnbeh.2010.00033.

      Nagy, M. et al. (2010) ‘Hierarchical group dynamics in pigeon flocks’, Nature 2010 464:7290, 464(7290), pp. 890–893. Available at: https://doi.org/10.1038/nature08891.

      Neretti, N. et al. (2003) ‘Time-frequency model for echo-delay resolution in wideband biosonar’, The Journal of the Acoustical Society of America, 113(4), pp. 2137–2145. Available at: https://doi.org/10.1121/1.1554693.

      Parrish, J.K. and Edelstein-Keshet, L. (1999) ‘Complexity, Pattern, and Evolutionary Trade-Offs in Animal Aggregation’, Science, 284(5411), pp. 99–101. Available at: https://doi.org/10.1126/SCIENCE.284.5411.99.

      Partridge, B.L. (1982) ‘The Structure and Function of Fish Schools’, 246(6), pp. 114–123. Available at: https://doi.org/10.2307/24966618.

      Pearce, D.J.G. et al. (2014) ‘Role of projection in the control of bird flocks’, Proceedings of the National Academy of Sciences of the United States of America, 111(29), pp. 10422–10426. Available at: https://doi.org/10.1073/pnas.1402202111.

      Pitcher, T.J., Partridge, B.L. and Wardle, C.S. (1976) ‘A blind fish can school’, Science, 194(4268), pp. 963–965. Available at: https://doi.org/10.1126/science.982056.

      Rosenthal, S.B., Twomey, C.R., Hartnett, A.T., Wu, H.S., Couzin, I.D., et al. (2015) ‘Revealing the hidden networks of interaction in mobile animal groups allows prediction of complex behavioral contagion’, Proceedings of the National Academy of Sciences of the United States of America, 112(15), pp. 4690–4695. Available at: https://doi.org/10.1073/pnas.1420068112.

      Rosenthal, S.B., Twomey, C.R., Hartnett, A.T., Wu, H.S. and Couzin, I.D. (2015) ‘Revealing the hidden networks of interaction in mobile animal groups allows prediction of complex behavioral contagion’, Proceedings of the National Academy of Sciences of the United States of America, 112(15), pp. 4690–4695. Available at: https://doi.org/10.1073/PNAS.1420068112/-/DCSUPPLEMENTAL/PNAS.1420068112.SAPP.PDF.

      Roy, S. et al. (2019) ‘Extracting interactions between flying bat pairs using model-free methods’, Entropy, 21(1). Available at: https://doi.org/10.3390/e21010042.

      Sabol, B.M. and Hudson, M.K. (1995) ‘Technique using thermal infrared-imaging for estimating populations of gray bats’, Journal of Mammalogy, 76(4). Available at: https://doi.org/10.2307/1382618.

      Saillant, P.A. et al. (1993) ‘A computational model of echo processing and acoustic imaging in frequency- modulated echolocating bats: The spectrogram correlation and transformation receiver’, The Journal of the Acoustical Society of America, 94(5). Available at: https://doi.org/10.1121/1.407353.

      Salles, A., Diebold, C.A. and Moss, C.F. (2020) ‘Echolocating bats accumulate information from acoustic snapshots to predict auditory object motion’, Proceedings of the National Academy of Sciences of the United States of America, 117(46), pp. 29229–29238. Available at: https://doi.org/10.1073/PNAS.2011719117/SUPPL_FILE/PNAS.2011719117.SAPP.PDF.

      Sanderson, M.I. et al. (2003) ‘Evaluation of an auditory model for echo delay accuracy in wideband biosonar’, The Journal of the Acoustical Society of America, 114(3), pp. 1648–1659. Available at: https://doi.org/10.1121/1.1598195.

      Schnitzler, H., Bioscience, E.K.- and 2001‏, undefined (no date) ‘Echolocation by insect-eating bats: we define four distinct functional groups of bats and find differences in signal structure that correlate with the typical echolocation ‏’, academic.oup.com‏HU Schnitzler, EKV Kalko‏Bioscience, 2001‏•academic.oup.com‏ [Preprint]. Available at: https://academic.oup.com/bioscience/article-abstract/51/7/557/268230 (Accessed: 17 March 2025).

      Schnitzler, H.-U. et al. (1987) ‘The echolocation and hunting behavior of the bat,Pipistrellus kuhli’, Journal of Comparative Physiology A, 161(2), pp. 267–274. Available at: https://doi.org/10.1007/BF00615246.

      Simmons, J.A. and Kick, S.A. (1983) ‘Interception of Flying Insects by Bats’, Neuroethology and Behavioral Physiology, pp. 267–279. Available at: https://doi.org/10.1007/978-3-642-69271-0_20.

      Strandburg-Peshkin, A. et al. (2013) ‘Visual sensory networks and effective information transfer in animal groups’, Current Biology. Cell Press. Available at: https://doi.org/10.1016/j.cub.2013.07.059.

      Sumpter, D.J.T. et al. (2008) ‘Consensus Decision Making by Fish’, Current Biology, 18(22), pp. 1773–1777. Available at: https://doi.org/10.1016/J.CUB.2008.09.064.

      Surlykke, A., Ghose, K. and Moss, C.F. (2009) ‘Acoustic scanning of natural scenes by echolocation in the big brown bat, Eptesicus fuscus’, Journal of Experimental Biology, 212(7), pp. 1011–1020. Available at: https://doi.org/10.1242/JEB.024620.

      Theriault, D.H. et al. (no date) ‘Reconstruction and analysis of 3D trajectories of Brazilian free-tailed bats in flight‏’, cs-web.bu.edu‏ [Preprint]. Available at: https://cs-web.bu.edu/faculty/betke/papers/2010-027-3d-bat-trajectories.pdf (Accessed: 4 May 2023).

      Ulanovsky, N. and Moss, C.F. (2008) ‘What the bat’s voice tells the bat’s brain’, Proceedings of the National Academy of Sciences of the United States of America, 105(25), pp. 8491–8498. Available at: https://doi.org/10.1073/pnas.0703550105.

      Vanderelst, D. and Peremans, H. (2018) ‘Modeling bat prey capture in echolocating bats : The feasibility of reactive pursuit’, Journal of theoretical biology, 456, pp. 305–314.

      Youssefi, K.A.R. and Rouhani, M. (2021) ‘Swarm intelligence based robotic search in unknown maze-like environments’, Expert Systems with Applications, 178. Available at: https://doi.org/10.1016/j.eswa.2021.114907.

      Yovel, Y. et al. (2009) ‘The voice of bats: How greater mouse-eared bats recognize individuals based on their echolocation calls’, PLoS Computational Biology, 5(6). Available at: https://doi.org/10.1371/journal.pcbi.1000400.

      Yovel, Y. and Ulanovsky, N. (2017) ‘Bat Navigation’, The Curated Reference Collection in Neuroscience and Biobehavioral Psychology, pp. 333–345. Available at: https://doi.org/10.1016/B978-0-12-809324-5.21031-6.

    1. Author response:

      We thank the reviewers for their thorough evaluation and constructive feedback on our manuscript.

      We think that their valuable suggestions will strengthen the manuscript and help us clarify several important points.

      All reviewers acknowledged the importance of our theoretical results and network classification in making pattern formation analysis a more tractable problem. At the same time, they have also raised a number of important concerns that we shall carefully consider.

      A. A major clarification that the reviewers found important concerns the definition of non-trivial pattern transformations and its generalization to higher dimensions. In this regard, the reviewers’ comments are:

      Reviewer #1:

      (on non-trivial pattern transformations):

      (3) All modelling is confined to one spatial dimension, and the very definition of a "non-trivial" transformation is framed in terms of peak positions along a line, which clearly must be reformulated for higher dimensions. It's well-known that diffusions in 1, 2, and 3 dimensions are also dramatically different, so the relevance of the three-class taxonomy to real multicellular tissues remains unclear, or at least should be explained in more detail.

      Reviewer #2:

      (on non-trivial pattern transformations):

      (5) The definition of non-trivial pattern formation is provided only in the Supplementary Information, despite its central importance for interpreting the main results. It would significantly improve clarity if this definition were included and explained in the main text. Additionally, it remains unclear how the definition is consistently applied across the different initial conditions. In particular, the authors should clarify how slope-based measures are determined for both the random noise and sharp peak/step function initial states. Furthermore, the authors do not specify how the sign function is evaluated at zero. If the standard mathematical definition sgn(0)=0 is used, then even a simple widening of a peak could fulfill the criterion for nontrivial pattern transformation.

      We agree with Reviewer #2 that including a more detailed definition of non-trivial pattern transformation in the main text would enhance the clarity of the paper. The one-dimensional (1D) definition currently provided in the Supplementary Information was chosen because all computations presented therein involve exclusively one-dimensional patterns. However, we acknowledge that this definition, as written, does not generalize unambiguously to higher dimensions. Therefore, in a revised version of the manuscript, we will incorporate an expanded definition applicable to higher-dimensional cases.

      This general definition of a non-trivial pattern transformation should make no reference to the sign of spatial derivatives of either the initial or resulting patterns. Specifically, a pattern transformation is considered non-trivial if it satisfies the following criteria:

      - It is heterogeneous: The resulting pattern is heterogeneous in space.

      - It is rearranging: The arrangement of critical points (i.e. peaks, valleys and saddle points in a gene product concentration) along the domain in the resulting pattern of a gene product is different to the arrangement of critical points in its initial pattern. This includes the emergence of new critical points, the disappearance of existing ones, or the spatial displacement of critical points from one location to another.

      - It is non-replicating: The spatial arrangement of critical points in the pattern of one gene product must differ from that of any other upstream gene product.

      Nonetheless, our two initial patterns are spatially discontinuous functions: in homogeneous initial patterns, the white noise is discontinuous by definition; and for the spike and spike+homogeneous initial patterns, we use sharp spikes defined by the rectangular function, which is discontinuous at the spike boundaries. Therefore, the aforementioned definition should be supplemented with the following two ad hoc assumptions:

      - Homogeneous initial patterns do not comprise any critical point. White noise in this type of initial patterns represents small thermodynamic fluctuations around the steady state and, for the purpose of pattern transformation, this is equivalent to a constant concentration along the domain.

      - Spike and spike+homogeneous initial patterns each contain a single critical point located at the center of the spike. The sharp spikes, modeled using the rectangular function, serve as a theoretical idealization to facilitate mathematical analysis. Once diffusion begins to act, these sharp boundaries are smoothed into differentiable gradients, maintaining a unique critical point at the center of the initial spike, which is the most relevant information for pattern transformation.

      Finally, it is worth recalling that our gene network classification is fundamentally based on an analysis of the dispersion relation associated with the gene network, and that the construction of this dispersion relation is independent of the spatial dimensionality of the domain (i.e. it does not require assuming any specific number of dimensions). Describing this dispersion relation only in the SI may have made the article harder to follow, so the description will be moved to the main text in an upcoming version. Thus, the gene networks that can lead to pattern transformation are the same in 1D, 2D or 3D. As for the resulting patterns, the broad description we provide also applies to any number of dimensions: these would be periodic, non-periodic as in the amplified noise patterns, or non-periodic as in the hierarchic networks. For the latter, notice that, except for boundary effects that we discuss later, the spike initial condition is radially symmetric and thus the patterns resulting from it will also be radially symmetric. We will make this point more explicit in a revised version of the article, especially since, as suggested, this important portion of the Supplementary Information will be incorporated into the main text.
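
      The dimension-independence of the dispersion relation can be sketched with the standard linear stability calculation. The derivation below assumes that Equation (1) has the generic reaction-diffusion form shown in its first line; the symbols $\mathbf{D}$ (diagonal diffusion matrix), $\mathbf{g}^*$ (homogeneous steady state), $\delta\mathbf{g}$ (perturbation amplitude), $\mathbf{k}$ (wavevector) and $\lambda$ (growth rate) are notation introduced here for illustration and may differ from the notation in the manuscript.

```latex
\begin{align}
\partial_t \mathbf{g} &= \mathbf{f}(\mathbf{g}) + \mathbf{D}\,\nabla^2 \mathbf{g}, \\
\mathbf{g}(\mathbf{x},t) &= \mathbf{g}^* + \delta\mathbf{g}\; e^{\lambda t + i\,\mathbf{k}\cdot\mathbf{x}},
\qquad \mathbf{f}(\mathbf{g}^*) = \mathbf{0}, \\
\lambda\,\delta\mathbf{g} &= \left(\mathbf{J} - |\mathbf{k}|^2\,\mathbf{D}\right)\delta\mathbf{g}
\;\;\Longrightarrow\;\;
\det\!\left(\mathbf{J} - |\mathbf{k}|^2\,\mathbf{D} - \lambda\,\mathbf{I}\right) = 0 .
\end{align}
```

      Because $\nabla^2 e^{i\mathbf{k}\cdot\mathbf{x}} = -|\mathbf{k}|^2 e^{i\mathbf{k}\cdot\mathbf{x}}$ in any number of dimensions, the spatial dimension enters only through $|\mathbf{k}|^2$. The function $\lambda(|\mathbf{k}|)$ is therefore the same in 1D, 2D and 3D; on a bounded domain, only the set of admissible wavevectors differs.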

      Reviewer #2 suggests that, with our definition of non-trivial pattern transformation, the simple widening of a concentration peak would constitute a non-trivial pattern transformation. This is not the case, as already shown in the figures as an example, since in a widening there is no change in the position of the critical point. A different situation applies if a wide and completely flat concentration peak (i.e. a plateau) forms. As we will explain in the coming version, this is not possible because of requirement (R5).
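
      The "rearranging" criterion can be illustrated in 1D with a minimal sketch. The function `critical_points`, the tolerance `tol` and the test profiles are hypothetical names and values chosen for illustration only; they are not taken from the manuscript or its simulation code. Slopes smaller than `tol` are treated as flat, which mirrors the assumption that noise-level fluctuations do not count as critical points.

```python
import numpy as np

def critical_points(profile, tol=1e-9):
    """Return (index, kind) for interior peaks and valleys of a 1D profile.

    Slopes with magnitude below `tol` are treated as zero, so flat regions
    (or noise below the tolerance) produce no critical points.
    """
    d = np.diff(profile)
    s = np.sign(np.where(np.abs(d) < tol, 0.0, d))
    points = []
    for i in range(1, len(profile) - 1):
        left, right = s[i - 1], s[i]
        if left > 0 and right < 0:
            points.append((i, "peak"))
        elif left < 0 and right > 0:
            points.append((i, "valley"))
    return points

x = np.linspace(-10.0, 10.0, 201)

# A peak and a widened version of the same peak (same centre).
narrow = np.exp(-x**2 / (2 * 0.5**2))
wide = np.exp(-x**2 / (2 * 2.0**2))

# A periodic profile with several peaks and valleys.
periodic = 1.0 + np.cos(2 * np.pi * x / 5.0)

same_after_widening = critical_points(narrow) == critical_points(wide)
rearranged = critical_points(narrow) != critical_points(periodic)
```

      In this sketch, widening leaves the single critical point where it was, so it does not count as rearranging, whereas the periodic profile introduces new peaks and valleys and does.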

      We think that this clarification of the definition of non-trivial pattern transformation will also help clarify the next point (B below) since it would make it clearer that this article does not intend to explain which specific resulting pattern would arise from any given gene network.

      B. The main concern among these relates to the validity of our linearization of the model equations and the extension of the results obtained for the linear system to the fully nonlinear system. In this regard, the reviewers’ comments are:

      Reviewer #1:

      (on linearization):

      (2) A central step in the model formulation is the linearisation of the reaction term around a homogeneous steady state; higher-order kinetics, including ubiquitous bimolecular sinks such as A + B → AB, are simply collapsed into the Jacobian without any stated amplitude bound on the perturbations. Because the manuscript never analyses how far this assumption can be relaxed, the robustness of the three-class taxonomy under realistic nonlinear reactions or large spike amplitudes remains uncertain.

      Reviewer #2:

      (on linearization):

      (2) Most of the proofs presented in the Supplementary Information rely on linearized versions of the governing equations, and it remains unclear how these results extend to the fully nonlinear system. We are concerned that the generality of the conclusions drawn from the linear analysis may be overstated in the main text. For example, in Section S3, the authors introduce the concept of dynamic equivalence of transitive chains (Proposition S3.1) and intracellular transitive M-branching (Proposition S3.2), which pertains to the system's steady-state behavior. However, the proof is based solely on the linearized equations, without additional justification for why the result should hold in the presence of nonlinearities. Moreover, the linearized system is used to analyze the response to a "spike initial pattern of arbitrary height C" (SI Chapter S5.1), yet it is not clear how conclusions derived from the linear regime can be valid for large perturbations, where nonlinear effects are expected to play a significant role. We encourage the authors to clarify the assumptions under which the linearized analysis remains valid and to discuss the potential limitations of applying these results to the nonlinear regime.

      In this article, we address two main questions: first, which gene network topologies can give rise to non-trivial pattern transformations; and second, which broad types of resulting patterns these gene network topologies can give rise to. Thus, we do not intend to explain which exact resulting patterns would arise from any given gene network (i.e. a gene network topology with specific functions and interaction strengths or weights), a question for which non-linearities do indeed matter.

      For most known gene regulatory networks, available empirical information is typically limited to the nature of gene product regulations, indicating whether they act as activators or inhibitors, while details about the specific functional form of these regulations are rare. For instance, given two gene products, i and j, the network may indicate that i acts as an activator of j, implying that the concentration of j increases with that of i. However, this increase could follow a variety of functional forms: it may be quadratic (e.g., $f_j \propto g_i^2$), cubic (e.g., $f_j \propto g_i^3$), or any other function $f_j(g_i)$. As we explain in the description of our model, we restrict our study to functions with a monotonicity constraint: higher concentrations of i lead to increased production of j (i.e., $\partial f_j / \partial g_i > 0$). In other words, a given gene interaction is always either inhibitory or activatory; it does not change sign. This monotonicity constraint corresponds to requirement (R5) in our main text. This requirement is based on the biologically plausible idea that the complexity of gene regulation in development stems more from the topology of gene networks than from the complexity of the functions by which one gene product regulates another (i.e. we use simple monotonic functions).

      Question 1: A critical element for understanding question 1 is the dispersion relation, which was explained in the SI. From the reviewers’ comments it is clear that moving this crucial part to the main text in an upcoming version of the article would make the argument easier to follow, especially for question 1.

      In brief, any pattern transformation requires the initial pattern to change. The trigger of such change is a change in the concentration of some gene product, either conceptualized as a noise fluctuation (in the homogeneous initial pattern) or a regulated change at a specific point (in the spike initial pattern). Mathematically, both can be conceptualized as perturbations and, for pattern transformation to be possible, such a perturbation must grow so that the initial pattern becomes unstable and can change into another resulting pattern.

      If the perturbation is small, one can use the standard linear perturbation analysis in S6.2 of our Supplementary Information. In other words, the linear analysis is enough to ascertain whether a small perturbation would grow or not. A gene network in which this does not happen would be unable to lead to pattern transformation, whatever the nonlinear part of f(g). In that sense, the linear approximation provides a necessary condition that any gene network needs to fulfill to lead to pattern transformation.
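
      The necessary condition can be checked numerically by scanning the growth rate of small perturbations over wavenumbers. The sketch below is an illustration only: the function names, the two-gene Jacobian `J` and the diffusion coefficients are hypothetical values chosen to give a textbook Turing-type example, not parameters from the manuscript.

```python
import numpy as np

def max_growth_rate(J, D, k):
    """Largest real part of the eigenvalues of J - k^2 * diag(D)."""
    return np.linalg.eigvals(J - (k ** 2) * np.diag(D)).real.max()

def can_destabilize(J, D, ks):
    """True if a small perturbation grows for at least one wavenumber in ks."""
    return bool(any(max_growth_rate(J, D, k) > 0 for k in ks))

# Hypothetical activator-inhibitor Jacobian, stable without diffusion.
J = np.array([[1.0, -1.0],
              [2.0, -1.5]])

ks = np.linspace(0.0, 3.0, 301)

D_unequal = np.array([1.0, 20.0])  # inhibitor diffuses much faster
D_equal = np.array([1.0, 1.0])     # equal diffusion coefficients

unstable_with_unequal_diffusion = can_destabilize(J, D_unequal, ks)
unstable_with_equal_diffusion = can_destabilize(J, D_equal, ks)
```

      With these illustrative values, the homogeneous state is stable at k = 0, a band of finite wavenumbers grows when the inhibitor diffuses faster, and no wavenumber grows when both diffusion coefficients are equal. A network for which `can_destabilize` is false fails the necessary condition regardless of its nonlinear terms.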

      However, the linear analysis cannot ascertain whether a specific gene network will actually lead to pattern transformation (i.e., the condition is not sufficient). This, as well as the shape of the specific resulting pattern, may also depend on the non-linear parts. As we discuss, based on the dispersion relation and other complementary arguments throughout the article, we can also get some insights into the possible patterns from the linear approximation alone (question 2). These arguments hold thanks to the imposition of requirements (R1-R5) on the function f(g), which prevent strange behaviors stemming from the nonlinear part of the equation.

      The amplitude bound on perturbations mentioned by Reviewer #1 is addressed by requirements (R2) and (R4). Although the solution to the linear system predicts unbounded growth of unstable eigenmodes, we assume functions f(g) whose nonlinear terms eventually halt this growth, thereby ensuring the boundedness of solutions as imposed by (R4). This assumption on the nonlinear part is precisely requirement (R2) on f(g) in the main text.
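
      How a nonlinear term can halt the growth predicted by the linear system can be seen in a single-variable toy example. The sketch below assumes a hypothetical rate of the form a·g − g³, chosen only because it is the simplest saturating nonlinearity; it is not the function f(g) of the manuscript, and the names `integrate`, `a` and `g0` are introduced here for illustration.

```python
def integrate(rate, g0, dt=0.01, steps=5000):
    """Forward-Euler integration of dg/dt = rate(g), starting from g0."""
    g = g0
    for _ in range(steps):
        g += dt * rate(g)
    return g

a = 1.0    # hypothetical linear growth rate of an unstable mode
g0 = 1e-3  # small initial perturbation

# Linear system only: the perturbation grows without bound.
linear_final = integrate(lambda g: a * g, g0)

# Same linear part plus a cubic sink: growth saturates at sqrt(a).
nonlinear_final = integrate(lambda g: a * g - g ** 3, g0)
```

      Both runs share the same linear part, so they agree on whether the perturbation grows; only the nonlinear run remains bounded.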

      The transitive chains and branchings in section S3 of the Supplementary Information mentioned by Reviewer #2 are topological properties of gene networks and therefore they influence only the linear part of the reaction-diffusion equations. This is why the proofs in that section are based on the linearized equations. We agree that clarifying this point in the text, as suggested by the reviewer, would improve the reader’s understanding of the section.

      Regarding Reviewer #2’s concerns about large perturbations, we acknowledge that the phrasing using “arbitrary height” may be confusing. For the homogeneous initial conditions, these perturbations are assumed to be small because they are actually molecular noise (otherwise the initial condition could not be considered homogeneous in the classical sense of developmental biology models). In the spike initial conditions in hierarchic networks, the perturbation is not necessarily small. For the analysis provided in the SI, we indeed assume that the perturbations are small enough for the linear approximation to be possible. Notice, however, that since these networks require an intracellular self-activating loop upstream of the first extracellular signal, the effective perturbation would rapidly grow to a value determined by that loop.

      In general, the height of the initial spike does not affect the fact that hierarchic networks can lead to non-trivial pattern transformation. By definition, these networks require the secretion of an extracellular signal from the cells in the spike (otherwise no change in gene product concentrations can occur over space). By definition, this signal is not produced by any other cells and, thus, its concentration is governed by diffusion from the spike and its production in the cells in the spike. Thus, whatever the initial height of the spike and whatever the non-linearities in f(g), the signal’s concentration would decrease with the distance from the spike. As explained in the main text, this would lead to non-trivial pattern transformations if other general conditions are met. In general, the height of the initial perturbation can affect which specific pattern transformation would arise from a specific gene network but not which gene network topologies can lead to pattern transformation. This will be more clearly stated in an upcoming version of the article.

      C. In the following, we respond to the remaining concerns raised by the reviewers:

      Reviewer #1:

      (1) The Results section is difficult to follow. Key logical steps and network configurations are described shortly in prose, which constantly require the reader to address either SI or other parts of the text (see numerous links on the requirements R1-R5 listed at the beginning of the paper) to gain minimal understanding. As a result, a scientifically literate but non-specialist reader may struggle to grasp the argument with a reasonable time invested.

      We acknowledge that the current version of the main text may not be as clear as we intended. Initially, we believed that placing the more technical mathematical passages in the Supplementary Information would make the main text more accessible to readers. However, we agree with the reviewer that including some of these computations in the main text could improve clarity. We also believe that adding a summary table outlining all the model’s requirements would further contribute to that goal.

      Reviewer #2:

      (1) We have serious concerns regarding the validity of the simulation results presented in the manuscript. Rather than simulating the full nonlinear system described by Equation (1), the authors base their results on a truncated expansion (Equation S.8.2) that captures only the time evolution of small deviations around a spatially homogeneous steady state. However, it remains unclear how this reduced system is derived from the full equations specifically, which terms are retained or neglected and why- and how the expansion of the nonlinear function can be steady-state independent, as claimed. Additionally, in simulations involving the spike plus homogeneous initial condition, it is not evident -or, where equations are provided, it is not correct- that the assumed global homogeneous background actually corresponds to a steady state of the full dynamics. We elaborate on these concerns in the following:

      We believe there has been a misunderstanding regarding the presentation of the model equations (S8.2) used throughout our simulations. Accordingly, we agree that this relevant section of the Supplementary Information should be rewritten in a revised version of the manuscript to clarify this issue. Below, we address all the concerns raised by the reviewer.

      Equation (S8.2) represents the full nonlinear system described in Equation (1). While we recognize that the model may oversimplify real biological processes, its purpose is to illustrate our general statements about pattern formation rather than to capture any specific or detailed mechanism. In this context, model (S8.2) offers three key advantages for our goals: it allows rapid manipulation of gene network topology simply by modifying the matrix J, making it ideal for illustrating pattern formation across different network classes; it accommodates gene networks of arbitrary size (unlike other models, such as the classical Gierer-Meinhardt model, which are limited to two-element Turing or noise-amplifying networks); and, due to the simplicity of its nonlinear terms, this model involves relatively few free parameters, facilitating the fine-tuning needed to identify parameter regions where non-trivial pattern transformations occur.

      Indeed, we find that the ability of model (S8.2) to illustrate our results despite having such simple nonlinear terms (bearing in mind that at least some nonlinearity is always necessary for self-organization) strongly supports the claim that the capacity of a gene network to produce pattern transformations is fully determined by the linear part of Equation (1). In this sense, nonlinear terms primarily influence the precise parameter values at which these transformations occur and contribute to shaping specific features of the resulting patterns.

      Model (S8.2) has been successfully employed in pattern formation studies elsewhere in the literature; accordingly, we provide relevant bibliographic references to support its widespread use.

      We believe the misunderstanding arises from our explanation of the biological interpretation of the model. As noted in the accompanying bibliography, the model is based on a general reaction-diffusion mechanism assuming the existence of a steady state. However, this conceptual reaction-diffusion framework is not the same as our Equation (1); rather, it was introduced by the original proponents of the model in the seminal paper cited in our text. In this context, Equation (S8.2) describes small concentration perturbations around that steady state, where the variables represent deviations in concentration relative to the general steady state.

      The aforementioned general steady state corresponds to the trivial equilibrium point g≡0 in equations (S8.2). Consequently, all our simulations based on model (S8.2) start from this steady state, to which we add white noise to generate homogeneous initial patterns or a sharp spike for the two types of spike initial patterns.

      It is also worth noting that Equations (S8.2) represent a non-dimensional model.

      It is assumed that the homogeneous steady states are given by g_i=0 and g_i=c_i, where 1/c_i = \mu_i or \hat{\mu}_i, independently of the specific network structure. However, the basis for this assumption is unclear, especially since some of the functions do not satisfy this condition -for example, f5 as defined below Eq. S8.10.5. Moreover, if g_i=c_i does not correspond to a true steady state, then the time evolution of deviations from this state is not correctly described by Eq. S8.2, as the zeroth-order terms do not vanish in that case.

      From the explanations above, it is important to distinguish two scales in the process: the scale of small perturbations, where equations (S8.2) apply; and the global scale, where the conceptual general reaction-diffusion system operates. Since the specific form of this general system does not affect equations (S8.2), we assume that it follows any of the models cited in the text, which yield a non-zero steady state.

      In this sense, Equation (S8.2) represents a small concentration deviation from that global system, and g(t,x) is a relative concentration: g≡0 represents the steady state, g>0 represents concentrations above it, and g<0 represents concentrations below it.

      As previously mentioned, simulations are performed using Equations (S8.2) on the basis of the equilibrium point g≡0. The result of these simulations is then superimposed on the non-zero steady state and presented in the figures throughout the article.

      Using the full model instead of the simplified Equations (S8.2) may result in slightly different resulting patterns, but it does not affect the gene network’s ability to produce pattern transformations, nor does it alter the main structural properties of the patterns, for example the periodic nature of patterns generated by Turing networks.

      Additionally, the equations used contain only linear terms and a cubic degradation term for each species g_i, while neglecting all quadratic terms and cubic terms involving cross-species interactions (i≠j). An explanation for this selective truncation is not provided, and without knowledge of the full equation (f), it is impossible to assess whether this expansion is mathematically justified. If, as suggested in the Supplementary Information, the linear and cubic terms are derived from f, then at the very least, the Jacobian matrix should depend on the background steady-state concentration. However, the equations for the small deviation around a steady state (including the Jacobian matrix) used in the simulations appear to be independent of the particular steady state concentration.

      The Jacobian of Equation (S8.2) is independent of g because g represents a small perturbation around a steady state of a general reaction-diffusion system. Consequently, the matrix J corresponds to the Jacobian of the general system evaluated at that steady state. Evaluating the Jacobian of equations (S8.2) at the equilibrium point g≡0 -which represents the general steady state- recovers the matrix J.
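
      This statement can be checked numerically. The sketch below assumes that the reaction term of (S8.2) consists of linear terms plus a cubic degradation term for each species, as described in Reviewer #2's comment; the exact form of (S8.2) is not reproduced in this response, so the function `reaction`, the coefficient `mu` and the matrix values are hypothetical stand-ins.

```python
import numpy as np

def reaction(g, J, mu):
    """Assumed reaction term: linear part J @ g plus a per-species cubic sink."""
    return J @ g - mu * g ** 3

def numerical_jacobian(func, g0, eps=1e-6):
    """Central-difference Jacobian of func evaluated at g0."""
    n = len(g0)
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (func(g0 + step) - func(g0 - step)) / (2 * eps)
    return jac

# Hypothetical two-gene network matrix and degradation coefficients.
J = np.array([[1.0, -1.0],
              [2.0, -1.5]])
mu = np.array([1.0, 1.0])

f = lambda g: reaction(g, J, mu)

jac_at_zero = numerical_jacobian(f, np.zeros(2))
jac_away = numerical_jacobian(f, np.array([0.5, 0.5]))
```

      Under this assumed form, the Jacobian evaluated at g≡0 coincides with J, whereas away from g≡0 the cubic term contributes and the Jacobian differs from J.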

      This is why we believe that the differences observed between the spike-only initial condition and the spike superimposed on a homogeneous background are not due to the initial conditions themselves, but rather result from a modified reaction scheme introduced through a questionable cutoff.

      "In simulations with spike initial patterns, the reference value g≡0 represents an actual concentration of 0 and therefore, we must add to (S8.2) a Heaviside function Φ acting of f (i.e., Φ(f(g))=f(g) if f(g)>0 , Φ(f(g))=0 if f(g)≤0 ) to prevent the existence of negative concentrations for any gene product (i.e., g_i<0 for some i )." (SI chapter S8).

      This cutoff alters the dynamics (no inhibition) and introduces a different reaction scheme between the two simulations. The need for this correction may itself reflect either a problem in the original equations (which should fulfill the necessary conditions and prevent negative concentrations (R4 in main text)) or the inappropriateness of using an expanded approximation which assumes independence on the steady state concentration. It is already questionable if the linearized equations with a cubic degradation term are valid for the spike initial conditions (with different background concentration values), as the amplitude of this perturbation seems rather large.

      For homogeneous and spike+homogeneous initial conditions, we interpret equations (S8.2) as small perturbations around a non-zero steady state of a general reaction-diffusion system. For spike-only initial conditions, that steady state is zero. As mentioned before, g≡0 then represents this steady state of zero concentration, g>0 are positive concentrations of the general system, and g<0 would represent unfeasible negative concentrations of the general system. Therefore, the use of a cutoff function to handle such initial conditions is justified. Moreover, this cutoff function is the same as the one employed in the reference general system cited in our paper.
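
      To make the role of the cutoff concrete, the following minimal sketch applies Φ as defined in the quoted SI passage to an assumed reaction term. The function names, the matrix J, the coefficients mu, and the choice of which terms sit inside f are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reaction(g, J, mu):
    """Assumed reaction term: linear part J @ g minus a cubic degradation term."""
    return J @ g - mu * g**3

def cutoff(rate):
    """Phi from the quoted passage: keep positive rates, set all others to zero."""
    return np.where(rate > 0, rate, 0.0)

# Hypothetical two-gene example: gene 0 decays, gene 1 is activated by gene 0.
J = np.array([[-1.0, 0.0],
              [1.0, -1.0]])
mu = np.array([0.1, 0.1])
g = np.array([1.0, 0.0])

raw = reaction(g, J, mu)   # rate without the cutoff
clipped = cutoff(raw)      # rate with the cutoff applied
```

      In this toy case the negative rate of the first gene is set to zero while the positive rate of the second gene is unchanged, which is the kind of altered dynamics the reviewer refers to.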

      We acknowledge that the cutoff influences the simulations and accounts for the differences observed between spike and spike+homogeneous initial conditions. However, this distinction reflects what occurs in real biological systems, which is precisely why we differentiate these two types of initial states. For instance, the emergence of a periodic pattern in a noise-amplifying network depends critically on the formation of regions with concentrations below the steady state near the initial spike. Such regions can form in spike-plus-homogeneous initial patterns but not in spike-only initial patterns, where concentrations below the steady state would correspond to biologically unfeasible negative values.

      Lastly, we note that under the current simulation scheme, it is not possible to meaningfully assess criteria RH2a and RH2b, as they rely on nonlinear interactions that are absent from the implemented dynamics.

      It is explicitly stated in the relevant subsections of Section S7 in the Supplementary Information that, for the simulations involving RH2a and RH2b, the function f(g) in equation (S8.2) is modified by adding an ad hoc quadratic term to enable the assessment of these criteria.

      (3) Several statements in the main text are presented without accompanying proof or sufficient explanation, which makes it difficult to assess their validity. In some cases, the lack of justification raises serious doubts about whether the claims are generally true. Examples are:

      "For the purpose of clarity we will explain our results as if these cells have a simple arrangement in space (e.g., a 1D line or a 2D square lattice) but, as we will discuss, our results shall apply with the same logic to any distribution of cells in space." (Main text l.145-l.148).

      We believe that the confusion in this statement arises from the ambiguous use of the phrase “our results”. We will revise the text to provide a more precise description. Specifically, by “our results,” we refer to the conclusion that it is possible to determine whether a gene network leads to nontrivial pattern transformations based solely on its topology. This conclusion is independent of the dimensionality of space, as none of our arguments rely on assumptions specific to spatial dimensions. While one-dimensional examples are used for clarity and illustration, the underlying reasoning applies generally. In an improved version of the article, we will clarify this point explicitly and move relevant arguments from the Supplementary Information into the main text.

      Critically, our classification of gene networks is ultimately based on an argument concerning the dispersion relation associated with the network, and the construction of this dispersion relation is independent of the spatial dimensionality of the domain. In this sense, the networks identified in the text as capable of producing pattern transformations will be able to generate non-trivial pattern transformations in any spatial domain and in any number of dimensions. While the specific parameter values that permit such transformations may vary depending on the geometry and dimensionality of the domain, the existence of at least one such parameter set remains unaffected.

      The geometry of the domain can influence the specific form of the resulting patterns, but it does not alter the broader class of patterns (e.g., periodic patterns, peaks emerging around a spike, etc.) that a given gene network topology can produce. One such geometric influence, commonly observed in simulations, involves boundary effects. For example, structures such as peaks or rings forming near the boundaries may appear higher, broader, or spatially shifted compared to those arising in the central regions of the domain. However, we think a pattern consisting of a periodic train of peaks where only those near the boundary are slightly different can still be classified as a periodic pattern.

      "For any non-trivial pattern transformation (as long as it is symmetric around the initial spike), there exists an H gene network capable of producing it from a spike initial pattern." (Main text l.366f).

      A justification for this statement is provided shortly after the claim, although we acknowledge that the current explanation is somewhat cumbersome and would benefit from a clearer presentation in a revised version of the main text.

      A more detailed justification is provided in the Supplementary Information, based on three key ideas. First, any pattern (provided it is symmetric with respect to the initial spike) can be described as an arrangement of peaks with varying heights and spatial positions along a one-dimensional domain. Second, there exists a simple gene network—the diamond network—that, through parameter tuning, can produce two peaks of arbitrary height and symmetric position relative to the initial spike. Third, by placing multiple diamond networks positively upstream of a common gene product, that gene product can express peaks at each location where the upstream diamond networks induce them. Under mild additional conditions, this mechanism allows the formation of essentially any symmetric pattern. These mild conditions, along with a detailed analysis of the diamond network’s ability to generate peaks with controllable height and position, are discussed in the Supplementary Information.

      "In 2D there are no peaks but concentric rings of high gene product concentration centered around the spike, while in 3D there are concentric spherical shells." (Main text l. 447ff).

      This result pertains specifically to pattern transformations arising from spike initial patterns. As defined in the text, spike initial patterns are radially symmetric. Since diffusion preserves radial symmetry, pattern transformations from spike initial patterns in two or three dimensions reduce to effectively one-dimensional transformations along each radial direction. In this framework, each pair of concentration peaks symmetric with respect to the spike in one dimension corresponds to a ring surrounding the spike in two dimensions, and each ring in two dimensions becomes a hollow spherical shell around the spike in three dimensions.
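
      The reduction can be written explicitly. For a radially symmetric concentration g(t, r) in d spatial dimensions, the Laplacian in the diffusion term takes the standard radial form below (r denotes the distance to the spike; this notation is ours and is not taken from the manuscript).

```latex
\nabla^{2} g = \frac{\partial^{2} g}{\partial r^{2}} + \frac{d-1}{r}\,\frac{\partial g}{\partial r},
\qquad d = 1, 2, 3 .
```

      The problem therefore depends on the single variable r, although for d = 2 and d = 3 the additional first-derivative term means it is not identical to the one-dimensional equation, so peak heights and positions may differ quantitatively.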

      We agree that including a brief section in the Supplementary Information to clarify these subtleties would be helpful for readers to better understand the generalization of certain patterns to higher dimensions.

      (4) The study identifies one-signal networks and examines how combinations of these structures can give rise to minimal pattern-forming subnetworks. However, the analysis of the combinations of these minimal pattern-forming subnetworks remains relatively brief, and the manuscript does not explore how the results might change if the subnetworks were combined in upstream and downstream configurations. In our view, it is not evident that all possible gene regulatory networks can be fully characterized by these categories, nor that the resulting patterns can be reliably predicted. Rather, the approach appears more suited to identifying which known subnetworks are present within a larger network, without necessarily capturing the full dynamics of more complex configurations.

      We acknowledge that our explanation regarding the combination of sub-networks was relatively brief, and we intend to address this in a revised version. Our argument that combining sub-networks does not produce qualitatively new types of pattern transformations -beyond those already described- is based on the dispersion relation. Although this relation was only detailed in the Supplementary Information, it is central to our argument and will therefore be moved to the main text. Below, we provide an outline of this argument:

      Our study identifies two distinct behaviors of the principal branch of the dispersion relation at large wavenumbers. Based on this, gene networks capable of pattern formation can be classified into two categories: networks of the first kind, where the real part of the principal branch diverges to infinity as the wavenumber increases; and networks of the second kind, where the real part of the principal branch converges to a positive finite value for large wavenumbers. Naturally, this argument applies to any gene network irrespective of which, or how many, sub-networks are used to build it.
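
      As an illustration of how the large-wavenumber behavior of a principal branch can be inspected numerically, the sketch below assumes that the dispersion relation is obtained in the usual way for linearized reaction-diffusion systems, from the eigenvalues of J - k^2 D. The matrix J, the diffusion matrices, and the function name are hypothetical and do not correspond to any network in the manuscript.

```python
import numpy as np

def principal_branch(J, D, ks):
    """Largest real part among the eigenvalues of J - k^2 D, for each wavenumber k."""
    return np.array([np.linalg.eigvals(J - k**2 * D).real.max() for k in ks])

# Hypothetical two-species Jacobian with positive self-regulation of species 0.
J = np.array([[0.5, -1.0],
              [1.0, -1.0]])
ks = np.array([0.0, 1.0, 10.0, 100.0])

D_mixed = np.diag([0.0, 1.0])  # species 0 cannot diffuse
D_all = np.diag([1.0, 1.0])    # both species diffuse

plateau = principal_branch(J, D_mixed, ks)  # levels off near J[0, 0] at large k
decay = principal_branch(J, D_all, ks)      # keeps decreasing as k grows
```

      In this toy example, the branch with a non-diffusing species approaches the positive finite value 0.5 at large wavenumbers, which resembles the behavior described for networks of the second kind. The all-diffusing case is included only for contrast and is not meant as an example of the authors' first kind.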

      Any gene regulatory network capable of pattern formation falls into one of these two categories. We identified that networks of the first kind contain at least one Turing sub-network, whereas networks of the second kind include either an H sub-network or a noise-amplifying sub-network. In this way, the primary objective of our study -namely, achieving a topological classification of gene regulatory networks capable of pattern formation- is fulfilled. It is important to note that while the dispersion relation provides broad information about the possible resulting patterns a gene network topology can produce (e.g., periodic versus noisy), it does not specify the exact patterns that emerge for each particular set of parameter values.

      Finally, regarding the shape of the resulting patterns, Figure S10 in the Supplementary Information exemplifies the notion that the behavior of combined networks can be understood as a combination of the individual behaviors of each constituent sub-network (note that the contribution of each type of sub-network in the resulting pattern is readily distinguishable). Consequently, we focus our detailed analysis on the patterning properties of the fundamental classes.

      (6) The manuscript lacks a clear and detailed explanation of the underlying model and its assumptions. In particular, it is not well-defined what constitutes a "cell" in the context of the model, nor is it justified why spatial features of cells -such as their size or boundaries- can be neglected. Furthermore, the concept of the extracellular space in the one-dimensional model remains ambiguous, making it unclear which gene products are assumed to diffuse.

      The size of cells is ignored in our model because we assume that they are small enough relative to the total size of the domain that the spatially continuous reaction-diffusion equation (equation (1) in the main text) holds. Conceptually, one could understand the cells in our model as the pieces of an even partition of the domain into small subdomains surrounding each position x. This is, in any case, the standard procedure in most models of pattern formation by reaction-diffusion in embryonic development.

      For extracellular signals, we assume that g(t, x) corresponds to the concentration of the signal in the extracellular space surrounding the cell located at position x. The extracellular space is any fluid medium for which Fick's laws apply and, therefore, the Fickian diffusion term in equation (1) is valid.

      For intracellular gene products, we assume that g(t, x) corresponds to the concentration of such a gene product within the cell at position x (if the gene product at hand is a transcription factor, for example), or on its surface (if it is a membrane-bound receptor). Once collapsed into the continuous equations, there is no difference between being strictly within the cell or on its boundary. The only important fact is that these gene products cannot diffuse.

      Regarding cell boundaries, let us consider an extracellular signal s that regulates a transcription factor i within cells (in our model, i is an intracellular gene product). Such regulation is mediated by a membrane-bound receptor, which corresponds to an intracellular gene product j. In terms of the gene regulatory network, this is s→j→i. The cell boundary effects mentioned by the reviewer should be encapsulated in the specific functional form of the regulation function f(g), but they have no effect on the actual topology of the network. Consequently, they are outside the scope of this study: as we mentioned before, considering different non-linear terms for f(g) will affect the parameter range for which a gene network is capable of producing non-trivial pattern transformations, but not its overall ability to produce non-trivial pattern transformations (i.e., the existence of at least one choice of model parameters for which such transformations take place).

      Finally, we would like to once again express our sincere gratitude to all reviewers for their insightful and constructive feedback. We are confident that the thorough peer review process will significantly enhance both the clarity and depth of our work. We greatly value the detailed comments provided and will carefully incorporate them in the preparation of a revised manuscript, which we intend to submit in the coming months.

    1. Author Response

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Given knowledge of the amino acid sequence and of some version of the 3D structure of two monomers that are expected to form a complex, the authors investigate whether it is possible to accurately predict which residues will be in contact in the 3D structure of the expected complex. To this effect, they train a deep learning model that takes as inputs the geometric structures of the individual monomers, per-residue features (PSSMs) extracted from MSAs for each monomer, and rich representations of the amino acid sequences computed with the pre-trained protein language models ESM-1b, MSA Transformer, and ESM-IF. Predicting inter-protein contacts in complexes is an important problem. Multimer variants of AlphaFold, such as AlphaFold-Multimer, are the current state of the art for full protein complex structure prediction, and if the three-dimensional structure of a complex can be accurately predicted then the inter-protein contacts can also be accurately determined. By contrast, the method presented here seeks state-of-the-art performance among models that have been trained end-to-end for inter-protein contact prediction.

      Strengths:

      The paper is carefully written and the method is very well detailed. The model works both for homodimers and heterodimers. The ablation studies convincingly demonstrate that the chosen model architecture is appropriate for the task. Various comparisons suggest that PLMGraph-Inter performs substantially better, given the same input than DeepHomo, GLINTER, CDPred, DeepHomo2, and DRN-1D2D_Inter. As a byproduct of the analysis, a potentially useful heuristic criterion for acceptable contact prediction quality is found by the authors: namely, to have at least 50% precision in the prediction of the top 50 contacts.
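
      The heuristic mentioned above (at least 50% precision among the top 50 predicted contacts) can be sketched as follows. The function name and the toy arrays are illustrative assumptions; the sketch takes an inter-protein score matrix and a binary contact map of the same shape.

```python
import numpy as np

def top_k_precision(scores, contacts, k=50):
    """Fraction of the k highest-scoring residue pairs that are true contacts."""
    flat_scores = scores.ravel()
    flat_contacts = contacts.ravel().astype(bool)
    k = min(k, flat_scores.size)
    top = np.argsort(flat_scores)[::-1][:k]  # indices of the k largest scores
    return float(flat_contacts[top].mean())

# Toy 2 x 3 example (rows: residues of protein A, columns: residues of protein B).
scores = np.array([[0.9, 0.1, 0.8],
                   [0.2, 0.7, 0.3]])
contacts = np.array([[1, 0, 0],
                     [0, 1, 1]])

precision = top_k_precision(scores, contacts, k=2)
acceptable = precision >= 0.5  # the heuristic threshold, applied here with k=2
```

      With real predictions, k would be set to 50 to match the criterion described in the review.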

      We thank the reviewer for recognizing the strengths of our work!

      Weaknesses:

      My biggest issue with this work is the evaluations made using bound monomer structures as inputs, coming from the very complexes to be predicted. Conformational changes in protein-protein association are the key element of the binding mechanism and are challenging to predict. While the GLINTER paper (Xie & Xu, 2022) is guilty of the same sin, the authors of CDPred (Guo et al., 2022) correctly only report test results obtained using predicted unbound tertiary structures as inputs to their model. Test results using experimental monomer structures in bound states can hide important limitations in the model, and thus say very little about the realistic use cases in which only the unbound structures (experimental or predicted) are available. I therefore strongly suggest reducing the importance given to the results obtained using bound structures and emphasizing instead those obtained using predicted monomer structures as inputs.

      We thank the reviewer for the suggestion! We evaluated PLMGraph-Inter with the predicted monomers and analyzed the results in detail (see the “Impact of the monomeric structure quality on contact prediction” section and Figure 3). To mimic real use cases, we even deliberately reduced the performance of AF2 by using reduced MSAs (see the 2nd paragraph in the “Impact of the monomeric structure quality on contact prediction” section). Some of these results are currently in the supplementary material (Table S2). We will move them to the main text in the revision to emphasize the performance of PLMGraph-Inter with predicted monomers.

      In particular, the most relevant comparison with AlphaFold-Multimer (AFM) is given in Figure S2, not Figure 6. Unfortunately, it substantially shrinks the proportion of structures for which AFM fails while PLMGraph-Inter performs decently. Still, it would be interesting to investigate why this occurs. One possibility would be that the predicted monomer structures are of bad quality there, and PLMGraph-Inter may be able to rely on a signal from its language model features instead. Finally, AFM multimer confidence values ("iptm + ptm") should be provided, especially in the cases in which AFM struggles.

      We thank the reviewer for the suggestion! Yes! The performance of PLMGraph-Inter drops when the predicted monomers are used in the prediction. However, it is difficult to say which is a fairer comparison, Figure 6 or Figure S2, since AFM also searched monomer templates (see the third paragraph in 7. Supplementary Information : 7.1 Data in the AlphaFold-Multimer preprint: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full) in the prediction. When we checked our AFM runs, we found that 99% of the targets in our study (including all the targets in the four datasets: HomoPDB, HeteroPDB, DHTest and DB5.5) employed at least 20 templates in their predictions, and 87.8% of the targets employed the native templates. We will provide the AFM confidence values of the AFM predictions in the revision.

      Besides, in cases where any experimental structures - bound or unbound - are available and given to PLMGraph-Inter as inputs, they should also be provided to AlphaFold-Multimer (AFM) as templates. Withholding these from AFM only makes the comparison artificially unfair. Hence, a new test should be run using AFM templates, and a new version of Figure 6 should be produced. Additionally, AFM's mean precision, at least for top-50 contact prediction, should be reported so it can be compared with PLMGraph-Inter's.

      We thank the reviewer for the suggestion! We would like to note that AFM also searched monomer templates (see the third paragraph in 7. Supplementary Information : 7.1 Data in the AlphaFold-Multimer preprint: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full) in the prediction. When we checked our AFM runs, we found that 99% of the targets in our study (including all the targets in the four datasets: HomoPDB, HeteroPDB, DHTest and DB5.5) employed at least 20 templates in their predictions, and 87.8% of the targets employed the native template.

      It's a shame that many of the structures used in the comparison with AFM are actually in the AFM v2 training set. If there are any outside the AFM v2 training set and, ideally, not sequence- or structure-homologous to anything in the AFM v2 training set, they should be discussed and reported on separately. In addition, why not test on structures from the "Benchmark 2" or "Recent-PDB-Multimers" datasets used in the AFM paper?

      We thank the reviewer for the suggestion! The biggest challenge in objectively evaluating AFM is that, as far as we know, AFM has not released the PDB IDs of its training set or of the “Recent-PDB-Multimers” dataset. “Benchmark 2” only includes 17 heterodimer proteins, and this number would decrease further after removing targets redundant to our training set. We think it is difficult to draw conclusions from such a small number of targets. In the revision, we will analyze the performance of AFM on targets released after the date cutoff of the AFM training set, although for these targets we cannot completely remove the redundancy between the training and test sets of AFM.

      It is also worth noting that the AFM v2 weights have now been outdated for a while, and better v3 weights now exist, with a training cutoff of 2021-09-30.

      We thank the reviewer for reminding us of the new version of AFM. The only difference between AFM V3 and V2 is the cutoff date of the training set. Our test set would have more overlap with the training set of AFM V3, which is one reason why we think AFM V2 is more appropriate for the comparison.

      Another weakness in the evaluation framework: because PLMGraph-Inter uses structural inputs, it is not sufficient to make its test set non-redundant in sequence to its training set. It must also be non-redundant in structure. The Benchmark 2 dataset mentioned above is an example of a test set constructed by removing structures with homologous templates in the AF2 training set. Something similar should be done here.

      We agree with the reviewer that testing whether the model can keep its performance on targets with no templates (i.e. non-redundant in structure) is important. We will perform the analysis in the revision.

      Finally, the performance of DRN-1D2D for top-50 precision reported in Table 1 suggests to me that, in an ablation study, language model features alone would yield better performance than geometric features alone. So, I am puzzled why model "a" in the ablation is a "geometry-only" model and not a "LM-only" one.

      Using the protein geometric graph to integrate multiple protein language models is the main idea of PLMGraph-Inter. Compared with our previous work (DRN-1D2D_Inter), we consider the construction of the geometric graph to be one major contribution of this work. To emphasize the efficacy of this geometric graph, we chose to use the “geometry-only” model as the base model. We will further clarify this in the revision.

      Reviewer #2 (Public Review):

      This work introduces PLMGraph-Inter, a new deep-learning approach for predicting inter-protein contacts, which is crucial for understanding protein-protein interactions. Despite advancements in this field, especially driven by AlphaFold, prediction accuracy and efficiency (in terms of computational cost) still remain areas for improvement. PLMGraph-Inter utilizes invariant geometric graphs to integrate the features from multiple protein language models into the structural information of each subunit. When compared against other inter-protein contact prediction methods, PLMGraph-Inter shows better performance, which indicates that utilizing both sequence embeddings and structural embeddings is important to achieve high-accuracy predictions with relatively smaller computational costs for the model training.

      The conclusions of this paper are mostly well supported by data, but test examples should be revisited with a more strict sequence identity cutoff to avoid any potential information leakage from the training data. The main figures should be improved to make them easier to understand.

      We thank the reviewer for recognizing the significance of our work! We will revise the manuscript carefully to address the reviewer’s concerns.

      1. The sequence identity cutoff to remove redundancies between training and test set was set to 40%, which is a bit high to remove test examples having homology to training examples. For example, CDPred uses a sequence identity cutoff of 30% to strictly remove redundancies between training and test set examples. To make their results more solid, the authors should have curated test examples with lower sequence identity cutoffs, or have provided the performance changes against sequence identities to the closest training examples.

      We thank the reviewer for the valuable suggestion! Using different thresholds to reduce the redundancy between the test set and the training set is a very good suggestion, and we will perform the analysis in the revision. In the current version of the manuscript, 40% sequence identity is used as the cutoff because many previous studies used this cutoff (e.g. the Recent-PDB-Multimers used in AlphaFold-Multimer (see: 7.8 Datasets in the AlphaFold-Multimer paper); the work of DSCRIPT: https://www.cell.com/action/showPdf?pii=S2405-4712%2821%2900333-1 (see: the PPI dataset paragraph in the METHODS DETAILS section of the STAR METHODS)). One reason for using a relatively high threshold in PPI studies is that PPIs are generally not as conserved as protein monomers.

      We performed a preliminary analysis using different thresholds to remove redundancy when preparing this provisional response letter:

      Author response table 1.

      Table 1. The performance of PLMGraph-Inter on the HomoPDB and HeteroPDB test sets using native structures (AlphaFold2 predicted structures).

      Method:

      To remove redundancy, we clustered 11096 sequences from the training set and test sets (HomoPDB, HeteroPDB) using MMseqs2 with different sequence identity thresholds (40%, 30%, 20%, 10%) (the lowest cutoff for CD-HIT is 40%, so we switched to MMseqs2). Each sequence is then uniquely labeled by the cluster (e.g. cluster 0, cluster 1, …) to which it belongs, from which each PPI can be marked with a pair of clusters (e.g. cluster 0-cluster 1). PPIs belonging to the same cluster pair (note: cluster n-cluster m and cluster m-cluster n were considered the same pair) were considered redundant. For each PPI in the test set, if the cluster pair it belongs to contains a PPI from the training set, we remove that PPI from the test set.
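
      The filtering step described in this paragraph can be sketched as follows, assuming the cluster label of every sequence is already available from the clustering run. The function names and the toy data are hypothetical; this is an illustration of the described logic, not the authors' script.

```python
def cluster_pair(ppi, cluster_of):
    """Unordered pair of cluster labels for a PPI given as (seq_a, seq_b)."""
    a, b = ppi
    return tuple(sorted((cluster_of[a], cluster_of[b])))

def remove_redundant(test_ppis, train_ppis, cluster_of):
    """Drop every test PPI whose cluster pair also occurs in the training set."""
    train_pairs = {cluster_pair(p, cluster_of) for p in train_ppis}
    return [p for p in test_ppis if cluster_pair(p, cluster_of) not in train_pairs]

# Toy example: sequence identifier -> cluster label.
cluster_of = {"A": 0, "B": 1, "C": 1, "D": 0, "E": 2}
train = [("A", "B")]              # cluster pair (0, 1)
test = [("C", "D"), ("A", "E")]   # cluster pairs (1, 0) and (0, 2)

kept = remove_redundant(test, train, cluster_of)
```

      Sorting the two labels makes the pair order-independent, so the test PPI with clusters (1, 0) is treated as redundant with the training PPI with clusters (0, 1) and removed.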

      We will perform more detailed analyses in the revised manuscript.

      2. Figures with head-to-head comparison scatter plots are hard to understand as scatter plots because too many different methods are abstracted into a single plot with multiple colors. It would be better to provide individual head-to-head scatter plots as supplementary figures, not in the main figure.

      We thank the reviewer for the suggestion! We will include the individual head-to-head scatter plots as supplementary figures in the revision.

      3) The authors claim that PLMGraph-Inter is complementary to AlphaFold-multimer as it shows better precision for the cases where AlphaFold-multimer fails. To strengthen the point, the qualities of predicted complex structures via protein-protein docking with predicted contacts as restraints should have been compared to those of AlphaFold-multimer structures.

      We thank the reviewer for the suggestion! We will add this comparison in the revision.

      4) It would be interesting to further analyze whether there is a difference in prediction performance depending on the depth of multiple sequence alignment or the type of complex (antigen-antibody, enzyme-substrates, single species PPI, multiple species PPI, etc).

      We thank the reviewer for the suggestion! We will perform such analysis in the revision.

    1. Author response:

      eLife Assessment 

      This valuable study investigates how the neural representation of individual finger movements changes during the early period of sequence learning. By combining a new method for extracting features from human magnetoencephalography data and decoding analyses, the authors provide incomplete evidence of an early, swift change in the brain regions correlated with sequence learning, including a set of previously unreported frontal cortical regions. The addition of more control analyses to rule out that head movement artefacts influence the findings, and to further explain the proposal of offline contextualization during short rest periods as the basis for improved performance would strengthen the manuscript.

      We appreciate the Editorial assessment on our paper’s strengths and novelty.  We have implemented additional control analyses to show that neither task-related eye movements nor increasing overlap of finger movements during learning account for our findings, which are that contextualized neural representations in a network of bilateral frontoparietal brain regions actively contribute to skill learning.  Importantly, we carried out additional analyses showing that contextualization develops predominantly during rest intervals.

      Public Reviews:

      We thank the Reviewers for their comments and suggestions, prompting new analyses and additions that strengthened our report.

      Reviewer #1 (Public review): 

      Summary: 

      This study addresses the issue of rapid skill learning and whether individual sequence elements (here: finger presses) are differentially represented in human MEG data. The authors use a decoding approach to classify individual finger elements and accomplish an accuracy of around 94%. A relevant finding is that the neural representations of individual finger elements dynamically change over the course of learning. This would be highly relevant for any attempts to develop better brain machine interfaces - one now can decode individual elements within a sequence with high precision, but these representations are not static but develop over the course of learning. 

      Strengths: The work follows a large body of work from the same group on the behavioural and neural foundations of sequence learning. The behavioural task is well established and neatly designed to allow for tracking learning and how individual sequence elements contribute. The inclusion of short offline rest periods between learning epochs has been influential because it has revealed that a lot, if not most of the gains in behaviour (ie speed of finger movements) occur in these so-called micro-offline rest periods. The authors use a range of new decoding techniques, and exhaustively interrogate their data in different ways, using different decoding approaches. Regardless of the approach, impressively high decoding accuracies are observed, but when using a hybrid approach that combines the MEG data in different ways, the authors observe decoding accuracies of individual sequence elements from the MEG data of up to 94%. 

      We have previously shown that neural replay of MEG activity representing the practiced skill correlated with micro-offline gains during rest intervals of early learning1, consistent with the recent report that hippocampal ripples during these offline periods predict human motor sequence learning2.  However, decoding accuracy in our earlier work1 needed improvement.  Here, we report a strategy to improve decoding accuracy that could benefit future studies of neural replay or BCI using MEG.

      Weaknesses: 

      There are a few concerns which the authors may well be able to resolve. These are not weaknesses as such, but factors that would be helpful to address as these concern potential contributions to the results that one would like to rule out. Regarding the decoding results shown in Figure 2 etc, a concern is that within individual frequency bands, the highest accuracy seems to be within frequencies that match the rate of keypresses. This is a general concern when relating movement to brain activity, so is not specific to decoding as done here. As far as reported, there was no specific restraint to the arm or shoulder, and even then it is conceivable that small head movements would correlate highly with the vigor of individual finger movements. This concern is supported by the highest contribution in decoding accuracy being in middle frontal regions - midline structures that would be specifically sensitive to movement artefacts and don't seem to come to mind as key structures for very simple sequential keypress tasks such as this - and the overall pattern is remarkably symmetrical (despite being a unimanual finger task) and spatially broad. This issue may well be matching the time course of learning, as the vigor and speed of finger presses will also influence the degree to which the arm/shoulder and head move. This is not to say that useful information is contained within either of the frequencies or broadband data. But it raises the question of whether a lot is dominated by movement "artefacts" and one may get a more specific answer if removing any such contributions. 

      Reviewer #1 expresses concern that the combination of the low-frequency narrow-band decoder results and the bilateral middle frontal regions displaying the highest average intra-parcel decoding performance across subjects suggests that the decoding results could be driven by head movement or other artefacts.

      Head movement artefacts are highly unlikely to contribute meaningfully to our results for the following reasons. First, in addition to ICA denoising, all “recordings were visually inspected and marked to denoise segments containing other large amplitude artifacts due to movements” (see Methods). Second, the response pad was positioned in a manner that minimized wrist, arm or more proximal body movements during the task. Third, while head position was not monitored online for this study, the head was restrained using an inflatable air bladder, and head position was assessed at the beginning and at the end of each recording. Head movement did not exceed 5 mm between the beginning and end of each scan for all participants included in the study. Fourth, we agree that despite the steps taken above, it is possible that minor head movements could still contribute to some remaining variance in the MEG data in our study. The Reviewer states a concern that “it is conceivable that small head movements would correlate highly with the vigor of individual finger movements”. However, in order for any such correlations to meaningfully impact decoding performance, such head movements would need to: (A) be consistent and pervasive throughout the recording (which might not be the case if the head movements were related to movement vigor and vigor changed over time); and (B) systematically vary between different finger movements, and also between the same finger movement performed at different sequence locations (see 5-class decoding performance in Figure 4B). It is extremely unlikely that any head movement artefacts would meet all these conditions.

      Given the task design, a much more likely confound in our estimation would be the contribution of eye movement artefacts to the decoder performance (an issue appropriately raised by Reviewer #3 in the comments below). Recall from Figure 1A in the manuscript that an asterisk marks the current position in the sequence and is updated at each keypress. Since participants make very few performance errors, the position of the asterisk on the display is highly correlated with the keypress being made in the sequence. Thus, it is possible that if participants are attending to the visual feedback provided on the display, they may move their eyes in a way that is systematically related to the task.  Since we did record eye movements simultaneously with the MEG recordings (EyeLink 1000 Plus; Fs = 600 Hz), we were able to perform a control analysis to address this question. For each keypress event during trials in which no errors occurred (the same time-point at which the asterisk position is updated), we extracted three features related to eye movements: 1) the gaze position at the time of asterisk position update (or keyDown event), 2) the gaze position 150 ms later, and 3) the peak velocity of the eye movement between the two positions. We then constructed a classifier from these features with the aim of predicting the location of the asterisk (ordinal positions 1-5) on the display. As shown in the confusion matrix below (Author response image 1), the classifier failed to perform above chance levels (Overall cross-validated accuracy = 0.21817):

      Author response image 1.

      Confusion matrix showing that three eye movement features fail to predict asterisk position on the task display above chance levels (Fold 1 test accuracy = 0.21718; Fold 2 test accuracy = 0.22023; Fold 3 test accuracy = 0.21859; Fold 4 test accuracy = 0.22113; Fold 5 test accuracy = 0.21373; Overall cross-validated accuracy = 0.21817). Since the ordinal position of the asterisk on the display is highly correlated with the ordinal position of individual keypresses in the sequence, this analysis provides strong evidence that keypress decoding performance from MEG features is not explained by systematic relationships between finger movement behavior and eye movements (i.e. – behavioral artefacts).
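
      As an illustrative check (not the code used in the study), the fold-wise accuracies quoted in the caption can be compared with the chance level for a five-class problem. The sketch assumes that the overall accuracy is the unweighted mean of the five folds and that the five ordinal positions are equally frequent, so that chance is 1/5; both are assumptions made here for illustration.

```python
from statistics import mean

# Fold-wise test accuracies as quoted in the caption of Author response image 1.
fold_accuracies = [0.21718, 0.22023, 0.21859, 0.22113, 0.21373]

# Assumption: five equally frequent asterisk positions, so chance is 1/5.
N_CLASSES = 5
chance_level = 1 / N_CLASSES

# Assumption: the overall accuracy is the unweighted mean across folds.
overall_accuracy = mean(fold_accuracies)
margin_over_chance = overall_accuracy - chance_level

print(f"overall accuracy:   {overall_accuracy:.5f}")
print(f"chance level:       {chance_level:.5f}")
print(f"margin over chance: {margin_over_chance:.5f}")
```

      Under these assumptions the mean of the folds matches the overall value of 0.21817 quoted in the response text, and every fold lies within about two percentage points of the 0.20 chance level.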

      In fact, inspection of the eye position data revealed that a majority of participants on most trials displayed random walk gaze patterns around a center fixation point, indicating that participants did not attend to the asterisk position on the display. This is consistent with intrinsic generation of the action sequence, and congruent with the fact that the display does not provide explicit feedback related to performance. A similar real-world example would be manually inputting a long password into a secure online application. In this case, one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks), which is typically ignored by the user. The minimal participant engagement with the visual task display observed in this study highlights another important point – that the behavior in explicit sequence learning motor tasks is highly generative in nature rather than reactive to stimulus cues as in the serial reaction time task (SRTT).  This is a crucial difference that must be carefully considered when designing investigations and comparing findings across studies.

      We observed that keypress decoding accuracy was predominantly driven by contralateral primary sensorimotor cortex in the initial practice trials before transitioning to bilateral frontoparietal regions by trials 11 or 12 as performance gains plateaued.  The contribution of contralateral primary sensorimotor areas to early skill learning has been extensively reported in humans and non-human animals1,3-5.  Similarly, the increased contribution of bilateral frontal and parietal regions to decoding during early skill learning in the non-dominant hand is well known.  Enhanced bilateral activation in both frontal and parietal cortex during skill learning has been extensively reported6-11, and appears to be even more prominent during early fine motor skill learning in the non-dominant hand12,13.  The frontal regions identified in these studies are known to play crucial roles in executive control14, motor planning15, and working memory6,8,16-18 processes, while the same parietal regions are known to integrate multimodal sensory feedback and support visuomotor transformations6,8,16-18, in addition to working memory19. Thus, it is not surprising that these regions increasingly contribute to decoding as subjects internalize the sequential task.  We now include a statement reflecting these considerations in the revised Discussion.

      A somewhat related point is this: when combining voxel and parcel space, a concern is whether a degree of circularity may have contributed to the improved accuracy of the combined data, because it seems to use the same MEG signals twice - the voxels most contributing are also those contributing most to a parcel being identified as relevant, as parcels reflect the average of voxels within a boundary. In this context, I struggled to understand the explanation given, ie that the improved accuracy of the hybrid model may be due to "lower spatially resolved whole-brain and higher spatially resolved regional activity patterns".

      We strongly disagree with the Reviewer’s assertion that the construction of the hybrid-space decoder is circular. To clarify, the base feature set for the hybrid-space decoder constructed for all participants includes whole-brain spatial patterns of MEG source activity averaged within parcels. As stated in the manuscript, these 148 inter-parcel features reflect “lower spatially resolved whole-brain activity patterns” or global brain dynamics. We then independently test how well spatial patterns of MEG source activity for all voxels distributed within individual parcels can decode keypress actions. Again, the tests of these intra-parcel spatial patterns, intended to capture “higher spatially resolved regional brain activity patterns”, are completely independent of one another and of the weighting of individual inter-parcel features. These intra-parcel features could, for example, provide additional information about muscle activation patterns or the task environment. These approximately 1150 intra-parcel voxels (on average, with the total number varying between subjects) are then combined with the 148 inter-parcel features to construct the final hybrid-space decoder. In fact, this varied spatial filter approach shares some similarities with the construction of convolutional neural networks (CNNs) used to perform object recognition in image classification applications. One could also view this hybrid-space decoding approach as a spatial analogue to common time-frequency based analyses such as theta-gamma phase amplitude coupling (PAC), which combine information from two or more narrow-band spectral features derived from the same time-series data.

      We directly tested this hypothesis – that spatially overlapping intra- and inter-parcel features portray different information – by constructing an alternative hybrid-space decoder (HybridAlt) that excluded average inter-parcel features which spatially overlapped with intra-parcel voxel features, and comparing the performance to the decoder used in the manuscript (HybridOrig). The prediction was that if the overlapping parcel contained similar information to the more spatially resolved voxel patterns, then removing the parcel features (n=8) from the decoding analysis should not impact performance. In fact, despite making up less than 1% of the overall input feature space, removing those parcels resulted in a significant drop in overall performance greater than 2% (78.15% ± SD 7.03% for HybridOrig vs. 75.49% ± SD 7.17% for HybridAlt; Wilcoxon signed rank test, z = 3.7410, p = 1.8326e-04) (Author response image 2).

      Author response image 2.

      Comparison of decoding performances with two different hybrid approaches. HybridAlt: Intra-parcel voxel-space features of top ranked parcels and inter-parcel features of remaining parcels. HybridOrig: Voxel-space features of top ranked parcels and whole-brain parcel-space features (i.e. – the version used in the manuscript). Dots represent decoding accuracy for individual subjects. Dashed lines indicate the trend in performance change across participants. Note that HybridOrig (the approach used in our manuscript) significantly outperforms the HybridAlt approach, indicating that the excluded parcel features provide unique information compared to the spatially overlapping intra-parcel voxel patterns.
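
      The difference between the two feature sets can be illustrated with a minimal sketch (an illustration only, not the pipeline used in the study). The function name `build_hybrid_features`, the `drop_overlapping` flag and the toy parcel values are hypothetical; the feature counts of 148, approximately 1150 and 8 are taken from the response above.

```python
from statistics import mean


def build_hybrid_features(voxels_by_parcel, top_parcels, drop_overlapping=False):
    """Concatenate parcel averages with the voxel values of the top-ranked parcels.

    drop_overlapping=False mimics HybridOrig (all parcel averages are kept);
    drop_overlapping=True mimics HybridAlt (averages of the top parcels are dropped).
    """
    top = set(top_parcels)
    # Inter-parcel features: one average per parcel, in a fixed (sorted) order.
    inter = [
        mean(values)
        for name, values in sorted(voxels_by_parcel.items())
        if not (drop_overlapping and name in top)
    ]
    # Intra-parcel features: every voxel of each top-ranked parcel.
    intra = [v for name in sorted(top) for v in voxels_by_parcel[name]]
    return inter + intra


# Hypothetical toy data: three parcels with a handful of voxels each.
voxels_by_parcel = {
    "parcel_a": [0.2, 0.4, 0.9],
    "parcel_b": [1.0, 1.2],
    "parcel_c": [0.1, 0.3, 0.5, 0.7],
}
hybrid_orig = build_hybrid_features(voxels_by_parcel, ["parcel_c"])
hybrid_alt = build_hybrid_features(voxels_by_parcel, ["parcel_c"], drop_overlapping=True)
print(len(hybrid_orig), len(hybrid_alt))

# Share of the feature space taken up by the 8 removed parcel features,
# using the approximate counts quoted in the response (148 parcels, ~1150 voxels).
removed_share = 8 / (148 + 1150)
print(f"removed share: {removed_share:.2%}")
```

      With the approximate counts quoted above, the eight removed parcel features amount to well under 1% of the input feature space.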

      Firstly, there will be a relatively high degree of spatial contiguity among voxels because of the nature of the signal measured, i.e. nearby individual voxels are unlikely to be independent. Secondly, the voxel data gives a somewhat misleading sense of precision; the inversion can be set up to give an estimate for each voxel, but there will not just be dependence among adjacent voxels, but also substantial variation in the sensitivity and confidence with which activity can be projected to different parts of the brain. Midline and deeper structures come to mind, where the inversion will be more problematic than for regions along the dorsal convexity of the brain, and a concern is that in those midline structures, the highest decoding accuracy is seen. 

      We definitely agree with the Reviewer that some inter-parcel features representing neighboring (or spatially contiguous) voxels are likely to be correlated. This has been well documented in the MEG literature20,21 and is a particularly important confound to address in functional or effective connectivity analyses (not performed in the present study). In the present analysis, any correlation between adjacent voxels presents a multi-collinearity problem, which effectively reduces the dimensionality of the input feature space. However, as long as there are multiple groups of correlated voxels within each parcel (i.e. - the effective dimensionality is still greater than 1), the intra-parcel spatial patterns could still meaningfully contribute to the decoder performance. Two specific results support this assertion.

      First, we obtained higher decoding accuracy with voxel-space features [74.51% (± SD 7.34%)] compared to parcel-space features [68.77% (± SD 7.6%)] (Figure 3B), indicating that individual voxels carry more information for decoding the keypresses than the averaged voxel-space features or parcel-space features.  Second, individual voxels within a parcel showed varying feature importance scores in decoding keypresses (Author response image 3). This finding supports the Reviewer’s assertion that neighboring voxels express similar information, but also shows that the correlated voxels form mini subclusters that are much smaller spatially than the parcel they reside in.

      Author response image 3.

      Feature importance score of individual voxels in decoding keypresses: MRMR was used to rank the individual voxel-space features in decoding keypresses and the min-max normalized MRMR score was mapped to a structural brain surface. Note that individual voxels within a parcel showed different contributions to decoding.
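
      For readers unfamiliar with the normalization named in the caption, min-max scaling maps the smallest score to 0 and the largest to 1. The sketch below is illustrative only; the function name and the raw scores are hypothetical and do not come from the study.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so that the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores identical: no spread to rescale, so return zeros.
        return [0.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


# Hypothetical feature importance scores for four voxels in one parcel.
raw_scores = [0.02, 0.10, 0.06, 0.18]
normalized = min_max_normalize(raw_scores)
print(normalized)
```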

       

      Some of these concerns could be addressed by recording head movement (with enough precision) to regress out these contributions. The authors state that head movement was monitored with 3 fiducials, and their time courses ought to provide a way to deal with this issue. The ICA procedure may not have sufficiently dealt with removing movement-related problems, but one could eg relate individual components that were identified to the keypresses as another means for checking. An alternative could be to focus on frequency ranges above the movement frequencies. The accuracy for those still seems impressive and may provide a slightly more biologically plausible assessment. 

      We have already addressed the issue of movement related artefacts in the first response above. With respect to a focus on frequency ranges above movement frequencies, the Reviewer states the “accuracy for those still seems impressive and may provide a slightly more biologically plausible assessment”. First, it is important to note that cortical delta-band oscillations measured with local field potentials (LFPs) in macaques are known to contain important information related to end-effector kinematics22,23, muscle activation patterns24 and temporal sequencing25 during skilled reaching and grasping actions. Thus, there is a substantial body of evidence that low-frequency neural oscillatory activity in this range contains important information about the skill learning behavior investigated in the present study. Second, our own data show (which the Reviewer also points out) that significant information related to the skill learning behavior is also present in higher frequency bands (see Figure 2A and Figure 3—figure supplement 1). As we pointed out in our earlier response to questions about the hybrid space decoder architecture (see above), it is likely that different, yet complementary, information is encoded across different temporal frequencies (just as it is encoded across different spatial frequencies). Again, this interpretation is supported by our data as the highest performing classifiers in all cases (when holding all parameters constant) were always constructed from broadband input MEG data (Figure 2A and Figure 3—figure supplement 1).

      One question concerns the interpretation of the results shown in Figure 4. They imply that during the course of learning, entirely different brain networks underpin the behaviour. Not only that, but they also include regions that would seem rather unexpected to be key nodes for learning and expressing relatively simple finger sequences, such as here. What then is the biological plausibility of these results? The authors seem to circumnavigate this issue by moving into a distance metric that captures the (neural network) changes over the course of learning, but the discussion seems detached from which regions are actually involved; or they offer a rather broad discussion of the anatomical regions identified here, eg in the context of LFOs, where they merely refer to "frontoparietal regions". 

      The Reviewer notes the shift in brain networks driving keypress decoding performance between trials 1, 11 and 36 as shown in Figure 4A. The Reviewer questions whether these substantial shifts in brain network states underpinning the skill are biologically plausible, as well as the likelihood that bilateral superior and middle frontal and parietal cortex are important nodes within these networks.

      First, previous fMRI work in humans performing a similar sequence learning task showed that flexibility in brain network composition (i.e. – changes in brain region members displaying coordinated activity) is up-regulated in novel learning environments and explains differences in learning rates across individuals26.  This work supports our interpretation of the present study data, that brain networks engaged in sequential motor skills rapidly reconfigure during early learning.

      Second, frontoparietal network activity is known to support motor memory encoding during early learning27,28. For example, reactivation events in the posterior parietal29 and medial prefrontal30,31 cortex (MPFC) have been temporally linked to hippocampal replay, and are posited to support memory consolidation across several memory domains32, including motor sequence learning1,33,34.  Further, synchronized interactions between MPFC and hippocampus are more prominent during early learning as opposed to later stages27,35,36, perhaps reflecting “redistribution of hippocampal memories to MPFC” 27.  MPFC contributes to very early memory formation by learning associations between contexts, locations, events and adaptive responses during rapid learning37. Consistently, coupling between hippocampus and MPFC has been shown during, and importantly immediately following (rest) initial memory encoding38,39.  Importantly, MPFC activity during initial memory encoding predicts subsequent recall40. Thus, the spatial map required to encode a motor sequence memory may be “built under the supervision of the prefrontal cortex” 28, also engaged in the development of an abstract representation of the sequence41.  In more abstract terms, the prefrontal, premotor and parietal cortices support novice performance “by deploying attentional and control processes” 42-44 required during early learning42-44. The dorsolateral prefrontal cortex (DLPFC) specifically is thought to engage in goal selection and sequence monitoring during early skill practice45, all consistent with the schema model of declarative memory in which prefrontal cortices play an important role in encoding46,47.  Thus, several prefrontal and frontoparietal regions contributing to long term learning 48 are also engaged in early stages of encoding. Altogether, there is strong biological support for the contribution of bilateral prefrontal and frontoparietal regions to decoding during early skill learning.
      We now address this issue in the revised manuscript.

      If I understand correctly, the offline neural representation analysis is in essence the comparison of the last keypress vs the first keypress of the next sequence. In that sense, the activity during offline rest periods is actually not considered. This makes the nomenclature somewhat confusing. While it matches the behavioural analysis, having only key presses one can't do it in any other way, but here the authors actually do have recordings of brain activity during offline rest. So at the very least calling it offline neural representation is misleading to this reviewer because what is compared is activity during the last and during the next keypress, not activity during offline periods. But it also seems a missed opportunity - the authors argue that most of the relevant learning occurs during offline rest periods, yet there is no attempt to actually test whether activity during this period can be useful for the questions at hand here. 

      We agree with the Reviewer that our previous “offline neural representation” nomenclature could be misinterpreted. In the revised manuscript we refer to this difference as the “offline neural representational change”. Please note that our previous work did link offline neural activity (i.e. – 16-22 Hz beta power and neural replay density during inter-practice rest periods) to observed micro-offline gains49.

      Reviewer #2 (Public review): 

      Summary 

      Dash et al. asked whether and how the neural representation of individual finger movements is "contextualized" within a trained sequence during the very early period of sequential skill learning by using decoding of MEG signal. Specifically, they assessed whether/how the same finger presses (pressing index finger) embedded in the different ordinal positions of a practiced sequence (4-1-3-2-4; here, the numbers 1 through 4 correspond to the little through the index fingers of the non-dominant left hand) change their representation (MEG feature). They did this by computing either the decoding accuracy of the index finger at the ordinal positions 1 vs. 5 (index_OP1 vs index_OP5) or pattern distance between index_OP1 vs. index_OP5 at each training trial and found that both the decoding accuracy and the pattern distance progressively increase over the course of learning trials. More interestingly, they also computed the pattern distance for index_OP5 for the last execution of a practice trial vs. index_OP1 for the first execution in the next practice trial (i.e., across the rest period). This "off-line" distance was significantly larger than the "on-line" distance, which was computed within practice trials and predicted micro-offline skill gain. Based on these results, the authors conclude that the differentiation of representation for the identical movement embedded in different positions of a sequential skill ("contextualization") primarily occurs during early skill learning, especially during rest, consistent with the recent theory of the "micro-offline learning" proposed by the authors' group. I think this is an important and timely topic for the field of motor learning and beyond.

      Strengths

      The specific strengths of the current work are as follows. First, the use of temporally rich neural information (MEG signal) has a large advantage over previous studies testing sequential representations using fMRI. This allowed the authors to examine the earliest period (= the first few minutes of training) of skill learning with finer temporal resolution. Second, through the optimization of MEG feature extraction, the current study achieved extremely high decoding accuracy (approx. 94%) compared to previous works. As claimed by the authors, this is one of the strengths of the paper (but see my comments). Third, although some potential refinement might be needed, comparing "online" and "offline" pattern distance is a neat idea. 

      Weaknesses 

      Along with the strengths I raised above, the paper has some weaknesses. First, the pursuit of high decoding accuracy, especially the choice of time points and window length (i.e., 200 msec window starting from 0 msec from key press onset), casts a shadow on the interpretation of the main result. Currently, it is unclear whether the decoding results simply reflect behavioral change or true underlying neural change. As shown in the behavioral data, the key press speed reached 3~4 presses per second already at around the end of the early learning period (11th trial), which means inter-press intervals become as short as 250-330 msec. Thus, in almost more than 60% of training period data, the time window for MEG feature extraction (200 msec) spans around 60% of the inter-press intervals. Considering that the preparation/cueing of subsequent presses starts ahead of the actual press (e.g., Kornysheva et al., 2019) and/or potential online planning (e.g., Ariani and Diedrichsen, 2019), the decoder likely has captured this future-press information as well as the signal related to the current key press, independent of the formation of genuine sequential representation (e.g., "contextualization" of individual press). This may also explain the gradual increase in decoding accuracy or pattern distance between index_OP1 vs. index_OP5 (Figure 4C and 5A), which co-occurred with performance improvement, as shorter inter-press intervals are more favorable for dissociating the two index finger presses followed by different finger presses. The compromised decoding accuracies for the control sequences can be explained by similar logic. Therefore, more careful consideration and elaborated discussion seem necessary when trying to both achieve high-performance decoding and assess early skill learning, as it can impact all the subsequent analyses.

      The Reviewer raises the possibility that (given the windowing parameters used in the present study) an increase in “contextualization” with learning could simply reflect faster typing speeds as opposed to an actual change in the underlying neural representation. The issue can essentially be framed as a mixing problem. As correct sequences are generated at higher and higher speeds over training, MEG activity patterns related to the planning, execution, evaluation and memory of individual keypresses overlap more in time. Thus, increased overlap between the “4” and “1” keypresses (at the start of the sequence) and “2” and “4” keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged (assuming this mixing of representations is used by the classifier to differentially tag each index finger press). If this were the case, it follows that such mixing effects reflecting the ordinal sequence structure would also be observable in the distribution of decoder misclassifications. For example, “4” keypresses would be more likely to be misclassified as “1” or “2” keypresses (or vice versa) than as “3” keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3—figure supplement 3A in the previously submitted manuscript do not show this trend in the distribution of misclassifications across the four fingers.
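
      The logic of this check can be sketched in a few lines (an illustration only, not the analysis code used in the study). For one true class, the sketch computes how the misclassifications are distributed over the other classes; under the mixing account, errors for “4” keypresses should concentrate on the neighbouring keys “1” and “2” rather than on “3”. The function name and all counts in the confusion matrix are hypothetical.

```python
def misclassification_shares(confusion, true_label):
    """For one true class, return the share of its errors assigned to each other class."""
    row = confusion[true_label]
    errors = {pred: count for pred, count in row.items() if pred != true_label}
    total_errors = sum(errors.values())
    if total_errors == 0:
        return {pred: 0.0 for pred in errors}
    return {pred: count / total_errors for pred, count in errors.items()}


# Hypothetical confusion counts: confusion[true][predicted].
confusion = {
    "1": {"1": 94, "2": 2, "3": 2, "4": 2},
    "2": {"1": 2, "2": 94, "3": 2, "4": 2},
    "3": {"1": 2, "2": 2, "3": 94, "4": 2},
    "4": {"1": 3, "2": 3, "3": 3, "4": 191},
}

shares = misclassification_shares(confusion, "4")
# In the sequence 4-1-3-2-4, keys "1" and "2" are adjacent to "4"; key "3" is not.
neighbour_share = (shares["1"] + shares["2"]) / 2
non_neighbour_share = shares["3"]
print(f"mean share per neighbouring key: {neighbour_share:.3f}")
print(f"share for the non-neighbouring key: {non_neighbour_share:.3f}")
```

      In this hypothetical example the errors are spread evenly, so the two shares are equal; a mixing effect would instead show up as a larger share for the neighbouring keys.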

      Moreover, if the representation distance is largely driven by this mixing effect, it’s also possible that the increased overlap between consecutive index finger keypresses during the 4-4 transition marking the end of one sequence and the beginning of the next one could actually mask contextualization-related changes to the underlying neural representations and make them harder to detect. In this case, a decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position might show decreased performance with learning as adjacent keypresses overlapped in time with each other to an increasing extent. However, Figure 4C in our previously submitted manuscript does not support this possibility, as the 2-class hybrid classifier displays improved classification performance over early practice trials despite greater temporal overlap.

      We also conducted a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis affirmed that the possible alternative explanation put forward by the Reviewer is not supported by our data (Adjusted R² = 0.00431; F = 5.62). We now include this new negative control analysis result in the revised manuscript.
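
      Two ingredients of this analysis, within-subject z-scoring and the adjusted R² statistic, can be sketched as follows (illustrative only, not the code used in the study). The transition times, the R² value and the number of observations below are hypothetical, and the use of the sample standard deviation for z-scoring is an assumption; only the number of predictors (three transition times) comes from the response.

```python
from statistics import mean, stdev


def zscore(values):
    """Z-score one subject's values (assumption: sample standard deviation)."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical 4-1 transition times (ms) for one subject across correct sequences.
transition_times_ms = [310.0, 290.0, 275.0, 260.0, 255.0]
z_times = zscore(transition_times_ms)

# Hypothetical model fit: R^2 = 0.01 from 500 sequences and 3 predictors.
adj = adjusted_r2(r2=0.01, n_obs=500, n_predictors=3)
print(f"adjusted R^2: {adj:.5f}")
```

      An adjusted R² close to zero, as reported above, means that the transition times explain almost none of the variance in the representation distance score.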

      Overall, we do strongly agree with the Reviewer that the naturalistic, self-paced, generative task employed in the present study results in overlapping brain processes related to planning, execution, evaluation and memory of the action sequence. We also agree that there are several tradeoffs to consider in the construction of the classifiers depending on the study aim. Given our aim of optimizing keypress decoder accuracy in the present study, the set of trade-offs resulted in representations reflecting more the latter three processes, and less so the planning component. Whether separate decoders can be constructed to tease apart the representations or networks supporting these overlapping processes is an important future direction of research in this area. For example, work presently underway in our lab constrains the selection of windowing parameters in a manner that allows individual classifiers to be temporally linked to specific planning, execution, evaluation or memory-related processes to discern which brain networks are involved and how they adaptively reorganize with learning. Results from the present study (Figure 4—figure supplement 2) showing hybrid-space decoder prediction accuracies exceeding 74% for temporal windows spanning as little as 25ms and located up to 100ms prior to the keyDown event strongly support the feasibility of such an approach.

      Related to the above point, testing only one particular sequence (4-1-3-2-4), aside from the control ones, limits the generalizability of the finding. This also may have contributed to the extremely high decoding accuracy reported in the current study. 

      The Reviewer raises a question about the generalizability of the decoder accuracy reported in our study. Fortunately, a comparison between decoder performances on Day 1 and Day 2 datasets does provide some insight into this issue. As the Reviewer points out, the classifiers in this study were trained and tested on keypresses performed while practicing a specific sequence (4-1-3-2-4). The study was designed this way so as to avoid the impact of interference effects on learning dynamics. The cross-validated performance of classifiers on MEG data collected within the same session was 90.47% overall accuracy (4-class; Figure 3C). We then tested classifier performance on data collected during a separate MEG session conducted approximately 24 hours later (Day 2; see Figure 3—supplement 3). We observed a reduction in overall accuracy rate to 87.11% when tested on MEG data recorded while participants performed the same learned sequence, and 79.44% when they performed several previously unpracticed sequences. Both changes in accuracy are important with regard to the generalizability of our findings. First, 87.11% performance accuracy for the trained sequence data on Day 2 (a reduction of only 3.36 percentage points) indicates that the hybrid-space decoder performance is robust over multiple MEG sessions, and thus, robust to variations in SNR across the MEG sensor array caused by small differences in head position between scans. This indicates a substantial advantage over sensor-space decoding approaches. Furthermore, when tested on data from unpracticed sequences, overall performance dropped an additional 7.67 percentage points. This difference reflects the performance bias of the classifier for the trained sequence, possibly caused by high-order sequence structure being incorporated into the feature weights. In the future, it will be important to understand in more detail how random or repeated keypress sequence training data impacts overall decoder performance and generalization. 
We strongly agree with the Reviewer that the issue of generalizability is extremely important and have added a new paragraph to the Discussion in the revised manuscript highlighting the strengths and weaknesses of our study with respect to this issue.

      In terms of clinical BCI, one of the potential relevance of the study, as claimed by the authors, it is not clear that the specific time window chosen in the current study (up to 200 msec since key press onset) is really useful. In most cases, clinical BCI would target neural signals with no overt movement execution due to patients' inability to move (e.g., Hochberg et al., 2012). Given the time window, the surprisingly high performance of the current decoder may result from sensory feedback and/or planning of subsequent movement, which may not always be available in the clinical BCI context. Of course, the decoding accuracy is still much higher than chance even when using signal before the key press (as shown in Figure 4 Supplement 2), but it is not immediately clear to me that the authors relate their high decoding accuracy based on post-movement signal to clinical BCI settings.

      The Reviewer questions the relevance of the specific window parameters used in the present study for clinical BCI applications, particularly for paretic patients who are unable to produce finger movements or for whom afferent sensory feedback is no longer intact. We strongly agree with the Reviewer that any intended clinical application must carefully consider these specific input feature constraints dictated by the clinical cohort, and in turn impose appropriate and complementary constraints on classifier parameters that may differ from the ones used in the present study. We now highlight this issue in the Discussion of the revised manuscript and relate our present findings to published clinical BCI work within this context.

      One of the important and fascinating claims of the current study is that the "contextualization" of individual finger movements in a trained sequence specifically occurs during short rest periods in very early skill learning, echoing the recent theory of micro-offline learning proposed by the authors' group. Here, I think two points need to be clarified. First, the concept of "contextualization" is kept somewhat blurry throughout the text. It is only at the later part of the Discussion (around line #330 on page 13) that some potential mechanism for the "contextualization" is provided as "what-and-where" binding. Still, it is unclear what "contextualization" actually is in the current data, as the MEG signal analyzed is extracted from 0-200 msec after the keypress. If one thinks something is contextualizing an action, that contextualization should come earlier than the action itself. 

      The Reviewer requests that we: 1) more clearly define our use of the term “contextualization” and 2) provide the rationale for assessing it over a 200ms window aligned to the keyDown event. This choice of window parameters means that the MEG activity used in our analysis was coincident with, rather than preceding, the actual keypresses.  We define contextualization as the differentiation of representation for the identical movement embedded in different positions of a sequential skill. That is, representations of individual action elements progressively incorporate information about their relationship to the overall sequence structure as the skill is learned. We agree with the Reviewer that this can be appropriately interpreted as “what-and-where” binding. We now incorporate this definition in the Introduction of the revised manuscript as requested.

      The window parameters for optimizing decoding accuracy for individual finger movements were determined using a grid search of the parameter space (a sliding window of variable width between 25-350 ms with 25 ms increments variably aligned from 0 to +100ms with 10ms increments relative to the keyDown event). This approach generated 140 different temporal windows for each keypress for each participant, with the final parameter selection determined through comparison of the resulting performance between each decoder. Importantly, the decision to optimize for decoding accuracy placed an emphasis on keypress representations characterized by the most consistent and robust features shared across subjects, which in turn maximize statistical power in detecting common learning-related changes. In this case, the optimal window encompassed a 200ms epoch aligned to the keyDown event (t0 = 0 ms). We then asked if the representations (i.e. – spatial patterns of combined parcel- and voxel-space activity) of the same digit at two different sequence positions changed with practice within this optimal decoding window. Of course, our findings do not rule out the possibility that contextualization can also be found before or even after this time window, as we did not directly address this issue in the present study. Ongoing work in our lab, as pointed out above, is investigating contextualization within different time windows tailored specifically for assessing sequence skill action planning, execution, evaluation and memory processes.
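
      The general shape of such a window grid search can be sketched in a few lines of Python. The grid values below are deliberately a small toy grid rather than the grid described above, and `toy_accuracy` is a hypothetical stand-in for training and cross-validating a decoder on each window; neither is taken from the study.

```python
from itertools import product


def candidate_windows(widths_ms, onsets_ms):
    """Enumerate (start, end) windows, in ms relative to the keyDown event."""
    return [(onset, onset + width) for width, onset in product(widths_ms, onsets_ms)]


def best_window(windows, score_fn):
    """Return the window with the highest score under score_fn."""
    return max(windows, key=score_fn)


def toy_accuracy(window):
    """Hypothetical score that peaks for a 200 ms window starting at 0 ms.

    A real analysis would fit and cross-validate a decoder here.
    """
    start, end = window
    return -abs((end - start) - 200) - start


# Toy grid: three widths and two onsets give six candidate windows.
windows = candidate_windows(widths_ms=[100, 200, 300], onsets_ms=[0, 50])
best = best_window(windows, toy_accuracy)
```

      Because the selection criterion is decoding accuracy alone, the chosen window favors whatever features are most consistent, which is the trade-off discussed in the surrounding text.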

      The second point is that the result provided by the authors is not yet convincing enough to support the claim that "contextualization" occurs during rest. In the original analysis, the authors presented the statistical significance regarding the correlation between the "offline" pattern differentiation and micro-offline skill gain (Figure 5. Supplement 1), as well as the larger "offline" distance than "online" distance (Figure 5B). However, this analysis looks like regressing two variables (monotonically) increasing as a function of the trial. Although some information in this analysis, such as what the independent/dependent variables were or how individual subjects were treated, was missing in the Methods, getting a statistically significant slope seems unsurprising in such a situation. Also, curiously, the same quantitative evidence was not provided for its "online" counterpart, and the authors only briefly mentioned in the text that there was no significant correlation between them. It may be true looking at the data in Figure 5A as the online representation distance looks less monotonically changing, but the classification accuracy presented in Figure 4C, which should reflect similar representational distance, shows a more monotonic increase up to the 11th trial. Further, the ways the "online" and "offline" representation distance was estimated seem to make them not directly comparable. While the "online" distance was computed using all the correct press data within each 10 sec of execution, the "offline" distance is basically computed by only two presses (i.e., the last index_OP5 vs. the first index_OP1 separated by 10 sec of rest). Theoretically, the distance between the neural activity patterns for temporally closer events tends to be closer than that between the patterns for temporally far-apart events. It would be fairer to use the distance between the first index_OP1 vs. the last index_OP5 within an execution period for "online" distance, as well. 

      The Reviewer suggests that the current data is not convincing enough to show that contextualization occurs during rest and raises two important concerns: 1) the relationship between online contextualization and micro-online gains is not shown, and 2) the online distance was calculated differently from its offline counterpart (i.e. - instead of calculating the distance between last IndexOP5 and first IndexOP1 from a single trial, the distance was calculated for each sequence within a trial and then averaged).

      We addressed the first concern by performing individual subject correlations between 1) contextualization changes during rest intervals and micro-offline gains; 2) contextualization changes during practice trials and micro-online gains, and 3) contextualization changes during practice trials and micro-offline gains (Author response image 4). We then statistically compared the resulting correlation coefficient distributions and found that within-subject correlations for contextualization changes during rest intervals and micro-offline gains were significantly higher than online contextualization and micro-online gains (t = 3.2827, p = 0.0015) and online contextualization and micro-offline gains (t = 3.7021, p = 5.3013e-04). These results are consistent with our interpretation that micro-offline gains are supported by contextualization changes during the inter-practice rest period.

      Author response image 4.

      Distribution of individual subject correlation coefficients between contextualization changes occurring during practice or rest with micro-online and micro-offline performance gains. Note that the correlation distributions were significantly higher for the relationship between contextualization changes during rest and micro-offline gains than for contextualization changes during practice and either micro-online or micro-offline gains.
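
      The two-step logic of this comparison (one correlation coefficient per subject, then a test across subjects) can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: the response does not specify the type of t-test, so a paired t-statistic is assumed here, and the per-subject coefficients in the example are invented toy values.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def paired_t(a, b):
    """Paired t-statistic for matched samples a and b (df = n - 1)."""
    diffs = [p - q for p, q in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Step 1 (per subject): correlate trial-wise contextualization change with
# trial-wise gains, once for the rest intervals and once for practice.
# Step 2 (across subjects): compare the two sets of coefficients.
r_offline = [0.5, 0.6, 0.7]  # toy per-subject r values, rest vs. micro-offline
r_online = [0.1, 0.3, 0.2]   # toy per-subject r values, practice vs. micro-online
t_stat = paired_t(r_offline, r_online)
```

      Correlation coefficients are often Fisher z-transformed before such a comparison; that step is omitted here for brevity.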

      With respect to the second concern highlighted above, we agree with the Reviewer that one limitation of the analysis comparing online versus offline changes in contextualization as presented in the reviewed manuscript, is that it does not eliminate the possibility that any differences could simply be explained by the passage of time (which is smaller for the online analysis compared to the offline analysis). The Reviewer suggests an approach that addresses this issue, which we have now carried out. When quantifying online changes in contextualization from the first IndexOP1 to the last IndexOP5 keypress in the same trial, we observed no learning-related trend (Author response image 5, right panel). Importantly, offline distances were significantly larger than online distances regardless of the measurement approach and neither predicted online learning (Author response image 6).

      Author response image 5.

      Trial by trial trend of offline (left panel) and online (middle and right panels) changes in contextualization. Offline changes in contextualization were assessed by calculating the distance between neural representations for the last IndexOP5 keypress in the previous trial and the first IndexOP1 keypress in the present trial. Two different approaches were used to characterize online contextualization changes. The analysis included in the reviewed manuscript (middle panel) calculated the distance between IndexOP1 and IndexOP5 for each correct sequence, which was then averaged across the trial. This approach is limited by the lack of control for the passage of time when making online versus offline comparisons. Thus, the second approach controlled for the passage of time by calculating distance between the representations associated with the first IndexOP1 keypress and the last IndexOP5 keypress within the same trial. Note that while the first approach showed an increasing online contextualization trend with practice, the second approach did not.
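
      The three distance measures described in this caption differ only in which pair of keypress representations is compared. The Python sketch below makes that explicit. It is illustrative only: the Euclidean metric, the data layout (a trial as a list of correct sequences, each holding one feature vector for IndexOP1 and one for IndexOP5) and all function names are assumptions, and the two-dimensional vectors are toy values.

```python
from math import dist
from statistics import mean


def online_within_sequence(trial):
    """Mean OP1-to-OP5 distance over every correct sequence in a trial."""
    return mean(dist(seq["op1"], seq["op5"]) for seq in trial)


def online_across_trial(trial):
    """Distance from the first OP1 to the last OP5 within the same trial."""
    return dist(trial[0]["op1"], trial[-1]["op5"])


def offline(prev_trial, next_trial):
    """Distance from the last OP5 of one trial to the first OP1 of the next."""
    return dist(prev_trial[-1]["op5"], next_trial[0]["op1"])


# Two toy trials with 2-D feature vectors.
trial_a = [{"op1": (0, 0), "op5": (3, 4)}, {"op1": (0, 0), "op5": (6, 8)}]
trial_b = [{"op1": (6, 0), "op5": (0, 0)}]

within = online_within_sequence(trial_a)   # mean of 5 and 10
across = online_across_trial(trial_a)      # first OP1 vs. last OP5
between = offline(trial_a, trial_b)        # spans the rest interval
```

      The second and third measures both compare a single pair of keypresses separated by a comparable interval, which is what makes them better matched for an online versus offline comparison.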

      Author response image 6.

      Relationship between online contextualization and online learning is shown for both within-sequence (left; note that this is the online contextualization measure used in the reviewed manuscript) and across-sequence (right) distance calculation. There was no significant relationship between online learning and online contextualization regardless of the measurement approach.

      A related concern regarding the control analysis, where individual values for max speed and the degree of online contextualization were compared (Figure 5 Supplement 3), is whether the individual difference is meaningful. If I understood correctly, the optimization of the decoding process (temporal window, feature inclusion/reduction, decoder, etc.) was performed for individual participants, and the same feature extraction was also employed for the analysis of representation distance (i.e., contextualization). If this is the case, the distances are individually differently calculated and they may need to be normalized relative to some stable reference (e.g., 1 vs. 4 or average distance within the control sequence presses) before comparison across the individuals. 

      The Reviewer makes a good point here. We have now implemented the suggested normalization procedure in the analysis provided in the revised manuscript.

      Reviewer #3 (Public review): 

      Summary: 

      One goal of this paper is to introduce a new approach for highly accurate decoding of finger movements from human magnetoencephalography data via dimension reduction of a "multi-scale, hybrid" feature space. Following this decoding approach, the authors aim to show that early skill learning involves "contextualization" of the neural coding of individual movements, relative to their position in a sequence of consecutive movements. Furthermore, they aim to show that this "contextualization" develops primarily during short rest periods interspersed with skill training and correlates with a performance metric which the authors interpret as an indicator of offline learning. 

      Strengths: 

      A clear strength of the paper is the innovative decoding approach, which achieves impressive decoding accuracies via dimension reduction of a "multi-scale, hybrid space". This hybrid-space approach follows the neurobiologically plausible idea of the concurrent distribution of neural coding across local circuits as well as large-scale networks. A further strength of the study is the large number of tested dimension reduction techniques and classifiers (though the manuscript reveals little about the comparison of the latter). 

      We appreciate the Reviewer’s comments regarding the paper’s strengths.

      A simple control analysis based on shuffled class labels could lend further support to this complex decoding approach. As a control analysis that completely rules out any source of overfitting, the authors could test the decoder after shuffling class labels. Following such shuffling, decoding accuracies should drop to chance level for all decoding approaches, including the optimized decoder. This would also provide an estimate of actual chance-level performance (which is informative over and beyond the theoretical chance level). Furthermore, currently, the manuscript does not explain the huge drop in decoding accuracies for the voxel-space decoding (Figure 3B). Finally, the authors' approach to cortical parcellation raises questions regarding the information carried by varying dipole orientations within a parcel (which currently seems to be ignored?) and the implementation of the mean-flipping method (given that there are two dimensions - space and time - what do the authors refer to when they talk about the sign of the "average source", line 477?). 

      The Reviewer recommends that we: 1) conduct an additional control analysis on classifier performance using shuffled class labels, 2) provide a more detailed explanation regarding the drop in decoding accuracies for the voxel-space decoding following LDA dimensionality reduction (see Fig 3B), and 3) provide additional details on how problems related to dipole solution orientations were addressed in the present study.  

      In relation to the first point, we have now implemented a random shuffling approach as a control for the classification analyses. The results of this analysis indicated that the chance level accuracy was 22.12% (± SD 9.1%) for individual keypress decoding (4-class classification), and 18.41% (± SD 7.4%) for individual sequence item decoding (5-class classification), irrespective of the input feature set or the type of decoder used. Thus, the decoding accuracy observed with the final model was substantially higher than these chance levels.  
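
      The logic of a shuffled-label control can be illustrated with a self-contained Python sketch. Everything in it is a toy assumption rather than the pipeline used in the study: a nearest-centroid classifier stands in for the actual decoders, the data are four well-separated synthetic clusters, and only the training labels are permuted.

```python
import random


def centroids(points, labels):
    """Mean (x, y) position of the training points carrying each label."""
    sums = {}
    for (x, y), lab in zip(points, labels):
        sx, sy, n = sums.get(lab, (0.0, 0.0, 0))
        sums[lab] = (sx + x, sy + y, n + 1)
    return {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}


def predict(point, cents):
    """Label of the centroid nearest to point (squared Euclidean distance)."""
    return min(cents, key=lambda lab: (point[0] - cents[lab][0]) ** 2
               + (point[1] - cents[lab][1]) ** 2)


def accuracy(train_pts, train_labs, test_pts, test_labs):
    cents = centroids(train_pts, train_labs)
    hits = sum(predict(p, cents) == lab for p, lab in zip(test_pts, test_labs))
    return hits / len(test_labs)


def shuffled_label_accuracy(train_pts, train_labs, test_pts, test_labs,
                            n_perm=200, seed=0):
    """Mean test accuracy after randomly permuting the training labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perm):
        permuted = list(train_labs)
        rng.shuffle(permuted)
        scores.append(accuracy(train_pts, permuted, test_pts, test_labs))
    return sum(scores) / n_perm


def sample(rng, n_per_class):
    """Toy 4-class data: one tight cluster per class."""
    centers = {1: (0, 0), 2: (10, 0), 3: (0, 10), 4: (10, 10)}
    pts, labs = [], []
    for lab, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            pts.append((cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)))
            labs.append(lab)
    return pts, labs


data_rng = random.Random(1)
train_pts, train_labs = sample(data_rng, 10)
test_pts, test_labs = sample(data_rng, 10)
real_acc = accuracy(train_pts, train_labs, test_pts, test_labs)
chance_acc = shuffled_label_accuracy(train_pts, train_labs, test_pts, test_labs)
```

      With intact labels the toy classifier separates the clusters perfectly, whereas the shuffled-label average settles near the theoretical 4-class chance level of 25%. The spread of the permutation scores, not only their mean, is what gives an empirical estimate of chance-level variability.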

      Second, please note that the dimensionality of the voxel-space feature set is very high (i.e. – 15684). LDA attempts to map the input features onto a much smaller dimensional space (number of classes-1; e.g. –  3 dimensions, for 4-class keypress decoding). Given the very high dimension of the voxel-space input features in this case, the resulting mapping exhibits reduced accuracy. Despite this general consideration, please refer to Figure 3—figure supplement 3, where we observe improvement in voxel-space decoder performance when utilizing alternative dimensionality reduction techniques.

      The decoders constructed in the present study assess the average spatial patterns across time (as defined by the windowing procedure) in the input feature space.  We now provide additional details in the Methods of the revised manuscript pertaining to the parcellation procedure and how the sign ambiguity problem was addressed in our analysis.

      Weaknesses: 

      A clear weakness of the paper lies in the authors' conclusions regarding "contextualization". Several potential confounds, described below, question the neurobiological implications proposed by the authors and provide a simpler explanation of the results. Furthermore, the paper follows the assumption that short breaks result in offline skill learning, while recent evidence, described below, casts doubt on this assumption. 

      We thank the Reviewer for giving us the opportunity to address these issues in detail (see below).

      The authors interpret the ordinal position information captured by their decoding approach as a reflection of neural coding dedicated to the local context of a movement (Figure 4). One way to dissociate ordinal position information from information about the moving effectors is to train a classifier on one sequence and test the classifier on other sequences that require the same movements, but in different positions (ref. 50). In the present study, however, participants trained to repeat a single sequence (4-1-3-2-4). As a result, ordinal position information is potentially confounded by the fixed finger transitions around each of the two critical positions (first and fifth press). Across consecutive correct sequences, the first keypress in a given sequence was always preceded by a movement of the index finger (=last movement of the preceding sequence), and followed by a little finger movement. The last keypress, on the other hand, was always preceded by a ring finger movement, and followed by an index finger movement (=first movement of the next sequence). Figure 4 - Supplement 2 shows that finger identity can be decoded with high accuracy (>70%) across a large time window around the time of the key press, up to at least +/-100 ms (and likely beyond, given that decoding accuracy is still high at the boundaries of the window depicted in that figure). This time window approaches the keypress transition times in this study. Given that distinct finger transitions characterized the first and fifth keypress, the classifier could thus rely on persistent (or "lingering") information from the preceding finger movement, and/or "preparatory" information about the subsequent finger movement, in order to dissociate the first and fifth keypress. 
Currently, the manuscript provides no evidence that the context information captured by the decoding approach is more than a by-product of temporally extended, and therefore overlapping, but independent neural representations of consecutive keypresses that are executed in close temporal proximity - rather than a neural representation dedicated to context. 

      Such temporal overlap of consecutive, independent finger representations may also account for the dynamics of "ordinal coding"/"contextualization", i.e., the increase in 2-class decoding accuracy, across Day 1 (Figure 4C). As learning progresses, both tapping speed and the consistency of keypress transition times increase (Figure 1), i.e., consecutive keypresses are closer in time, and more consistently so. As a result, information related to a given keypress is increasingly overlapping in time with information related to the preceding and subsequent keypresses. The authors seem to argue that their regression analysis in Figure 5 - Figure Supplement 3 speaks against any influence of tapping speed on "ordinal coding" (even though that argument is not made explicitly in the manuscript). However, Figure 5 - Figure Supplement 3 shows inter-individual differences in a between-subject analysis (across trials, as in panel A, or separately for each trial, as in panel B), and, therefore, says little about the within-subject dynamics of "ordinal coding" across the experiment. A regression of trial-by-trial "ordinal coding" on trial-by-trial tapping speed (either within-subject or at a group-level, after averaging across subjects) could address this issue. Given the highly similar dynamics of "ordinal coding" on the one hand (Figure 4C), and tapping speed on the other hand (Figure 1B), I would expect a strong relationship between the two in the suggested within-subject (or group-level) regression. Furthermore, learning should increase the number of (consecutively) correct sequences, and, thus, the consistency of finger transitions. 
Therefore, the increase in 2-class decoding accuracy may simply reflect an increasing overlap in time of increasingly consistent information from consecutive keypresses, which allows the classifier to dissociate the first and fifth keypress more reliably as learning progresses, simply based on the characteristic finger transitions associated with each. In other words, given that the physical context of a given keypress changes as learning progresses - keypresses move closer together in time and are more consistently correct - it seems problematic to conclude that the mental representation of that context changes. To draw that conclusion, the physical context should remain stable (or any changes to the physical context should be controlled for). 

      The issues raised by Reviewer #3 here are similar to two issues raised by Reviewer #2 above, and we agree they must both be carefully considered in any evaluation of our findings.

      As both Reviewers pointed out, the classifiers in this study were trained and tested on keypresses performed while practicing a specific sequence (4-1-3-2-4). The study was designed this way so as to avoid the impact of interference effects on learning dynamics. The cross-validated performance of classifiers on MEG data collected within the same session was 90.47% overall accuracy (4-class; Figure 3C). We then tested classifier performance on data collected during a separate MEG session conducted approximately 24 hours later (Day 2; see Figure 3—supplement 3). We observed a reduction in overall accuracy rate to 87.11% when tested on MEG data recorded while participants performed the same learned sequence, and 79.44% when they performed several previously unpracticed sequences. This classification performance difference of 7.67 percentage points when tested on the Day 2 data could reflect the performance bias of the classifier for the trained sequence, possibly caused by mixed information from temporally close keypresses being incorporated into the feature weights.

      Along these same lines, both Reviewers also raise the possibility that an increase in “ordinal coding/contextualization” with learning could simply reflect an increase in this mixing effect caused by faster typing speeds as opposed to an actual change in the underlying neural representation. The basic idea is that as correct sequences are generated at higher and higher speeds over training, MEG activity patterns related to the planning, execution, evaluation and memory of individual keypresses overlap more in time. Thus, increased overlap between the “4” and “1” keypresses (at the start of the sequence) and “2” and “4” keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged (assuming this mixing of representations is used by the classifier to differentially tag each index finger press). If this were the case, it follows that such mixing effects reflecting the ordinal sequence structure would also be observable in the distribution of decoder misclassifications. For example, “4” keypresses would be more likely to be misclassified as “1” or “2” keypresses (or vice versa) than as “3” keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3—figure supplement 3A in the previously submitted manuscript do not show this trend in the distribution of misclassifications across the four fingers.

      Following this logic, it’s also possible that if the ordinal coding is largely driven by this mixing effect, the increased overlap between consecutive index finger keypresses during the 4-4 transition marking the end of one sequence and the beginning of the next one could actually mask contextualization-related changes to the underlying neural representations and make them harder to detect. In this case, a decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position might show decreased performance with learning as adjacent keypresses overlapped in time with each other to an increasing extent. However, Figure 4C in our previously submitted manuscript does not support this possibility, as the 2-class hybrid classifier displays improved classification performance over early practice trials despite greater temporal overlap.

      As noted in the above reply to Reviewer #2, we also conducted a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis affirmed that the possible alternative explanation put forward by the Reviewer is not supported by our data (Adjusted R2 = 0.00431; F = 5.62). We now include this new negative control analysis result in the revised manuscript.

      Finally, the Reviewer hints that one way to address this issue would be to compare MEG responses before and after learning for sequences typed at a fixed speed. However, given that the speed-accuracy trade-off should improve with learning, a comparison between unlearned and learned skill states would dictate that the skill be evaluated at a very low fixed speed. Essentially, such a design presents the problem that the post-training test is evaluating the representation in the unlearned behavioral state that is not representative of the acquired skill. Thus, this approach would not address our experimental question: “do neural representations of the same action performed at different locations within a skill sequence contextually differentiate or remain stable as learning evolves”.

      A similar difference in physical context may explain why neural representation distances ("differentiation") differ between rest and practice (Figure 5). The authors define "offline differentiation" by comparing the hybrid space features of the last index finger movement of a trial (ordinal position 5) and the first index finger movement of the next trial (ordinal position 1). However, the latter is not only the first movement in the sequence but also the very first movement in that trial (at least in trials that started with a correct sequence), i.e., not preceded by any recent movement. In contrast, the last index finger of the last correct sequence in the preceding trial includes the characteristic finger transition from the fourth to the fifth movement. Thus, there is more overlapping information arising from the consistent, neighbouring keypresses for the last index finger movement, compared to the first index finger movement of the next trial. A strong difference (larger neural representation distance) between these two movements is, therefore, not surprising, given the task design, and this difference is also expected to increase with learning, given the increase in tapping speed, and the consequent stronger overlap in representations for consecutive keypresses. Furthermore, initiating a new sequence involves pre-planning, while ongoing practice relies on online planning (Ariani et al., eNeuro 2021), i.e., two mental operations that are dissociable at the level of neural representation (Ariani et al., bioRxiv 2023). 

      The Reviewer argues that the last finger movement of a trial and the first in the next trial are performed in different circumstances and contexts. This is an important point and one we tend to agree with. For this task, the first sequence in a practice trial (which is pre-planned offline) is performed in a somewhat different context from the sequence iterations that follow, which involve temporally overlapping planning, execution and evaluation processes. The Reviewer is particularly concerned about a difference in the temporal mixing effect issue raised above between the first and last keypresses performed in a trial. However, in contrast to the Reviewer's stated argument above, findings from Kornysheva et al. (2019) showed that neural representations of individual actions are competitively queued during the pre-planning period in a manner that reflects the ordinal structure of the learned sequence. Thus, mixing effects are likely still present for the first keypress in a trial. Also note that we now present new control analyses in multiple responses above confirming that hypothetical mixing effects between adjacent keypresses do not explain our reported contextualization finding. A statement addressing these possibilities raised by the Reviewer has been added to the Discussion in the revised manuscript.

      In relation to pre-planning, ongoing MEG work in our lab is investigating contextualization within different time windows tailored specifically for assessing how sequence skill action planning evolves with learning.

      Given these differences in the physical context and associated mental processes, it is not surprising that "offline differentiation", as defined here, is more pronounced than "online differentiation". For the latter, the authors compared movements that were better matched regarding the presence of consistent preceding and subsequent keypresses (online differentiation was defined as the mean difference between all first vs. last index finger movements during practice).  It is unclear why the authors did not follow a similar definition for "online differentiation" as for "micro-online gains" (and, indeed, a definition that is more consistent with their definition of "offline differentiation"), i.e., the difference between the first index finger movement of the first correct sequence during practice, and the last index finger of the last correct sequence. While these two movements are, again, not matched for the presence of neighbouring keypresses (see the argument above), this mismatch would at least be the same across "offline differentiation" and "online differentiation", so they would be more comparable. 

      This is the same point made earlier by Reviewer #2, and we agree with this assessment. As stated in the response to Reviewer #2 above, we have now carried out quantification of online contextualization using this approach and included it in the revised manuscript. We thank the Reviewer for this suggestion.
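
      As a rough illustration of the two definitions being compared here, the sketch below computes an "online" distance (first vs. last index finger keypress within the same trial) and an "offline" distance (last index finger keypress of one trial vs. first index finger keypress of the next) from per-keypress feature vectors. The array layout, the function names and the choice of Euclidean distance are assumptions made for illustration only; this is not the pipeline used in the manuscript.

```python
import numpy as np

def online_differentiation(first_feats, last_feats):
    """Within-trial distance between first and last index finger keypress.

    Both inputs are hypothetical arrays of shape (n_trials, n_features).
    """
    return np.linalg.norm(last_feats - first_feats, axis=1)

def offline_differentiation(first_feats, last_feats):
    """Across-rest distance: last keypress of trial t vs. first of trial t+1."""
    return np.linalg.norm(first_feats[1:] - last_feats[:-1], axis=1)

# Synthetic features standing in for real per-keypress neural features.
rng = np.random.default_rng(0)
first = rng.normal(size=(5, 8))
last = rng.normal(size=(5, 8))

online = online_differentiation(first, last)    # one value per trial
offline = offline_differentiation(first, last)  # one value per rest period
```

      Defining both measures on the same pair of keypress types (first and last correct index finger movement) means any mismatch in neighbouring keypresses affects the two measures equally, which is the comparability argument made by the Reviewer.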

      A further complication in interpreting the results regarding "contextualization" stems from the visual feedback that participants received during the task. Each keypress generated an asterisk shown above the string on the screen, irrespective of whether the keypress was correct or incorrect. As a result, incorrect (e.g., additional, or missing) keypresses could shift the phase of the visual feedback string (of asterisks) relative to the ordinal position of the current movement in the sequence (e.g., the fifth movement in the sequence could coincide with the presentation of any asterisk in the string, from the first to the fifth). Given that more incorrect keypresses are expected at the start of the experiment, compared to later stages, the consistency in visual feedback position, relative to the ordinal position of the movement in the sequence, increased across the experiment. A better differentiation between the first and the fifth movement with learning could, therefore, simply reflect better decoding of the more consistent visual feedback, based either on the feedback-induced brain response, or feedback-induced eye movements (the study did not include eye tracking). It is not clear why the authors introduced this complicated visual feedback in their task, besides consistency with their previous studies.

      We strongly agree with the Reviewer that eye movements related to task engagement are important to rule out as a potential driver of the decoding accuracy or contextualization effect. We address this issue above in response to a question raised by Reviewer #1 about the impact of movement related artefacts in general on our findings.

      First, the assumption the Reviewer makes here about the distribution of errors in this task is incorrect. On average across subjects, 2.32% ± 1.48% (mean ± SD) of all keypresses performed were errors, which were evenly distributed across the four possible keypress responses. While errors increased progressively over practice trials, they did so in proportion to the increase in correct keypresses, so that the overall ratio of correct-to-incorrect keypresses remained stable over the training session. Thus, the Reviewer’s assumptions that there is a higher relative frequency of errors in early trials, and a resulting systematic trend in the phase shift between the visual display updates (i.e., a change in asterisk position above the displayed sequence) and the keypress performed, are not substantiated by the data. To the contrary, the asterisk position on the display and the keypress being executed remained highly correlated over the entire training session. We now include a statement about the frequency and distribution of errors in the revised manuscript.
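
      The distinction between a rising error count and a stable error proportion can be made concrete with a small sketch. The function name and the toy keypress counts below are hypothetical and are not data from the study; the sketch only shows how one might check for a trend in the per-trial error proportion.

```python
import numpy as np

def error_proportion_trend(n_correct, n_errors):
    """Return per-trial error proportions and their linear slope over trials."""
    n_correct = np.asarray(n_correct, dtype=float)
    n_errors = np.asarray(n_errors, dtype=float)
    proportion = n_errors / (n_correct + n_errors)
    # Slope of a first-order fit; a value near zero means no trend over trials.
    slope = np.polyfit(np.arange(len(proportion)), proportion, 1)[0]
    return proportion, slope

# Toy counts: errors triple across trials, but so do correct keypresses.
proportion, slope = error_proportion_trend([49, 98, 147], [1, 2, 3])
```

      In this toy case the absolute number of errors grows while the proportion stays at 2% on every trial, so the fitted slope is zero up to floating-point error.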

      Given this high correlation, we firmly agree with the Reviewer that the issue of eye movement-related artefacts is still an important one to address. Fortunately, we did collect eye movement data during the MEG recordings and so were able to investigate this. As detailed in the response to Reviewer #1 above, we found that gaze positions and eye-movement velocity time-locked to visual display updates (i.e., a change in asterisk position above the displayed sequence) did not reflect the asterisk location above chance levels (Overall cross-validated accuracy = 0.21817; see Author response image 1). Furthermore, an inspection of the eye position data revealed that a majority of participants on most trials displayed random walk gaze patterns around a center fixation point, indicating that participants did not attend to the asterisk position on the display. This is consistent with intrinsic generation of the action sequence, and congruent with the fact that the display does not provide explicit feedback related to performance. As pointed out above, a similar real-world example would be manually inputting a long password into a secure online application. In this case, one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks), which is typically ignored by the user. Notably, the minimal participant engagement with the visual task display observed in this study highlights an important difference between behavior observed during explicit sequence learning motor tasks (which is highly generative in nature) and reactive responses to stimulus cues in a serial reaction time task (SRTT). This is a crucial difference that must be carefully considered when comparing findings across studies. All elements pertaining to this new control analysis are now included in the revised manuscript.
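
      One generic way to judge whether a decoding accuracy exceeds chance is a label-permutation test, sketched below. This is a minimal illustration, assuming five equally likely asterisk positions (so chance is 1/5 = 0.2); the function name and the synthetic labels are hypothetical, and the sketch is not the control analysis reported in the manuscript.

```python
import numpy as np

def permutation_p_value(y_true, y_pred, n_perm=2000, seed=0):
    """Return (accuracy, p), where p is the share of label shuffles that
    reach at least the observed accuracy."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    observed = np.mean(y_true == y_pred)
    # Null distribution: accuracy after breaking the label/prediction pairing.
    null = np.array([np.mean(rng.permutation(y_true) == y_pred)
                     for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic demo with five hypothetical asterisk positions (0..4).
rng = np.random.default_rng(1)
positions = rng.integers(0, 5, size=500)
guesses = rng.integers(0, 5, size=500)  # predictions unrelated to the labels

acc_perfect, p_perfect = permutation_p_value(positions, positions)
acc_guess, p_guess = permutation_p_value(positions, guesses)
```

      A decoder that carries no information about asterisk position should produce an accuracy near 0.2 and a p-value that is not reliably small, whereas the perfect decoder in the demo is separated from the null distribution.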

      The authors report a significant correlation between "offline differentiation" and cumulative micro-offline gains. However, it would be more informative to correlate trial-by-trial changes in each of the two variables. This would address the question of whether there is a trial-by-trial relation between the degree of "contextualization" and the amount of micro-offline gains - are performance changes (micro-offline gains) less pronounced across rest periods for which the change in "contextualization" is relatively low? Furthermore, is the relationship between micro-offline gains and "offline differentiation" significantly stronger than the relationship between micro-offline gains and "online differentiation"? 

      In response to a similar issue raised above by Reviewer #2, we now include new analyses comparing correlation magnitudes between (1) “online differentiation” vs micro-online gains, (2) “online differentiation” vs micro-offline gains and (3) “offline differentiation” vs micro-offline gains (see Author response images 4, 5 and 6 above). These new analyses and results have been added to the revised manuscript. Once again, we thank both Reviewers for this suggestion.
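
      For readers who want a concrete picture of what comparing correlation magnitudes can involve, the sketch below bootstraps a confidence interval for the difference between two correlations that share one variable. The bootstrap approach, the function name and the synthetic data are all assumptions chosen for illustration; the test actually used for Author response images 4 to 6 is not specified here.

```python
import numpy as np

def bootstrap_corr_difference(x1, x2, y, n_boot=2000, seed=0):
    """95% bootstrap interval for corr(x1, y) - corr(x2, y).

    Observations are resampled jointly so the dependence between the two
    correlations (they share y) is preserved.
    """
    rng = np.random.default_rng(seed)
    x1, x2, y = (np.asarray(a, dtype=float) for a in (x1, x2, y))
    n = len(y)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        r1 = np.corrcoef(x1[idx], y[idx])[0, 1]
        r2 = np.corrcoef(x2[idx], y[idx])[0, 1]
        diffs[b] = r1 - r2
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic stand-ins: y plays the role of micro-offline gains, x1 of a
# strongly related measure and x2 of an unrelated one.
rng = np.random.default_rng(2)
y = rng.normal(size=40)
x1 = y + 0.1 * rng.normal(size=40)
x2 = rng.normal(size=40)

ci_low, ci_high = bootstrap_corr_difference(x1, x2, y)
```

      An interval that excludes zero would indicate that one relationship is reliably stronger than the other for the sample at hand.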

      The authors follow the assumption that micro-offline gains reflect offline learning.

      This statement is incorrect. The original Bönstrup et al. (2019)49 paper clearly states that micro-offline gains must be carefully interpreted based upon the behavioral context within which they are observed, and lays out the conditions under which one can have confidence that micro-offline gains reflect offline learning. In fact, the excellent meta-analysis of Pan & Rickard (2015)51, which re-interprets the benefits of sleep in overnight skill consolidation from a “reactive inhibition” perspective, was a crucial resource in the experimental design of our initial study49, as well as in all our subsequent work. Pan & Rickard stated:

      “Empirically, reactive inhibition refers to performance worsening that can accumulate during a period of continuous training (Hull, 1943). It tends to dissipate, at least in part, when brief breaks are inserted between blocks of training. If there are multiple performance-break cycles over a training session, as in the motor sequence literature, performance can exhibit a scalloped effect, worsening during each uninterrupted performance block but improving across blocks52,53. Rickard, Cai, Rieth, Jones, and Ard (2008) and Brawn, Fenn, Nusbaum, and Margoliash (2010) 52,53 demonstrated highly robust scalloped reactive inhibition effects using the commonly employed 30 s–30 s performance break cycle, as shown for Rickard et al.’s (2008) massed practice sleep group in Figure 2. The scalloped effect is evident for that group after the first few 30 s blocks of each session. The absence of the scalloped effect during the first few blocks of training in the massed group suggests that rapid learning during that period masks any reactive inhibition effect.”

      Crucially, Pan & Rickard51 made several concrete recommendations for reducing the impact of the reactive inhibition confound on offline learning studies. One of these recommendations was to reduce practice times to 10s (most prior sequence learning studies up until that point had employed 30s long practice trials). They stated:

      “The traditional design involving 30 s-30 s performance break cycles should be abandoned given the evidence that it results in a reactive inhibition confound, and alternative designs with reduced performance duration per block used instead 51. One promising possibility is to switch to 10 s performance durations for each performance-break cycle Instead 51. That design appears sufficient to eliminate at least the majority of the reactive inhibition effect 52,53.”

      We mindfully incorporated recommendations from Pan and Rickard51  into our own study designs including 1) utilizing 10s practice trials and 2) constraining our analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur), which are prior to the emergence of the “scalloped” performance dynamics that are strongly linked to reactive inhibition effects. 
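
      For reference, micro-online and micro-offline gains are differences in tapping speed measured within a practice trial and across the rest period that follows it. The sketch below illustrates that bookkeeping; the function name, the per-trial "start" and "end" speed inputs and the toy numbers are hypothetical, and restricting the arrays to early-learning trials is left to the caller.

```python
import numpy as np

def micro_gains(start_speed, end_speed):
    """Compute micro-online and micro-offline gains from per-trial speeds.

    start_speed[t], end_speed[t]: tapping speed at the beginning and end of
    practice trial t (hypothetical inputs, e.g. keypresses per second).
    """
    start = np.asarray(start_speed, dtype=float)
    end = np.asarray(end_speed, dtype=float)
    micro_online = end - start             # change during each practice trial
    micro_offline = start[1:] - end[:-1]   # change across each rest period
    return micro_online, micro_offline

# Toy speeds for three practice trials.
micro_online, micro_offline = micro_gains([1.0, 1.6, 2.0], [1.2, 1.7, 2.1])
total_change = micro_online.sum() + micro_offline.sum()
```

      By construction the two components sum to the total change in speed from the start of the first trial to the end of the last, which is why their relative contributions can be compared directly against the overall learning curve.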

      However, there is no direct evidence in the literature that micro-offline gains really result from offline learning, i.e., an improvement in skill level.

      We strongly disagree with the Reviewer’s assertion that “there is no direct evidence in the literature that micro-offline gains really result from offline learning, i.e., an improvement in skill level.”  The initial Bönstrup et al. (2019) 49 report was followed up by a large online crowd-sourcing study (Bönstrup et al., 2020) 54. This second (and much larger) study provided several additional important findings supporting our interpretation of micro-offline gains in cases where the important behavioral conditions clarified above were met (see Author response image 7 below for further details on these conditions).

      Author response image 7.

      Micro-offline gains observed in learning and non-learning contexts are attributed to different underlying causes. (A) Micro-offline and online changes relative to overall trial-by-trial learning. This figure is based on data from Bönstrup et al. (2019) 49. During early learning, micro-offline gains (red bars) closely track trial-by-trial performance gains (green line with open circle markers), with minimal contribution from micro-online gains (blue bars). The stated conclusion in Bönstrup et al. (2019) is that micro-offline gains only during this Early Learning stage reflect rapid memory consolidation (see also 54). After early learning, about practice trial 11, skill plateaus. This plateau skill period is characterized by a striking emergence of coupled (and relatively stable) micro-online drops and micro-offline increases. Bönstrup et al. (2019) as well as others in the literature 55-57, argue that micro-offline gains during the plateau period likely reflect recovery from inhibitory performance factors such as reactive inhibition or fatigue, and thus must be excluded from analyses relating micro-offline gains to skill learning.  The Non-repeating groups in Experiments 3 and 4 from Das et al. (2024) suffer from a lack of consideration of these known confounds.

      Evidence documented in that paper54 showed that micro-offline gains during early skill learning were: 1) replicable and generalized to subjects learning the task in their daily living environment (n=389); 2) equivalent when significantly shortening practice period duration, thus confirming that they are not a result of recovery from performance fatigue (n=118);  3) reduced (along with learning rates) by retroactive interference applied immediately after each practice period relative to interference applied after passage of time (n=373), indicating stabilization of the motor memory at a microscale of several seconds consistent with rapid consolidation; and 4) not modified by random termination of the practice periods, ruling out a contribution of predictive motor slowing (N = 71) 54.  Altogether, our findings were strongly consistent with the interpretation that micro-offline gains reflect memory consolidation supporting early skill learning. This is precisely the portion of the learning curve Pan and Rickard51 refer to when they state “…rapid learning during that period masks any reactive inhibition effect”.

      This interpretation is further supported by brain imaging evidence linking known memory-related networks and consolidation mechanisms to micro-offline gains. First, we reported that the density of fast hippocampo-neocortical skill memory replay events increases approximately three-fold during early learning inter-practice rest periods with the density explaining differences in the magnitude of micro-offline gains across subjects1. Second, Jacobacci et al. (2020) independently reproduced our original behavioral findings and reported BOLD fMRI changes in the hippocampus and precuneus (regions also identified in our MEG study1) linked to micro-offline gains during early skill learning33. These functional changes were coupled with rapid alterations in brain microstructure on the order of minutes, suggesting that the same network that operates during rest periods of early learning undergoes structural plasticity over several minutes following practice58. Third, even more recently, Chen et al. (2024) provided direct evidence from intracranial EEG in humans linking sharp-wave ripple events (which are known markers for neural replay59) in the hippocampus (80-120 Hz in humans) with micro-offline gains during early skill learning. The authors report that the strong increase in ripple rates tracked learning behavior, both across blocks and across participants. The authors conclude that hippocampal ripples during resting offline periods contribute to motor sequence learning2.

      Thus, there is actually now substantial evidence in the literature directly supporting the assertion “that micro-offline gains really result from offline learning”. On the contrary, according to Gupta & Rickard (2024) “…the mechanism underlying RI [reactive inhibition] is not well established” after over 80 years of investigation60, possibly due to the fact that “reactive inhibition” is a categorical description of behavioral effects that likely result from several heterogeneous processes with very different underlying mechanisms.

      On the contrary, recent evidence questions this interpretation (Gupta & Rickard, npj Sci Learn 2022; Gupta & Rickard, Sci Rep 2024; Das et al., bioRxiv 2024). Instead, there is evidence that micro-offline gains are transient performance benefits that emerge when participants train with breaks, compared to participants who train without breaks, however, these benefits vanish within seconds after training if both groups of participants perform under comparable conditions (Das et al., bioRxiv 2024). 

      It is important to point out that the recent work of Gupta & Rickard (2022, 2024)55 does not present any data that directly opposes our finding that early skill learning49 is expressed as micro-offline gains during rest breaks. These studies are essentially an extension of the Rickard et al. (2008) paper, which employed a massed (30s practice followed by 30s breaks) vs. spaced (10s practice followed by 10s breaks) design to assess if recovery from reactive inhibition effects could account for performance gains measured after several minutes or hours. Gupta & Rickard (2022) added two additional groups (30s practice/10s break and 10s practice/10s break as used in the work from our group). The primary aim of the study was to assess whether it was more likely that changes in performance when retested 5 minutes after skill training (consisting of 12 practice trials for the massed groups and 36 practice trials for the spaced groups) had ended reflected memory consolidation effects or recovery from reactive inhibition effects. The Gupta & Rickard (2024) follow-up paper employed a similar design with the primary difference being that participants performed a fixed number of sequences on each trial as opposed to trials lasting a fixed duration. This was done to facilitate the fitting of a quantitative statistical model to the data. To reiterate, neither study included any analysis of micro-online or micro-offline gains, nor any comparison focused on skill gains during early learning. Instead, Gupta & Rickard (2022) reported evidence for reactive inhibition effects for all groups over much longer training periods. Again, we reported the same finding for trials following the early learning period in our original Bönstrup et al. (2019) paper49 (Author response image 7). Also, please note that we reported in this paper that cumulative micro-offline gains over early learning did not correlate with overnight offline consolidation measured 24 hours later49 (see the Results section and further elaboration in the Discussion). Thus, while the composition of our data is supportive of a short-term memory consolidation process operating over several seconds during early learning, it likely differs from those involved over longer training times and offline periods, as assessed by Gupta & Rickard (2022).

      In the recent preprint from Das et al. (2024)61, the authors make the strong claim that “micro-offline gains during early learning do not reflect offline learning”, which is not supported by their own data. The authors hypothesize that if “micro-offline gains represent offline learning, participants should reach higher skill levels when training with breaks, compared to training without breaks”. The study utilizes a spaced vs. massed practice group between-subjects design inspired by the reactive inhibition work from Rickard and others to test this hypothesis. Crucially, the design incorporates only a small fraction of the training used in other investigations to evaluate early skill learning1,33,49,54,57,58,62. A direct comparison between the practice schedule designs for the spaced and massed groups in Das et al., and the training schedule all participants experienced in the original Bönstrup et al. (2019) paper highlights this issue as well as several others (Author response image 8):

      Author response image 8.

      (A) Comparison of Das et al. Spaced & Massed group training session designs, and the training session design from the original Bönstrup et al. (2019)49 paper. Similar to the approach taken by Das et al., all practice is visualized as 10-second practice trials with a variable number (either 0, 1 or 30) of 10-second-long inter-practice rest intervals to allow for direct comparisons between designs. The two key takeaways from this comparison are that (1) the intervention differences (i.e., practice schedules) between the Massed and Spaced groups from the Das et al. report are extremely small (less than 12% of the overall session schedule) and (2) the overall amount of practice is much smaller than in the design from the original Bönstrup report49 (which has been utilized in several subsequent studies). (B) Group-level learning curve data from Bönstrup et al. (2019)49 is used to estimate the performance range accounted for by the equivalent periods covering Test 1, Training 1 and Test 2 from Das et al. (2024). Note that the intervention in the Das et al. study is limited to a period covering less than 50% of the overall learning range.

      First, participants in the original Bönstrup et al. study 49 experienced 157.14% more practice time and 46.97% less inter-practice rest time than the Spaced group in the Das et al. study (Author response image 8).  Thus, the overall amount of practice and rest differ substantially between studies, with much more limited training occurring for participants in Das et al.  

      Second, and perhaps most importantly, the actual intervention (i.e., the difference in practice schedule between the Spaced and Massed groups) employed by Das et al. covers a very small fraction of the overall training session. Identical practice schedule segments for both the Spaced & Massed groups are indicated by the red shaded area in Author response image 8. Please note that these identical segments cover 94.84% of the Massed group training schedule and 88.01% of the Spaced group training schedule (since it has 60 seconds of additional rest). This means that the actual interventions cover only about 5% (for Massed) and 12% (for Spaced) of the total training session, which minimizes any chance of observing a difference between groups.
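
      The intervention fractions follow directly from the identical-segment percentages quoted above, as the short calculation below shows. Only the two percentages come from the text; the variable names are illustrative.

```python
# Share of each group's schedule that is identical across groups (from the text).
identical_share = {"Massed": 94.84, "Spaced": 88.01}

# The intervention is whatever remains of the schedule, in percent.
intervention_share = {
    group: round(100.0 - share, 2) for group, share in identical_share.items()
}

for group, share in intervention_share.items():
    print(f"{group}: intervention covers {share:.2f}% of the session")
```

      This gives 5.16% for the Massed group and 11.99% for the Spaced group.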

      Also note that the very beginning of the practice schedule (during which Figure R9 shows substantial learning is known to occur) is labeled in the Das et al. study as Test 1.  Test 1 encompasses the first 20 seconds of practice (alternatively viewed as the first two 10-second-long practice trials with no inter-practice rest). This is immediately followed by the Training 1 intervention, which is composed of only three 10-second-long practice trials (with 10-second inter-practice rest for the Spaced group and no inter-practice rest for the Massed group). Author response image 8 also shows that since there is no inter-practice rest after the third Training practice trial for the Spaced group, this third trial (for both Training 1 and 2) is actually a part of an identical practice schedule segment shared by both groups (Massed and Spaced), reducing the magnitude of the intervention even further.

      Moreover, we know from the original Bönstrup et al. (2019) paper49 that 46.57% of all overall group-level performance gains occurred between trials 2 and 5 for that study. Thus, Das et al. are limiting their designed intervention to a period covering less than half of the early learning range discussed in the literature, which again, minimizes any chance of observing an effect.

      This issue is amplified even further at Training 2 since skill learning prior to the long 5-minute break is retained, further constraining the performance range over these three trials. A related issue pertains to the trials labeled as Test 1 (trials 1-2) and Test 2 (trials 6-7) by Das et al. Again, we know from the original Bönstrup et al. paper49 that 18.06% and 14.43% (32.49% total) of all overall group-level performance gains occurred during trials corresponding to Das et al. Test 1 and Test 2, respectively. In other words, Das et al. averaged skill performance over 20 seconds of practice at two time-points where dramatic skill improvements occur. Pan & Rickard (2015) previously showed that such averaging is known to inject artefacts into analyses of performance gains.

      Furthermore, the structure of the Test in the Das et al. study appears to have an interference effect on the Spaced group performance after the training intervention. This makes sense if you consider that the Spaced group is required to now perform the task in a Massed practice environment (i.e., two 10-second-long practice trials merged into one long trial), further blurring the true intervention effects. This effect is observable in Figure 1C,E of their pre-print. Specifically, while the Massed group continues to show an increase in performance during test relative to the last 10 seconds of practice during training, the Spaced group displays a marked decrease. This decrease is in stark contrast to the monotonic increases observed for both groups at all other time-points.

      Interestingly, when statistical comparisons between the groups are made at the time-points when the intervention is present (as opposed to after it has been removed) then the stated hypothesis, “If micro-offline gains represent offline learning, participants should reach higher skill levels when training with breaks, compared to training without breaks”, is confirmed.

      The data presented by Gupta and Rickard (2022, 2024) and Das et al. (2024) is in many ways more confirmatory of the constraints employed by our group and others with respect to experimental design, analysis and interpretation of study findings, rather than contradictory. Still, it does highlight a limitation of the current micro-online/offline framework, which was originally only intended to be applied to early skill learning over spaced practice schedules when reactive inhibition effects are minimized49. Extrapolation of this current framework to post-plateau performance periods, longer timespans, or non-learning situations (e.g. – the Non-repeating groups from Experiments 3 & 4 in Das et al. (2024)), when reactive inhibition plays a more substantive role, is not warranted. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.

      References

      (1) Buch, E. R., Claudino, L., Quentin, R., Bonstrup, M. & Cohen, L. G. Consolidation of human skill linked to waking hippocampo-neocortical replay. Cell Rep 35, 109193 (2021). https://doi.org:10.1016/j.celrep.2021.109193

      (2) Chen, P.-C., Stritzelberger, J., Walther, K., Hamer, H. & Staresina, B. P. Hippocampal ripples during offline periods predict human motor sequence learning. bioRxiv, 2024.10.06.614680 (2024). https://doi.org:10.1101/2024.10.06.614680

      (3) Classen, J., Liepert, J., Wise, S. P., Hallett, M. & Cohen, L. G. Rapid plasticity of human cortical movement representation induced by practice. J Neurophysiol 79, 1117-1123 (1998).

      (4) Karni, A. et al. Functional MRI evidence for adult motor cortex plasticity during motor skill learning. Nature 377, 155-158 (1995). https://doi.org:10.1038/377155a0

      (5) Kleim, J. A., Barbay, S. & Nudo, R. J. Functional reorganization of the rat motor cortex following motor skill learning. J Neurophysiol 80, 3321-3325 (1998).

      (6) Shadmehr, R. & Holcomb, H. H. Neural correlates of motor memory consolidation. Science 277, 821-824 (1997).

      (7) Doyon, J. et al. Experience-dependent changes in cerebellar contributions to motor sequence learning. Proc Natl Acad Sci U S A 99, 1017-1022 (2002).

      (8) Toni, I., Ramnani, N., Josephs, O., Ashburner, J. & Passingham, R. E. Learning arbitrary visuomotor associations: temporal dynamic of brain activity. Neuroimage 14, 1048-1057 (2001).

      (9) Grafton, S. T. et al. Functional anatomy of human procedural learning determined with regional cerebral blood flow and PET. J Neurosci 12, 2542-2548 (1992).

      (10) Kennerley, S. W., Sakai, K. & Rushworth, M. F. Organization of action sequences and the role of the pre-SMA. J Neurophysiol 91, 978-993 (2004). https://doi.org:10.1152/jn.00651.2003

      (11) Hardwick, R. M., Rottschy, C., Miall, R. C. & Eickhoff, S. B. A quantitative meta-analysis and review of motor learning in the human brain. Neuroimage 67, 283-297 (2013). https://doi.org:10.1016/j.neuroimage.2012.11.020

      (12) Sawamura, D. et al. Acquisition of chopstick-operation skills with the non-dominant hand and concomitant changes in brain activity. Sci Rep 9, 20397 (2019). https://doi.org:10.1038/s41598-019-56956-0

      (13) Lee, S. H., Jin, S. H. & An, J. The difference in cortical activation pattern for complex motor skills: A functional near-infrared spectroscopy study. Sci Rep 9, 14066 (2019). https://doi.org:10.1038/s41598-019-50644-9

      (14) Battaglia-Mayer, A. & Caminiti, R. Corticocortical Systems Underlying High-Order Motor Control. J Neurosci 39, 4404-4421 (2019). https://doi.org:10.1523/JNEUROSCI.2094-18.2019

      (15) Toni, I., Thoenissen, D. & Zilles, K. Movement preparation and motor intention. Neuroimage 14, S110-117 (2001). https://doi.org:10.1006/nimg.2001.0841

      (16) Wolpert, D. M., Goodbody, S. J. & Husain, M. Maintaining internal representations: the role of the human superior parietal lobe. Nat Neurosci 1, 529-533 (1998). https://doi.org:10.1038/2245

      (17) Andersen, R. A. & Buneo, C. A. Intentional maps in posterior parietal cortex. Annu Rev Neurosci 25, 189-220 (2002). https://doi.org:10.1146/annurev.neuro.25.112701.142922

      (18) Buneo, C. A. & Andersen, R. A. The posterior parietal cortex: sensorimotor interface for the planning and online control of visually guided movements. Neuropsychologia 44, 2594-2606 (2006). https://doi.org:10.1016/j.neuropsychologia.2005.10.011

      (19) Grover, S., Wen, W., Viswanathan, V., Gill, C. T. & Reinhart, R. M. G. Long-lasting, dissociable improvements in working memory and long-term memory in older adults with repetitive neuromodulation. Nat Neurosci 25, 1237-1246 (2022). https://doi.org:10.1038/s41593-022-01132-3

      (20) Colclough, G. L. et al. How reliable are MEG resting-state connectivity metrics? Neuroimage 138, 284-293 (2016). https://doi.org:10.1016/j.neuroimage.2016.05.070

      (21) Colclough, G. L., Brookes, M. J., Smith, S. M. & Woolrich, M. W. A symmetric multivariate leakage correction for MEG connectomes. NeuroImage 117, 439-448 (2015). https://doi.org:10.1016/j.neuroimage.2015.03.071

      (22) Mollazadeh, M. et al. Spatiotemporal variation of multiple neurophysiological signals in the primary motor cortex during dexterous reach-to-grasp movements. J Neurosci 31, 15531-15543 (2011). https://doi.org:10.1523/JNEUROSCI.2999-11.2011

      (23) Bansal, A. K., Vargas-Irwin, C. E., Truccolo, W. & Donoghue, J. P. Relationships among low-frequency local field potentials, spiking activity, and three-dimensional reach and grasp kinematics in primary motor and ventral premotor cortices. J Neurophysiol 105, 1603-1619 (2011). https://doi.org:10.1152/jn.00532.2010

      (24) Flint, R. D., Ethier, C., Oby, E. R., Miller, L. E. & Slutzky, M. W. Local field potentials allow accurate decoding of muscle activity. J Neurophysiol 108, 18-24 (2012). https://doi.org:10.1152/jn.00832.2011

      (25) Churchland, M. M. et al. Neural population dynamics during reaching. Nature 487, 51-56 (2012). https://doi.org:10.1038/nature11129

      (26) Bassett, D. S. et al. Dynamic reconfiguration of human brain networks during learning. Proc Natl Acad Sci U S A 108, 7641-7646 (2011). https://doi.org:10.1073/pnas.1018985108

      (27) Albouy, G., King, B. R., Maquet, P. & Doyon, J. Hippocampus and striatum: dynamics and interaction during acquisition and sleep-related motor sequence memory consolidation. Hippocampus 23, 985-1004 (2013). https://doi.org:10.1002/hipo.22183

      (28) Albouy, G. et al. Neural correlates of performance variability during motor sequence acquisition. Neuroimage 60, 324-331 (2012). https://doi.org:10.1016/j.neuroimage.2011.12.049

      (29) Qin, Y. L., McNaughton, B. L., Skaggs, W. E. & Barnes, C. A. Memory reprocessing in corticocortical and hippocampocortical neuronal ensembles. Philos Trans R Soc Lond B Biol Sci 352, 1525-1533 (1997). https://doi.org:10.1098/rstb.1997.0139

      (30) Euston, D. R., Tatsuno, M. & McNaughton, B. L. Fast-forward playback of recent memory sequences in prefrontal cortex during sleep. Science 318, 1147-1150 (2007). https://doi.org:10.1126/science.1148979

      (31) Molle, M. & Born, J. Hippocampus whispering in deep sleep to prefrontal cortex--for good memories? Neuron 61, 496-498 (2009). https://doi.org:10.1016/j.neuron.2009.02.002

      (32) Frankland, P. W. & Bontempi, B. The organization of recent and remote memories. Nat Rev Neurosci 6, 119-130 (2005). https://doi.org:10.1038/nrn1607

      (33) Jacobacci, F. et al. Rapid hippocampal plasticity supports motor sequence learning. Proc Natl Acad Sci U S A 117, 23898-23903 (2020). https://doi.org:10.1073/pnas.2009576117

      (34) Albouy, G. et al. Maintaining vs. enhancing motor sequence memories: respective roles of striatal and hippocampal systems. Neuroimage 108, 423-434 (2015). https://doi.org:10.1016/j.neuroimage.2014.12.049

      (35) Gais, S. et al. Sleep transforms the cerebral trace of declarative memories. Proc Natl Acad Sci U S A 104, 18778-18783 (2007). https://doi.org:10.1073/pnas.0705454104

      (36) Sterpenich, V. et al. Sleep promotes the neural reorganization of remote emotional memory. J Neurosci 29, 5143-5152 (2009). https://doi.org:10.1523/JNEUROSCI.0561-09.2009

      (37) Euston, D. R., Gruber, A. J. & McNaughton, B. L. The role of medial prefrontal cortex in memory and decision making. Neuron 76, 1057-1070 (2012). https://doi.org:10.1016/j.neuron.2012.12.002

      (38) van Kesteren, M. T., Fernandez, G., Norris, D. G. & Hermans, E. J. Persistent schema-dependent hippocampal-neocortical connectivity during memory encoding and postencoding rest in humans. Proc Natl Acad Sci U S A 107, 7550-7555 (2010). https://doi.org:10.1073/pnas.0914892107

      (39) van Kesteren, M. T., Ruiter, D. J., Fernandez, G. & Henson, R. N. How schema and novelty augment memory formation. Trends Neurosci 35, 211-219 (2012). https://doi.org:10.1016/j.tins.2012.02.001

      (40) Wagner, A. D. et al. Building memories: remembering and forgetting of verbal experiences as predicted by brain activity. Science (New York, N.Y.) 281, 1188-1191 (1998).

      (41) Ashe, J., Lungu, O. V., Basford, A. T. & Lu, X. Cortical control of motor sequences. Curr Opin Neurobiol 16, 213-221 (2006).

      (42) Hikosaka, O., Nakamura, K., Sakai, K. & Nakahara, H. Central mechanisms of motor skill learning. Curr Opin Neurobiol 12, 217-222 (2002).

      (43) Penhune, V. B. & Steele, C. J. Parallel contributions of cerebellar, striatal and M1 mechanisms to motor sequence learning. Behav. Brain Res. 226, 579-591 (2012). https://doi.org:10.1016/j.bbr.2011.09.044

      (44) Doyon, J. et al. Contributions of the basal ganglia and functionally related brain structures to motor learning. Behavioural brain research 199, 61-75 (2009). https://doi.org:10.1016/j.bbr.2008.11.012

      (45) Schendan, H. E., Searl, M. M., Melrose, R. J. & Stern, C. E. An FMRI study of the role of the medial temporal lobe in implicit and explicit sequence learning. Neuron 37, 1013-1025 (2003). https://doi.org:10.1016/s0896-6273(03)00123-5

      (46) Morris, R. G. M. Elements of a neurobiological theory of hippocampal function: the role of synaptic plasticity, synaptic tagging and schemas. The European journal of neuroscience 23, 2829-2846 (2006). https://doi.org:10.1111/j.1460-9568.2006.04888.x

      (47) Tse, D. et al. Schemas and memory consolidation. Science 316, 76-82 (2007). https://doi.org:10.1126/science.1135935

      (48) Berlot, E., Popp, N. J. & Diedrichsen, J. A critical re-evaluation of fMRI signatures of motor sequence learning. Elife 9 (2020). https://doi.org:10.7554/eLife.55241

      (49) Bonstrup, M. et al. A Rapid Form of Offline Consolidation in Skill Learning. Curr Biol 29, 1346-1351 e1344 (2019). https://doi.org:10.1016/j.cub.2019.02.049

      (50) Kornysheva, K. et al. Neural Competitive Queuing of Ordinal Structure Underlies Skilled Sequential Action. Neuron 101, 1166-1180 e1163 (2019). https://doi.org:10.1016/j.neuron.2019.01.018

      (51) Pan, S. C. & Rickard, T. C. Sleep and motor learning: Is there room for consolidation? Psychol Bull 141, 812-834 (2015). https://doi.org:10.1037/bul0000009

      (52) Rickard, T. C., Cai, D. J., Rieth, C. A., Jones, J. & Ard, M. C. Sleep does not enhance motor sequence learning. J Exp Psychol Learn Mem Cogn 34, 834-842 (2008). https://doi.org:10.1037/0278-7393.34.4.834

      (53) Brawn, T. P., Fenn, K. M., Nusbaum, H. C. & Margoliash, D. Consolidating the effects of waking and sleep on motor-sequence learning. J Neurosci 30, 13977-13982 (2010). https://doi.org:10.1523/JNEUROSCI.3295-10.2010

      (54) Bonstrup, M., Iturrate, I., Hebart, M. N., Censor, N. & Cohen, L. G. Mechanisms of offline motor learning at a microscale of seconds in large-scale crowdsourced data. NPJ Sci Learn 5, 7 (2020). https://doi.org:10.1038/s41539-020-0066-9

      (55) Gupta, M. W. & Rickard, T. C. Dissipation of reactive inhibition is sufficient to explain post-rest improvements in motor sequence learning. NPJ Sci Learn 7, 25 (2022). https://doi.org:10.1038/s41539-022-00140-z

      (56) Jacobacci, F. et al. Rapid hippocampal plasticity supports motor sequence learning. Proceedings of the National Academy of Sciences 117, 23898-23903 (2020).

      (57) Brooks, E., Wallis, S., Hendrikse, J. & Coxon, J. Micro-consolidation occurs when learning an implicit motor sequence, but is not influenced by HIIT exercise. NPJ Sci Learn 9, 23 (2024). https://doi.org:10.1038/s41539-024-00238-6

      (58) Deleglise, A. et al. Human motor sequence learning drives transient changes in network topology and hippocampal connectivity early during memory consolidation. Cereb Cortex 33, 6120-6131 (2023). https://doi.org:10.1093/cercor/bhac489

      (59) Buzsaki, G. Hippocampal sharp wave-ripple: A cognitive biomarker for episodic memory and planning. Hippocampus 25, 1073-1188 (2015). https://doi.org:10.1002/hipo.22488

      (60) Gupta, M. W. & Rickard, T. C. Comparison of online, offline, and hybrid hypotheses of motor sequence learning using a quantitative model that incorporate reactive inhibition. Sci Rep 14, 4661 (2024). https://doi.org:10.1038/s41598-024-52726-9

      (61) Das, A., Karagiorgis, A., Diedrichsen, J., Stenner, M.-P. & Azanon, E. “Micro-offline gains” convey no benefit for motor skill learning. bioRxiv, 2024.2007.2011.602795 (2024). https://doi.org:10.1101/2024.07.11.602795

      (62) Mylonas, D. et al. Maintenance of Procedural Motor Memory across Brief Rest Periods Requires the Hippocampus. J Neurosci 44 (2024). https://doi.org:10.1523/JNEUROSCI.1839-23.2024

    1. Author response:

      eLife assessment

      This potentially useful study involves neuro-imaging and electrophysiology in a small cohort of congenital cataract patients after sight recovery and age-matched control participants with normal sight. It aims to characterize the effects of early visual deprivation on excitatory and inhibitory balance in the visual cortex. While the findings are taken to suggest the existence of persistent alterations in Glx/GABA ratio and aperiodic EEG signals, the evidence supporting these claims is incomplete. Specifically, small sample sizes, lack of a specific control cohort, and other methodological limitations will likely restrict the usefulness of the work, with relevance limited to scientists working in this particular subfield.

      As pointed out in the public reviews, there are only very few human models which allow for assessing the role of early experience on neural circuit development. While the prevalent research in permanent congenital blindness reveals the response and adaptation of the developing brain to an atypical situation (blindness), research in sight restoration addresses the question of whether and how atypical development can be remediated if typical experience (vision) is restored. The literature on the role of visual experience in the development of E/I balance in humans, assessed via Magnetic Resonance Spectroscopy (MRS), has been limited to a few studies on congenital permanent blindness. Thus, we assessed sight recovery individuals with a history of congenital blindness, as limited evidence from other researchers indicated that the visual cortex E/I ratio might differ compared to normally sighted controls.

      Individuals with total bilateral congenital cataracts who remained untreated until later in life are extremely rare, particularly if only carefully diagnosed patients are included in a study sample. A sample size of 10 patients is, at the very least, typical of past studies in this population, even for exclusively behavioral assessments. In the present study, in addition to behavioral assessment as an indirect measure of sensitive periods, we investigated participants with two neuroimaging methods (Magnetic Resonance Spectroscopy and electroencephalography) to directly assess the neural correlates of sensitive periods in humans. The electroencephalography data allowed us to link the results of our small sample to findings documented in large cohorts of both sight recovery individuals and permanently congenitally blind individuals. As pointed out in a recent editorial recommending an “exploration-then-estimation procedure” (“Consideration of Sample Size in Neuroscience Studies,” 2020), exploratory studies like ours provide crucial direction and specific hypotheses for future work.

      We included an age-matched sighted control group recruited from the same community, measured in the same scanner and laboratory, to assess whether early experience is necessary for a typical excitatory/inhibitory (E/I) ratio to emerge in adulthood. The present findings indicate that this is indeed the case. Based on these results, a possible question to answer in future work, with individuals who had developmental cataracts, is whether later visual deprivation causes similar effects. Note that even if visual deprivation at a later stage in life caused similar effects, the current results would not be invalidated; by contrast, they are essential to understand future work on late (permanent or transient) blindness.

      Thus, we think that the present manuscript has far reaching implications for our understanding of the conditions under which E/I balance, a crucial characteristic of brain functioning, emerges in humans.

      Finally, our manuscript is one of the first studies to relate MRS neurotransmitter concentrations to parameters of EEG aperiodic activity. Since present research has been using aperiodic activity as a correlate of the E/I ratio, and partially of higher cognitive functions, we think that our manuscript additionally contributes to a better understanding of what might be measured with aperiodic neurophysiological activity.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this human neuroimaging and electrophysiology study, the authors aimed to characterize the effects of a period of visual deprivation in the sensitive period on excitatory and inhibitory balance in the visual cortex. They attempted to do so by comparing neurochemistry conditions ('eyes open', 'eyes closed') and resting state, and visually evoked EEG activity between ten congenital cataract patients with recovered sight (CC), and ten age-matched control participants (SC) with normal sight.

      First, they used magnetic resonance spectroscopy to measure in vivo neurochemistry from two locations, the primary location of interest in the visual cortex, and a control location in the frontal cortex. Such voxels are used to provide a control for the spatial specificity of any effects because the single-voxel MRS method provides a single sampling location. Using MR-visible proxies of excitatory and inhibitory neurotransmission, Glx and GABA+ respectively, the authors report no group effects in GABA+ or Glx, no difference in the functional conditions 'eyes closed' and 'eyes open'. They found an effect of the group in the ratio of Glx/GABA+ and no similar effect in the control voxel location. They then performed multiple exploratory correlations between MRS measures and visual acuity, and reported a weak positive correlation between the 'eyes open' condition and visual acuity in CC participants.

      The same participants then took part in an EEG experiment. The authors selected only two electrodes placed in the visual cortex for analysis and reported a group difference in an EEG index of neural activity, the aperiodic intercept, as well as the aperiodic slope, considered a proxy for cortical inhibition. They report an exploratory correlation between the aperiodic intercept and Glx in one out of three EEG conditions.

      The authors report the difference in E/I ratio, and interpret the lower E/I ratio as representing an adaptation to visual deprivation, which would have initially caused a higher E/I ratio. Although intriguing, the strength of evidence in support of this view is not strong. Amongst the limitations are the low sample size, a critical control cohort that could provide evidence for a higher E/I ratio in CC patients without recovered sight for example, and lower data quality in the control voxel.

      Strengths of study:

      How sensitive period experience shapes the developing brain is an enduring and important question in neuroscience. This question has been particularly difficult to investigate in humans. The authors recruited a small number of sight-recovered participants with bilateral congenital cataracts to investigate the effect of sensitive period deprivation on the balance of excitation and inhibition in the visual brain using measures of brain chemistry and brain electrophysiology. The research is novel, and the paper was interesting and well-written.

      Limitations:

      (1.1) Low sample size. Ten for CC and ten for SC, and a further two SC participants were rejected due to a lack of frontal control voxel data. The sample size limits the statistical power of the dataset and increases the likelihood of effect inflation.

      Applying strict criteria, we only included individuals who were born with no patterned vision in the CC group. The population of individuals who have remained untreated past infancy is small in India, despite a higher prevalence of childhood cataract than in Germany. Indeed, from the original 11 CC and 11 SC participants tested, one participant each from the CC and SC group had to be rejected, as their data had been corrupted, resulting in 10 participants in each group.

      It was a challenge to recruit participants from this rare group with no history of neurological diagnosis/intake of neuromodulatory medications, who were able and willing to undergo both MRS and EEG. For this study, data collection took more than 1.5 years.

      We safeguarded the validity of our results in two ways. First, we assessed not only MRS but also EEG measures of the E/I ratio. The latter allowed us to link our results to a larger population of CC individuals; that is, we replicated the results of a larger group of 38 individuals (Ossandón et al., 2023) in our sub-group.

      Second, we included a control voxel. As predicted, all group effects were restricted to the occipital voxel.

      (1.2) Lack of specific control cohort. The control cohort has normal vision. The control cohort is not specific enough to distinguish between people with sight loss due to different causes and patients with congenital cataracts with co-morbidities. Further data from more specific populations, such as patients whose cataracts have not been removed, with developmental cataracts, or congenitally blind participants, would greatly improve the interpretability of the main finding. The lack of a more specific control cohort is a major caveat that limits a conclusive interpretation of the results.

      The existing work on visual deprivation and neurochemical changes, as assessed with MRS, has been limited to permanent congenital blindness. In fact, most of the studies on permanent blindness included only congenitally blind or early blind humans (Coullon et al., 2015; Weaver et al., 2013), or, in separate studies, only late-blind individuals (Bernabeu et al., 2009). Accordingly, we started with the most “extreme” visual deprivation model, sight recovery after congenital blindness. If we had not observed any group difference compared to normally sighted controls, investigating other groups might have been trivial. Based on our results, subsequent studies in late blind individuals, and then individuals with developmental cataracts, can be planned with clear hypotheses.

      (1.3) MRS data quality differences. Data quality in the control voxel appears worse than in the visual cortex voxel. The frontal cortex MRS spectrum shows far broader linewidth than the visual cortex (Supplementary Figures). Compared to the visual voxel, the frontal cortex voxel has less defined Glx and GABA+ peaks; lower GABA+ and Glx concentrations, lower NAA SNR values; lower NAA concentrations. If the data quality is a lot worse in the FC, then small effects may not be detectable.

      Worse data quality in the frontal than the visual cortex has been repeatedly observed in the MRS literature, attributable to magnetic field distortions (Juchem & Graaf, 2017) resulting from the proximity of the region to the sinuses (for a recent example, see Rideaux et al., 2022). Nevertheless, we chose the frontal control region rather than a parietal voxel, given the potential neurochemical changes in multisensory regions of the parietal cortex due to blindness. Such reorganization would be less likely in frontal areas associated with higher cognitive functions. Further, prior MRS studies of the visual cortex have used the frontal cortex as a control region as well (Pitchaimuthu et al., 2017; Rideaux et al., 2022).

      In the present study, we checked that the frontal cortex datasets for Glx and GABA+ concentrations were of sufficient quality: the fit error was below 8.31% in both groups (Supplementary Material S3). For reference, Mikkelsen et al. reported a mean GABA+ fit error of 6.24 +/- 1.95% from a posterior cingulate cortex voxel across 8 GE scanners, using the Gannet pipeline. No absolute cutoffs have been proposed for fit errors. However, MRS studies in special populations (I/E ratio assessed in narcolepsy (Gao et al., 2024), GABA concentration assessed in Autism Spectrum Disorder (Maier et al., 2022)) have used frontal cortex data with a fit error of <10% to identify differences between cohorts (Gao et al., 2024; Pitchaimuthu et al., 2017). Based on the literature, MRS data from the frontal voxel of the present study would have been of sufficient quality to uncover group differences.

      In the revised manuscript, we will add the recently published MRS quality assessment form to the supplementary materials. Additionally, we would like to point to our a priori prediction of group differences for the visual cortex voxel, but not for the frontal cortex voxel.

      (1.4) Because of the direction of the difference in E/I, the authors interpret their findings as representing signatures of sight improvement after surgery without further evidence, either within the study or from the literature. However, the literature suggests that plasticity and visual deprivation drive the E/I index up rather than down. Decreasing GABA+ is thought to facilitate experience-dependent remodelling. What evidence is there that cortical inhibition increases in response to a visual cortex that is over-sensitised due to congenital cataracts? Without further experimental or literature support this interpretation remains very speculative.

      Indeed, higher inhibition was not predicted, which we attempt to reconcile in our discussion section. We base our discussion mainly on the non-human animal literature, which has shown evidence of homeostatic changes after prolonged visual deprivation in the adult brain (Barnes et al., 2015). It is also interesting to note that after monocular deprivation in adult humans, resting GABA+ levels decreased in the visual cortex (Lunghi et al., 2015). Assuming that after delayed sight restoration, adult neuroplasticity mechanisms must be employed, these studies would predict a “balancing” of the increased excitatory drive following sight restoration by a commensurate increase in inhibition (Keck et al., 2017). Additionally, the EEG results of the present study allowed for speculation regarding the underlying neural mechanisms of an altered E/I ratio. The aperiodic EEG activity suggested higher spontaneous spiking (increased intercept) and increased inhibition (steeper aperiodic slope between 1-20 Hz) in CC vs SC individuals (Ossandón et al., 2023).

      In the revised manuscript, we will more clearly indicate that these speculations are based primarily on non-human animal work, due to the lack of human studies on the subject.
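
      As an illustration of the two aperiodic EEG parameters referred to above, the following minimal sketch fits a straight line to a power spectrum in log-log space over 1-20 Hz. It is only an assumption about how such parameters can be estimated, not a description of the pipeline used in the study; the function name `fit_aperiodic` and the synthetic spectrum are hypothetical, and dedicated spectral parameterization tools additionally model oscillatory peaks, which this sketch ignores.

```python
import math

def fit_aperiodic(freqs, power, fmin=1.0, fmax=20.0):
    """Least-squares line through log10(power) vs log10(freq) within [fmin, fmax].

    Returns (intercept, slope). A steeper (more negative) slope means that
    power falls off faster with frequency; the intercept is the fitted
    log10 power at 1 Hz (where log10(freq) = 0).
    """
    pts = [(math.log10(f), math.log10(p))
           for f, p in zip(freqs, power) if fmin <= f <= fmax]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical noiseless spectrum (not study data): power = 10**2 / f**1.5
freqs = [0.5 * k for k in range(1, 81)]        # 0.5 Hz steps up to 40 Hz
power = [10 ** 2.0 / f ** 1.5 for f in freqs]
intercept, slope = fit_aperiodic(freqs, power)
```

      On this synthetic input the fit recovers the generating intercept and slope, which makes the two quantities compared between CC and SC individuals concrete: a higher intercept shifts the whole spectrum upward, whereas a steeper slope tilts it.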

      (1.5) Heterogeneity in the patient group. Congenital cataract (CC) patients experienced a variety of duration of visual impairment and were of different ages. They presented with co-morbidities (absorbed lens, strabismus, nystagmus). Strabismus has been associated with abnormalities in GABAergic inhibition in the visual cortex. The possible interactions with residual vision and confounds of co-morbidities are not experimentally controlled for in the correlations, and not discussed.

      The goal of the present study was to assess whether we would observe changes in E/I ratio after restoring vision at all. We would not have included patients without nystagmus in the CC group of the present study, since it would have been unlikely that they experienced congenital patterned visual deprivation. Amongst diagnosticians, nystagmus or strabismus might not be considered genuine “comorbidities” that emerge in people with congenital cataracts. Rather, these are consequences of congenital visual deprivation, which we employed as diagnostic criteria. Similarly, absorbed lenses are clear signs that cataracts were congenital. As in other models of experience-dependent brain development (e.g. the extant literature on congenital permanent blindness, including anophthalmic individuals (Coullon et al., 2015; Weaver et al., 2013)), some uncertainty remains regarding whether the (remaining, in our case) abnormalities of the eye, or the blindness they caused, are the factors driving neural changes. In the case of people with reversed congenital cataracts, at least the retina is considered to be intact, as they would otherwise not receive cataract removal surgery.

      However, we consider it unlikely that strabismus caused the group differences, because the present study shows group differences in the Glx/GABA+ ratio at rest, regardless of eye opening or eye closure, for which strabismus would have caused distinct effects. By contrast, the link between GABA concentration and, for example, interocular suppression in strabismus has so far been documented during visual stimulation (Mukerji et al., 2022; Sengpiel et al., 2006), and differed in direction depending on the amblyopic vs. non-amblyopic eye. Further, one MRS study did not find group differences in GABA concentration between the visual cortices of 16 amblyopic individuals and sighted controls (Mukerji et al., 2022), supporting the view that the differences in Glx/GABA+ concentration which we observed were driven by congenital deprivation, and not amblyopia-associated visual acuity or eye movement differences.

      In the revised manuscript, we will discuss the inclusion criteria in more detail, and the aforementioned reasons why our data remains interpretable.

      (1.6) Multiple exploratory correlations were performed to relate MRS measures to visual acuity (shown in Supplementary Materials), and only specific ones were shown in the main document. The authors describe the analysis as exploratory in the 'Methods' section. Furthermore, the correlation between visual acuity and E/I metric is weak, and not corrected for multiple comparisons. The results should be presented as preliminary, as no strong conclusions can be made from them. They can provide a hypothesis to test in a future study.

      In the revised manuscript, we will clearly indicate that the exploratory correlation analyses are reported to put forth hypotheses for future studies.
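
      For readers who wish to follow up on the reviewer's point about multiple comparisons, the sketch below shows one common correction, the Holm-Bonferroni step-down procedure. This technique is named here only as an example; neither the reviewer nor our analysis specifies it, and the p-values in the snippet are hypothetical placeholders rather than results from the study.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjusted p-values, returned in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical uncorrected p-values from four exploratory correlations.
raw = [0.04, 0.01, 0.30, 0.02]
adjusted = holm_adjust(raw)
```

      With these placeholder inputs, none of the four tests would remain below 0.05 except the smallest, which illustrates why uncorrected exploratory correlations are best treated as hypotheses for future studies.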

      (1.7) P.16 Given the correlation of the aperiodic intercept with age ("Age negatively correlated with the aperiodic intercept across CC and SC individuals, that is, a flattening of the intercept was observed with age"), age needs to be controlled for in the correlation between neurochemistry and the aperiodic intercept. Glx has also been shown to negatively correlate with age.

      The correlation between chronological age and aperiodic intercept was observed across groups, but the correlation between Glx and the intercept of the aperiodic EEG activity was seen only in the CC group, even though the SC group was matched for age. Thus, such a correlation was very unlikely to be predominantly driven by an effect of chronological age.

      In the revised manuscript, we will add the linear regressions with age as a covariate included below, for the relationship between aperiodic intercept and Glx concentration in the CC group. 

      a. A linear regression was conducted within the CC group to predict the intercept during visual stimulation, based on age and visual cortex Glx concentration. The results of the regression analysis indicated that the model explained a significant proportion of the variance in the aperiodic intercept, R² = 0.82, F(2,7) = 16.1, p = 0.0024. Note that the coefficient for age was not significant, β = 0.007, t(7) = 0.82, p = 0.439. The regression coefficients and their respective statistics are presented in Author response table 1.

      Author response table 1.

      Regression Analysis Summary for Predicting Aperiodic Intercept (Visual Stimulation) in the CC group

      b. A linear regression was conducted to predict the intercept during eye opening at rest, based on age and visual cortex Glx concentration. The results of the regression analysis indicated that the model explained a significant proportion of the variance in the aperiodic intercept, R² = 0.842, F(2,7) = 18.6, p = 0.00159. Note that the coefficient for age was not significant, β = −0.005, t(7) = −0.90, p = 0.400. The regression coefficients and their respective statistics are presented in Author response table 2.

      Author response table 2.

      Regression Analysis Summary for Predicting Aperiodic Intercept (Eyes Open) in the CC group

      c. Given that the Glx coefficient is significant in both models and age does not significantly predict either outcome, it can be concluded that Glx predicts the aperiodic intercept independently of age.
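
      The model-level statistics in (a) and (b) carry (2,7) degrees of freedom, which correspond to an overall F-test for a regression with two predictors (age and Glx) and ten CC participants. As a rough consistency check, the sketch below recomputes that statistic and its p-value from the reported R² values alone. The function names are hypothetical, and the closed-form p-value is valid only when the numerator has exactly two degrees of freedom.

```python
def overall_f(r_squared, n_obs, n_predictors):
    """Overall F statistic of a linear regression, computed from its R-squared."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_stat = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)
    return f_stat, df_model, df_resid

def f_pvalue_two_predictors(f_stat, df_resid):
    """Upper-tail p-value of F(2, df_resid).

    Uses the closed form (1 + 2F/df_resid) ** (-df_resid / 2), which holds
    only for two numerator degrees of freedom.
    """
    return (1.0 + 2.0 * f_stat / df_resid) ** (-df_resid / 2.0)

# Reported R-squared values: 0.82 (visual stimulation) and 0.842 (eyes open).
f_stim, df1, df2 = overall_f(0.82, n_obs=10, n_predictors=2)
p_stim = f_pvalue_two_predictors(f_stim, df2)
f_open, _, _ = overall_f(0.842, n_obs=10, n_predictors=2)
p_open = f_pvalue_two_predictors(f_open, df2)
```

      The recomputed values are approximately 15.9 and 18.7, with p-values of approximately 0.0025 and 0.0016. These lie close to the reported statistics of 16.1 and 18.6 (p = 0.0024 and p = 0.00159); small differences are expected because the R² values are rounded.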

      (1.8) Multiple exploratory correlations were performed to relate MRS to EEG measures (shown in Supplementary Materials), and only specific ones were shown in the main document. Given the multiple measures from the MRS, the correlations with the EEG measures were exploratory, as stated in the text, p.16, and in Figure 4. Yet the introduction said that there was a prior hypothesis "We further hypothesized that neurotransmitter changes would relate to changes in the slope and intercept of the EEG aperiodic activity in the same subjects." It would be great if the text could be revised for consistency and the analysis described as exploratory.

      In the revised manuscript, we will improve the phrasing. We consider the correlation analyses as exploratory due to our sample size and the absence of prior work. However, we did hypothesize that both MRS and EEG markers would concurrently be altered in CC vs SC individuals.

      (1.9) The analysis for the EEG needs to take more advantage of the available data. As far as I understand, only two electrodes were used, yet far more were available as seen in their previous study (Ossandon et al., 2023). The spatial specificity is not established. The authors could use the frontal cortex electrode (FP1, FP2) signals as a control for spatial specificity in the group effects, or even better, all available electrodes and correct for multiple comparisons. Furthermore, they could use the aperiodic intercept vs Glx in SC to evaluate the specificity of the correlation to CC.

      The aperiodic intercept and slope did not differ between CC and SC individuals for Fp1 and Fp2, suggesting the spatial specificity of the results. In the revised manuscript, we will add this analysis to the supplementary material.

      Author response image 1.

      Aperiodic intercept (top) and slope (bottom) for congenital cataract-reversal (CC, red) and age-matched normally sighted control (SC, blue) individuals. Distributions of these parameters are displayed as violin plots for three conditions; at rest with eyes closed (EC), at rest with eyes open (EO) and during visual stimulation (LU). Aperiodic parameters were calculated across electrodes Fp1 and Fp2. Solid black lines indicate mean values, dotted black lines indicate median values. Coloured lines connect values of individual participants across conditions.

      Further, Glx concentration in the visual cortex did not correlate with the aperiodic intercept in the SC group (Figure 4), suggesting that this relationship was indeed specific to the CC group.

      The data from all electrodes has been analyzed and published in other studies as well (Pant et al., 2023; Ossandón et al., 2023).

      Reviewer #2 (Public Review):

      Summary:

      The manuscript reports non-invasive measures of activity and neurochemical profiles of the visual cortex in congenitally blind patients who recovered vision through the surgical removal of bilateral dense cataracts. The declared aim of the study is to find out how restoring visual function after several months or years of complete blindness impacts the balance between excitation and inhibition in the visual cortex.

      Strengths:

      The findings are undoubtedly useful for the community, as they contribute towards characterising the many ways this special population differs from normally sighted individuals. The combination of MRS and EEG measures is a promising strategy to estimate a fundamental physiological parameter - the balance between excitation and inhibition in the visual cortex, which animal studies show to be heavily dependent upon early visual experience. Thus, the reported results pave the way for further studies, which may use a similar approach to evaluate more patients and control groups.

      Weaknesses:

      (2.1) The main issue is the lack of an appropriate comparison group or condition to delineate the effect of sight recovery (as opposed to the effect of congenital blindness). Few previous studies suggested an increased excitation/Inhibition ratio in the visual cortex of congenitally blind patients; the present study reports a decreased E/I ratio instead. The authors claim that this implies a change of E/I ratio following sight recovery. However, supporting this claim would require showing a shift of E/I after vs. before the sight-recovery surgery, or at least it would require comparing patients who did and did not undergo the sight-recovery surgery (as common in the field).

      Longitudinal studies would indeed be the best way to test the hypothesis that the lower E/I ratio in the CC group observed by the present study is a consequence of sight restoration. However, longitudinal studies involving neuroimaging are an effortful challenge, particularly in research conducted outside of major developed countries and dedicated neuroimaging research facilities. Crucially, however, had CC and SC individuals, as well as permanently congenitally blind vs SC individuals (Coullon et al., 2015; Weaver et al., 2013), not differed on any neurochemical markers, such a longitudinal study might have been trivial. Thus, in order to justify and better tailor longitudinal studies, cross-sectional studies are an initial step.

      (2.2) MR Spectroscopy shows a reduced GLX/GABA ratio in patients vs. sighted controls; however, this finding remains rather isolated, not corroborated by other observations. The difference between patients and controls only emerges for the GLX/GABA ratio, but there is no accompanying difference in either the GLX or the GABA concentrations. There is an attempt to relate the MRS data with acuity measurements and electrophysiological indices, but the explorative correlational analyses do not help to build a coherent picture. A bland correlation between GLX/GABA and visual impairment is reported, but this is specific to the patients' group (N=10) and would not hold across groups (the correlation is positive, predicting the lowest GLX/GABA ratio values for the sighted controls - the opposite of what is found). There is also a strong correlation between GLX concentrations and the EEG power at the lowest temporal frequencies. Although this relation is intriguing, it only holds for a very specific combination of parameters (of the many tested): only with eyes open, only in the patient group.

      We interpret these findings differently, that is, in the context of experiments from non-human animals and the larger MRS literature.

      Homeostatic control of E/I balance assumes that the ratio of excitation (reflected here by Glx) and inhibition (reflected here by GABA+) is regulated. Like prior work (Gao et al., 2024, 2024; Narayan et al., 2022; Perica et al., 2022; Steel et al., 2020; Takado et al., 2022; Takei et al., 2016), we assumed that the ratio of Glx/GABA+ is indicative of E/I balance rather than solely the individual neurotransmitter levels. One of the motivations for assessing the ratio vs the absolute concentration is that as per the underlying E/I balance hypothesis, a change in excitation would cause a concomitant change in inhibition, and vice versa, which has been shown in non-human animal work (Fang et al., 2021; Haider et al., 2006; Tao & Poo, 2005) and modeling research (Vreeswijk & Sompolinsky, 1996; Wu et al., 2022). Importantly, our interpretation of the lower E/I ratio is not just from the Glx/GABA+ ratio, but additionally, based on the steeper EEG aperiodic slope (1-20 Hz).  

      As in the discussion section and response 1.4, we did not expect to see a lower Glx/GABA+ ratio in CC individuals. We discuss the possible reasons for the direction of the correlation with visual acuity and aperiodic offset during passive visual stimulation, and offer interpretations and (testable) hypotheses.

      We interpret the direction of the Glx/GABA+ correlation with visual acuity to imply that patients with the highest (compensatory) balancing of the consequences of congenital blindness (hyperexcitation), in light of visual stimulation, are those who recover best. Note that the sighted control group was selected based on their “normal” vision. Thus, clinical visual acuity measures are not expected to sufficiently vary, nor have the resolution to show strong correlations with neurophysiological measures. By contrast, the CC group comprised patients highly varying in visual outcomes, and thus was ideal for investigating such correlations.

      This holds for the correlation between Glx and the aperiodic intercept, as well. Previous work has suggested that the intercept of the aperiodic activity is associated with broadband spiking activity in neural circuits (Manning et al., 2009). Thus, an atypical increase of spiking activity during visual stimulation, as indirectly suggested by “old” non-human primate work on visual deprivation (Hyvärinen et al., 1981) might drive a correlation not observed in healthy populations.

      In the revised manuscript, we will more clearly indicate in the discussion that these are possible post-hoc interpretations. We argue that given the lack of such studies in humans, it is all the more important that extant data be presented completely, even if the direction of the effects is not as expected.

      (2.3) For these reasons, the reported findings do not allow us to draw firm conclusions on the relation between EEG parameters and E/I ratio or on the impact of early (vs. late) visual experience on the excitation/inhibition ratio of the human visual cortex.

      Indeed, the correlations we have tested between the E/I ratio and EEG parameters were exploratory, and have been reported as such. The goal of our study was not to compare the effects of early vs. late visual experience. The goal was to study whether early visual experience is necessary for a typical E/I ratio in visual neural circuits. We provided clear evidence in favor of this hypothesis. Thus, the present results suggest the necessity of investigating the effects of late visual deprivation. In fact, such research is missing in permanent blindness as well.

      Reviewer #3 (Public Review):

      This manuscript examines the impact of congenital visual deprivation on the excitatory/inhibitory (E/I) ratio in the visual cortex using Magnetic Resonance Spectroscopy (MRS) and electroencephalography (EEG) in individuals whose sight was restored. Ten individuals with reversed congenital cataracts were compared to age-matched, normally sighted controls, assessing the cortical E/I balance and its interrelationship to visual acuity. The study reveals that the Glx/GABA ratio in the visual cortex and the intercept and aperiodic signal are significantly altered in those with a history of early visual deprivation, suggesting persistent neurophysiological changes despite visual restoration.

      My expertise is in EEG (particularly in the decomposition of periodic and aperiodic activity) and statistical methods. I have several major concerns in terms of methodological and statistical approaches along with the (over)interpretation of the results. These major concerns are detailed below.

      (3.1) Variability in visual deprivation:

      - The document states a large variability in the duration of visual deprivation (probably also the age at restoration), with significant implications for the sensitivity period's impact on visual circuit development. The variability and its potential effects on the outcomes need thorough exploration and discussion.

      We work with a rare, unique patient population, which makes it difficult to systematically assess the effects of different visual histories while maintaining stringent inclusion criteria such as complete patterned visual deprivation at birth. Regardless, we considered the large variance in age at surgery and time since surgery as supportive of our interpretation: group differences were found despite the large variance in duration of visual deprivation. Moreover, the existing variance was used to explore possible associations between behavior and neural measures, as well as neurochemical and EEG measures.

      In the revised manuscript, we will detail the advantages and disadvantages of our CC sample, with respect to duration of congenital visual deprivation.

      (3.2) Sample size:

      - The small sample size is a major concern as it may not provide sufficient power to detect subtle effects and/or may overestimate significant effects, which then tend not to generalize to new data. This is one of the biggest drivers of the replication crisis in neuroscience.

      We address the small sample size in our discussion, and make clear that small sample sizes were due to the nature of investigations in special populations. It is worth noting that our EEG results fully align with those of a larger sample of CC individuals (Ossandón et al., 2023), giving us confidence in their validity and reproducibility. Moreover, our MRS results and correlations of those with EEG parameters were spatially specific to occipital cortex measures, as predicted.

      The main problem with the correlation analyses between MRS and EEG measures is that the sample size is simply too small to conduct such an analysis. Moreover, it is unclear from the methods section that this analysis was only conducted in the patient group (which the reviewer assumed from the plots), and not explained why this was done only in the patient group. I would highly recommend removing these correlation analyses.

      We marked the correlation analyses as exploratory; note that we do not base most of our discussion on the results of these analyses. As indicated by Reviewer 1, reporting them allows for deriving more precise hypotheses for future studies. It has to be noted that we investigate an extremely rare population, tested outside of major developed economies and dedicated neuroimaging research facilities. In addition to being a rare patient group, these individuals come from poor communities. Therefore, we consider it justified to report these correlations as exploratory, providing direction for future research.

      (3.3) Statistical concerns:

      - The statistical analyses, particularly the correlations drawn from a small sample, may not provide reliable estimates (see https://www.sciencedirect.com/science/article/pii/S0092656613000858, which clearly describes this problem).

      It would undoubtedly be better to have a larger sample size. We nonetheless think it is of value to the research community to publish this dataset, since 10 multimodal data sets from a carefully diagnosed, rare population, representing a human model for the effects of early experience on brain development, are quite a lot. Sample sizes in prior neuroimaging studies in transient blindness have most often ranged from n = 1 to n = 10. They nevertheless provided valuable direction for future research, and integration of results across multiple studies provides scientific insights.

      Identifying possible group differences was the goal of our study, with the correlations being an exploratory analysis, which we have clearly indicated in the methods, results and discussion.
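
      To make the precision of a correlation estimated from ten participants concrete, the following minimal sketch computes an approximate confidence interval with the Fisher z-transform. It is an illustration only and not part of the reported analyses; the function name `pearson_ci` is hypothetical, and the input r = -0.42 with n = 10 is simply borrowed from the Figure 4 value discussed in this review.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson r (Fisher z-transform).

    Assumes roughly bivariate-normal data; the approximation is coarse
    for very small n and is shown here only as an illustration.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                      # Fisher z of the observed r
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Illustrative input: the correlation coefficient and sample size from Figure 4.
low, high = pearson_ci(-0.42, 10)
print(f"r = -0.42, n = 10: 95% CI [{low:.2f}, {high:.2f}]")
```

      With these inputs the interval runs from about -0.83 to about 0.28, which is why such correlations are best read as hypothesis-generating.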

      - Statistical analyses for the MRS: The authors should consider some additional permutation statistics, which are more suitable for small sample sizes. The current statistical model (2x2) design ANOVA is not ideal for such small sample sizes. Moreover, it is unclear why the condition (EO & EC) was chosen as a predictor and not the brain region (visual & frontal) or neurochemicals. Finally, the authors did not provide any information on the alpha level nor any information on correction for multiple comparisons (in the methods section). Finally, even if the groups are matched w.r.t. age, the time between surgery and measurement, the duration of visual deprivation, (and sex?), these should be included as covariates as it has been shown that these are highly related to the measurements of interest (especially for the EEG measurements) and the age range of the current study is large.

      In our ANOVA models, the neurochemicals were the outcome variables, and the conditions were chosen as predictors based on prior work suggesting that Glx/GABA+ might vary with eye closure (Kurcyus et al., 2018). The study was designed based on a hypothesis of group differences localized to the occipital cortex, due to visual deprivation. The frontal cortex voxel was chosen to indicate whether these differences were spatially specific. Therefore, we conducted separate ANOVAs based on this study design.

      In the revised manuscript, we will add permutation analyses for our outcomes, as well as multiple regression models investigating whether the variance in visual history might have driven these results. Note that in the supplementary materials (S6, S7), we have reported the correlations between visual history metrics and MRS/EEG outcomes.
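
      A permutation analysis of a two-group difference can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis that will be reported: the function name `permutation_test`, the choice of the absolute difference in means as test statistic, and the example values are all hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the absolute difference in group means.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose statistic is at least as large as the observed one
    (with the usual +1 correction so that p is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Made-up ratio values for two groups of ten (NOT data from the study).
cc_ratio = [2.6, 2.9, 3.1, 2.8, 3.0, 2.7, 3.3, 2.5, 3.2, 2.9]
sc_ratio = [3.2, 3.6, 3.4, 3.8, 3.1, 3.5, 3.7, 3.3, 3.9, 3.4]
print("permutation p-value:", permutation_test(cc_ratio, sc_ratio))
```

      Because the null distribution is built from the data themselves, this approach does not rely on normality assumptions, which is the main appeal for samples of this size.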

      The alpha level used for the ANOVA models specified in the methods section was 0.05. The alpha level for the exploratory analyses reported in the main manuscript was 0.008, after correcting for (6) multiple comparisons using the Bonferroni correction, also specified in the methods. Note that the p-values following correction are expressed as multiplied by 6, due to most readers assuming an alpha level of 0.05 (see response regarding large p-values).

      We used a control group matched for age and sex. Moreover, the controls were recruited and tested in the same institutes, using the same setup. We feel that we followed the gold standards for recruiting a healthy control group for a patient group.

      - EEG statistical analyses: The same critique as for the MRS statistical analyses applies to the EEG analysis. In addition: was the 2x3 ANOVA conducted for EO and EC independently? This seems to be inconsistent with the approach in the MRS analyses, in which the authors chose EO & EC as predictors in their 2x2 ANOVA.

      The 2x3 ANOVA was not conducted independently for the eyes open/eyes closed conditions. The ANOVA conducted on the EEG metrics was 2x3 because it had group (CC, SC) and condition (eyes open (EO), eyes closed (EC) and visual stimulation (LU)) as predictors.

      - Figure 4: The authors report a p-value of >0.999 with a correlation coefficient of -0.42 with a sample size of 10 subjects. This can't be correct (it should be around: p = 0.22). All statistical analyses should be checked.

      As specified in the methods and figure legend, the reported p values in Figure 4 have been corrected using the Bonferroni correction, and therefore multiplied by the number of comparisons, leading to the seemingly large values.
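
      The arithmetic behind this exchange can be checked with a short sketch. It computes the uncorrected two-sided p-value for r = -0.42 with n = 10 from the null distribution of the Pearson coefficient and then applies a Bonferroni adjustment for six comparisons. The function names are hypothetical, the null density assumes bivariate-normal data, and capping the adjusted value at 1 is an assumption about how a value such as ">0.999" arises.

```python
def _simpson(func, a, b, steps=2000):
    """Composite Simpson integration of func over [a, b]."""
    if steps % 2:
        steps += 1
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


def pearson_p_two_sided(r, n):
    """Two-sided p-value for a Pearson r under H0: rho = 0.

    Under H0 the density of r is proportional to (1 - r^2)^((n - 4) / 2),
    so the p-value is the share of that density lying beyond |r|.
    """
    if n < 4:
        raise ValueError("this sketch requires n >= 4")
    k = (n - 4) / 2.0

    def density(x):
        # max() guards against tiny negative values from rounding near 1
        return max(0.0, 1.0 - x * x) ** k

    tail = _simpson(density, abs(r), 1.0)
    half = _simpson(density, 0.0, 1.0)
    return tail / half  # symmetric density: two tails over the full area


def bonferroni_adjust(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


p_raw = pearson_p_two_sided(-0.42, 10)
print(f"uncorrected p = {p_raw:.3f}")
print(f"Bonferroni-adjusted (6 comparisons) p = {bonferroni_adjust(p_raw, 6):.3f}")
```

      Comparing the adjusted p-value with 0.05 is equivalent to comparing the uncorrected p-value with 0.05/6 (about 0.008), so both conventions lead to the same decision.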

      Additionally, to check all statistical analyses, we put the manuscript through an independent Statistics Check (Nuijten & Polanin, 2020) (https://michelenuijten.shinyapps.io/statcheck-web/) and will upload the consistency report with the revised supplementary material.

      - Figure 2c. Eyes closed condition: The highest score of the *Glx/GABA ratio seems to be ~3.6. In subplot 2a, there seem to be 3 subjects that show a Glx/GABA ratio score > 3.6. How can this be explained? There is also a discrepancy for the eyes-closed condition.

      The three subjects that show the Glx/GABA+ ratio > 3.6 in subplot 2a are in the SC group, whereas the correlations plotted in figure 2c are only for the CC group, where the highest score is indeed ~3.6.

      (3.4) Interpretation of aperiodic signal:

      - Several recent papers demonstrated that the aperiodic signal measured in EEG or ECoG is related to various important aspects such as age, skull thickness, electrode impedance, as well as cognition. Thus, currently, very little is known about the underlying effects which influence the aperiodic intercept and slope. The entire interpretation of the aperiodic slope as a proxy for E/I is based on a computational model and simulation (as described in the Gao et al. paper).

      Apart from the modeling work from Gao et al., multiple papers, which have also been cited, used ECoG, EEG and MEG and showed concomitant changes in aperiodic activity with pharmacological manipulation of the E/I ratio (Colombo et al., 2019; Molina et al., 2020; Muthukumaraswamy & Liley, 2018). Further, several prior studies have interpreted changes in the aperiodic slope as reflective of changes in the E/I ratio, including studies of developmental groups (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Schaworonkow & Voytek, 2021) as well as patient groups (Molina et al., 2020; Ostlund et al., 2021).

      In the revised manuscript, we will cite those studies not already included in the introduction.

      - Especially the aperiodic intercept is a very sensitive measure to many influences (e.g. skull thickness, electrode impedance...). As crucial results (correlation aperiodic intercept and MRS measures) are facing this problem, this needs to be reevaluated. It is safer to make statements on the aperiodic slope than intercept. In theory, some of the potentially confounding measures are available to the authors (e.g. skull thickness can be computed from T1w images; electrode impedances are usually acquired alongside the EEG data) and could be therefore controlled.

      All electrophysiological measures indeed depend on parameters such as skull thickness and electrode impedance. As in the extant literature using neurophysiological measures to compare brain function between patient and control groups, we used a control group matched in age/ sex, recruited in the same region, tested with the same devices, and analyzed with the same analysis pipeline. For example, impedance was kept below 10 kOhm for all subjects. There is no evidence available suggesting that congenital cataracts are associated with changes in skull thickness that would cause the observed pattern of group results. Moreover, we cannot think of how any of the exploratory correlations between neurophysiological measures and MRS measures could be accounted for by a difference e.g. in skull thickness.

      - The authors wrote: "Higher frequencies (such as 20-40 Hz) have been predominantly associated with local circuit activity and feedforward signaling (Bastos et al., 2018; Van Kerkoerle et al., 2014); the increased 20-40 Hz slope may therefore signal increased spontaneous spiking activity in local networks. We speculate that the steeper slope of the aperiodic activity for the lower frequency range (1-20 Hz) in CC individuals reflects the concomitant increase in inhibition." The authors confuse the interpretation of periodic and aperiodic signals. This section refers to the interpretation of the periodic signal (higher frequencies). This interpretation cannot simply be translated to the aperiodic signal (slope).

      Prior work has not always separated the aperiodic and periodic components, making it unclear what might have driven these effects in our data. The interpretation of the higher frequency range was intended to contrast with the interpretations of lower frequency range, in order to speculate as to why the two aperiodic fits might go in differing directions. We will clarify our interpretation in the revised manuscript. Note that Ossandon et al. reported highly similar results (group differences for CC individuals and for permanently congenitally blind humans) for the aperiodic activity between 20-40 Hz and oscillatory activity in the gamma range. We will allude to these findings in the revised manuscript.

      - The authors further wrote: "We used the slope of the aperiodic (1/f) component of the EEG spectrum as an estimate of E/I ratio (Gao et al., 2017; Medel et al., 2020; Muthukumaraswamy & Liley, 2018)." This is a highly speculative interpretation with very little empirical evidence. These papers were conducted with ECoG data (mostly in animals) and mostly under anesthesia. Thus, these studies only allow an indirect interpretation of what actually influences the 1/f slope in EEG measurements.

      Note that Muthukumaraswamy et al. (2018) used different types of pharmacological manipulations and analyzed periodic and aperiodic MEG activity in addition to monkey ECoG. Medel et al. (2020; now published as Medel et al., 2023) compared EEG activity in addition to ECoG data after propofol administration. The interpretation of our results is in line with a number of recent studies in developing (Hill et al., 2022; Schaworonkow & Voytek, 2021) and special populations using EEG. As mentioned above, several prior studies have used the slope of the 1/f component/aperiodic activity as an indirect measure of the E/I ratio (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Molina et al., 2020; Ostlund et al., 2021; Schaworonkow & Voytek, 2021), including studies using scalp-recorded EEG. We will make clearer in the introduction of the revised manuscript that this metric is indirect.

      While a full understanding of aperiodic activity still needs to be provided, some convergent ideas have emerged. We think that our results contribute to this enterprise, since our study is, to the best of our knowledge, the first to assess both MRS-measured neurotransmitter levels and EEG aperiodic activity.

      (3.5) Problems with EEG preprocessing and analysis:

      - It seems that the authors did not identify bad channels nor address the line noise issue (even a problem if a low pass filter of below-the-line noise was applied).

      As pointed out in the methods and Figure 1, we only analyzed data from two channels, O1 and O2, neither of which were rejected for any participant. Channel rejection was performed for the larger dataset, published elsewhere (Ossandón et al., 2023; Pant et al., 2023).

      In both published works, we did not consider frequency ranges above 40 Hz to avoid any possible contamination with line noise. Here, we focused on activity between 0 and 20 Hz, definitely excluding line noise contaminations. The low pass filter (FIR, 1-45 Hz) guaranteed that any spill-over effects of line noise would be restricted to frequencies just below the upper cutoff frequency.

      Additionally, a prior version of the analysis used the cleanline.m function to remove line noise before filtering, and the group differences remained stable. We will report this analysis in the supplementary version of the revised manuscript. Further, both groups were measured in the same lab, making line noise as an account for the observed group effects highly unlikely. Finally, any of the exploratory MRS-EEG correlations would be hard to explain if the EEG parameters would be contaminated with line noise.

      - What was the percentage of segments that needed to be rejected due to the 120μV criteria? This should be reported specifically for EO & EC and controls and patients.

      The mean percentage of 1-second segments rejected for each resting state condition is given below. The mean percentage of 6.25-second segments rejected in each group for the visual stimulation condition is also included, and these values will be added to the revised manuscript:

      Author response table 3.

      - The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      These data were collected as part of a study designed to evoke alpha activity with visual white noise, which ranged in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; Vanrullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).

      - "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This will be explicitly stated in the revised manuscript.

      - "We excluded the alpha range (8-14 Hz) for this fit to avoid biasing the results due to documented differences in alpha activity between CC and SC individuals (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023)." This does not really make sense, as the FOOOF algorithm first fits the 1/f slope, for which the alpha activity is not relevant.

      We did not use the FOOOF algorithm/toolbox in this manuscript. As stated in the methods, we used a 1/f fit to the 1-20 Hz spectrum in the log-log space, and subtracted this fit from the original spectrum to obtain the corrected spectrum. Given the pronounced difference in alpha power between groups (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023), we were concerned it might drive differences in the exponent values.  Our analysis pipeline had been adapted from previous publications of our group and other labs (Ossandón et al., 2023; Voytek et al., 2015; Waschke et al., 2017).

      We have conducted the analysis with and without the exclusion of the alpha range, as well as using the FOOOF toolbox both in the 1-20 Hz and 20-40 Hz ranges (Ossandón et al., 2023). The findings of a steeper slope in the 1-20 Hz range as well as lower alpha power in CC vs SC individuals remained stable. In Ossandón et al., the comparison between the piecewise fits and FOOOF fits led the authors to use the former as it outperformed the FOOOF algorithm for their data.
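
      The fitting procedure described in this response can be sketched as a straight-line fit in log-log space that skips the alpha band. This is a simplified illustration on a synthetic spectrum, not the authors' pipeline: the function names, the ordinary least-squares fit, the inclusive handling of the 8-14 Hz boundaries and the synthetic alpha peak are all assumptions.

```python
import math


def fit_aperiodic(freqs, power, fit_range=(1.0, 20.0), exclude=(8.0, 14.0)):
    """Least-squares line through log10(power) vs log10(frequency).

    Frequencies outside fit_range, or inside the excluded band, are ignored.
    Returns (slope, intercept); a more negative slope means a steeper spectrum.
    """
    xs, ys = [], []
    for f, p in zip(freqs, power):
        in_range = fit_range[0] <= f <= fit_range[1]
        excluded = exclude is not None and exclude[0] <= f <= exclude[1]
        if in_range and not excluded:
            xs.append(math.log10(f))
            ys.append(math.log10(p))
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def corrected_spectrum(freqs, power, slope, intercept):
    """Log-power spectrum with the fitted aperiodic line subtracted."""
    return [math.log10(p) - (intercept + slope * math.log10(f))
            for f, p in zip(freqs, power)]


# Synthetic spectrum: 10 * f^-1.5 plus a narrow alpha peak at 10 Hz.
freqs = [float(f) for f in range(1, 21)]
power = [10.0 * f ** -1.5 + math.exp(-0.5 * ((f - 10.0) / 0.5) ** 2)
         for f in freqs]

slope_excl, intercept_excl = fit_aperiodic(freqs, power)
slope_full, _ = fit_aperiodic(freqs, power, exclude=None)
print(f"slope excluding alpha: {slope_excl:.3f}")
print(f"slope including alpha: {slope_full:.3f}")
```

      In this synthetic case the fit that skips the alpha band recovers the generating exponent, whereas including the peak flattens the estimate, which is the bias the exclusion was meant to avoid.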

      - The model fits of the 1/f fitting for EO, EC, and both participant groups should be reported.

      In Figure 3 of the manuscript, we depicted the mean spectra and 1/f fits for each group. We will add the fit quality metrics and show individual subjects’ fits in the revised manuscript.

      (3.6) Validity of GABA measurements and results:

      - According to a newer study by the authors of the Gannet toolbox (https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/nbm.5076), the reliability and reproducibility of the gamma-aminobutyric acid (GABA) measurement can vary significantly depending on acquisition and modeling parameters. Thus, did the authors address these challenges?

      We took care of data quality while acquiring MRS data by ensuring appropriate voxel placement and linewidth prior to scanning. Acquisition as well as modeling parameters were constant for both groups, so they cannot have driven group differences.

      The linked article compares the reproducibility of GABA measurement using Osprey, which was released in 2020 and uses linear combination modeling to fit the peak as opposed to Gannet’s simple peak fitting (Hupfeld et al., 2024). The study finds better test-retest reliability for Osprey compared to Gannet’s method.

      As the present work was conceptualized in 2018, we used Gannet 3.0, which was the state-of-the-art edited spectral analysis toolbox at the time, and still is widely used. In the revised manuscript, we will include a supplementary section reanalyzing the main findings with Osprey.

      - Furthermore, the authors wrote: "We confirmed the within-subject stability of metabolite quantification by testing a subset of the sighted controls (n=6) 2-4 weeks apart." Looking at the supplementary Figure 5 (which would be better plotted as ICC or Bland-Altman plots), the within-subject stability compared to between-subject variability seems not to be great. Furthermore, I don't think such a small sample size qualifies for a rigorous assessment of stability.

      Indeed, we did not intend to provide a rigorous assessment of within-subject stability. Rather, we aimed to confirm that data quality/concentration ratios did not systematically differ between the same subjects tested longitudinally; driven, for example, by scanner heating or time of day. As with the phantom testing, we attempted to give readers an idea of the quality of the data, as they were collected from a primarily clinical rather than a research site.

      In the revised manuscript we will remove the statement regarding stability, and add the Bland-Altman plot.
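
      The quantities summarized in a Bland-Altman plot can be sketched as follows: the per-subject mean of two sessions, the per-subject difference, the mean difference (bias) and the 95% limits of agreement. The function name `bland_altman` and the example values are hypothetical and do not come from the study.

```python
from statistics import mean, stdev


def bland_altman(session1, session2):
    """Bias and 95% limits of agreement between two repeated measurements."""
    if len(session1) != len(session2) or len(session1) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    diffs = [a - b for a, b in zip(session1, session2)]
    means = [(a + b) / 2.0 for a, b in zip(session1, session2)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of differences
    return {"means": means, "diffs": diffs, "bias": bias,
            "lower": bias - spread, "upper": bias + spread}


# Made-up test-retest values for six subjects (NOT data from the study).
first_scan = [3.1, 2.8, 3.4, 3.0, 3.6, 2.9]
second_scan = [3.0, 2.9, 3.3, 3.2, 3.5, 2.8]
result = bland_altman(first_scan, second_scan)
print(f"bias = {result['bias']:.3f}, "
      f"limits of agreement = [{result['lower']:.3f}, {result['upper']:.3f}]")
```

      Plotting `diffs` against `means` with horizontal lines at the bias and at both limits gives the usual figure; a bias near zero indicates no systematic shift between sessions.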

      - "Why might an enhanced inhibitory drive, as indicated by the lower Glx/GABA ratio" Is this interpretation really warranted, as the results of the group differences in the Glx/GABA ratio seem to be rather driven by a decreased Glx concentration in CC rather than an increased GABA (see Figure 2).

      We used the Glx/GABA+ ratio as a measure, rather than individual Glx or GABA+ concentration, which did not significantly differ between groups. As detailed in Response 2.2, we think this metric aligns better with an underlying E/I balance hypothesis and has been used in many previous studies (Gao et al., 2024; Liu et al., 2015; Narayan et al., 2022; Perica et al., 2022).

      Our interpretation of an enhanced inhibitory drive additionally comes from the combination of aperiodic EEG (1-20 Hz) and MRS measures, which, when considered together, are consistent with a decreased E/I ratio.

      In the revised manuscript, we will rephrase this sentence accordingly. 

      - Glx concentration predicted the aperiodic intercept in CC individuals' visual cortices during ambient and flickering visual stimulation. Why specifically investigate the Glx concentration, when the paper is about E/I ratio?

      As stated in the methods, we exploratorily assessed the relationship between all MRS parameters (Glx, GABA+ and Glx/GABA+ ratio) with the aperiodic parameters (slope, offset), and corrected for multiple comparisons accordingly. We think this is a worthwhile analysis considering the rarity of the dataset/population (see 1.2, 1.6, 2.1 and reviewer 1’s comments about future hypotheses). We only report the Glx – aperiodic intercept correlation in the main manuscript as it survived correction for multiple comparisons.

      (3.7) Interpretation of the correlation between MRS measurements and EEG aperiodic signal:

      - The authors wrote: "The intercept of the aperiodic activity was highly correlated with the Glx concentration during rest with eyes open and during flickering stimulation (also see Supplementary Material S11). Based on the assumption that the aperiodic intercept reflects broadband firing (Manning et al., 2009; Winawer et al., 2013), this suggests that the Glx concentration might be related to broadband firing in CC individuals during active and passive visual stimulation." These results should not be interpreted (or only with great caution) for several reasons (see also problem with influences on aperiodic intercept and small sample size). This is a result of the exploratory analyses of correlating every EEG parameter with every MRS parameter. This requires well-powered replication before any interpretation can be provided. Furthermore and importantly: why should this be specifically only in CC patients, but not in the SC control group?

      We indicate clearly in all parts of the manuscript that these correlations are presented as exploratory. Further, we interpret the Glx-aperiodic offset correlation, and none of the others, as it survived the Bonferroni correction for multiple comparisons. We offer a hypothesis in the discussion section as to why such a correlation might exist in the CC but not the SC group (see response 2.2), and do not speculate further.

      (3.8) Language and presentation:

      - The manuscript requires language improvements and correction of numerous typos. Over-simplifications and unclear statements are present, which could mislead or confuse readers (see also interpretation of aperiodic signal).

      In the revision, we will check that speculations are clearly marked and typos are removed.

      - The authors state that "Together, the present results provide strong evidence for experience-dependent development of the E/I ratio in the human visual cortex, with consequences for behavior." The results of the study do not provide any strong evidence, because of the small sample size and exploratory analyses approach and not accounting for possible confounding factors.

      We disagree with this statement and allude to convergent evidence of both MRS and neurophysiological measures. The latter link to corresponding results observed in a larger sample of CC individuals (Ossandón et al., 2023).

      - "Our results imply a change in neurotransmitter concentrations as a consequence of *restoring* vision following congenital blindness." This is a speculative statement to infer a causal relationship on cross-sectional data.

      As mentioned under 2.1, we conducted a cross-sectional study which might justify future longitudinal work. In order to advance science, new testable hypotheses were put forward at the end of a manuscript.

      In the revised manuscript we will add “might imply” to better indicate the hypothetical character of this idea.

      - In the limitation section, the authors wrote: "The sample size of the present study is relatively high for the rare population, but undoubtedly, overall, rather small." This sentence should be rewritten, as the study is plainly underpowered. The further justification "We nevertheless think that our results are valid. Our findings neurochemically (Glx and GABA+ concentration), and anatomically (visual cortex) specific. The MRS parameters varied with parameters of the aperiodic EEG activity and visual acuity. The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38) (Ossandón et al., 2023), and effects of chronological age were as expected from the literature." These statements do not provide any validation or justification of small samples. Furthermore, the current data set is a subset of an earlier published paper by the same authors: "The EEG data sets reported here were part of data published earlier (Ossandón et al., 2023; Pant et al., 2023)." Thus, the statement "The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38)" is a circular argument and should be avoided.

      Our intention was not to justify having a small sample, but to justify why we think the results might be valid as they align with/replicate existing literature.

      In the revised manuscript, we will add a figure showing that the EEG results of the 10 subjects considered here correspond to those of the 28 other subjects of Ossandón et al. (2023). We will adapt the text accordingly, clearly stating that the pattern of EEG results of the ten subjects reported here replicates that of the 28 additional subjects of Ossandón et al. (2023).

      References

      Barnes, S. J., Sammons, R. P., Jacobsen, R. I., Mackie, J., Keller, G. B., & Keck, T. (2015). Subnetwork-specific homeostatic plasticity in mouse visual cortex in vivo. Neuron, 86(5), 1290–1303. https://doi.org/10.1016/J.NEURON.2015.05.010

      Bernabeu, A., Alfaro, A., García, M., & Fernández, E. (2009). Proton magnetic resonance spectroscopy (1H-MRS) reveals the presence of elevated myo-inositol in the occipital cortex of blind subjects. NeuroImage, 47(4), 1172–1176. https://doi.org/10.1016/j.neuroimage.2009.04.080

      Bottari, D., Troje, N. F., Ley, P., Hense, M., Kekunnaya, R., & Röder, B. (2016). Sight restoration after congenital blindness does not reinstate alpha oscillatory activity in humans. Scientific Reports. https://doi.org/10.1038/srep24683

      Colombo, M. A., Napolitani, M., Boly, M., Gosseries, O., Casarotto, S., Rosanova, M., Brichant, J. F., Boveroux, P., Rex, S., Laureys, S., Massimini, M., Chieregato, A., & Sarasso, S. (2019). The spectral exponent of the resting EEG indexes the presence of consciousness during unresponsiveness induced by propofol, xenon, and ketamine. NeuroImage, 189, 631–644. https://doi.org/10.1016/j.neuroimage.2019.01.024

      Consideration of Sample Size in Neuroscience Studies. (2020). Journal of Neuroscience, 40(21), 4076–4077. https://doi.org/10.1523/JNEUROSCI.0866-20.2020

      Coullon, G. S. L., Emir, U. E., Fine, I., Watkins, K. E., & Bridge, H. (2015). Neurochemical changes in the pericalcarine cortex in congenital blindness attributable to bilateral anophthalmia. Journal of Neurophysiology. https://doi.org/10.1152/jn.00567.2015

      Fang, Q., Li, Y. T., Peng, B., Li, Z., Zhang, L. I., & Tao, H. W. (2021). Balanced enhancements of synaptic excitation and inhibition underlie developmental maturation of receptive fields in the mouse visual cortex. Journal of Neuroscience, 41(49), 10065–10079. https://doi.org/10.1523/JNEUROSCI.0442-21.2021

      Favaro, J., Colombo, M. A., Mikulan, E., Sartori, S., Nosadini, M., Pelizza, M. F., Rosanova, M., Sarasso, S., Massimini, M., & Toldo, I. (2023). The maturation of aperiodic EEG activity across development reveals a progressive differentiation of wakefulness from sleep. NeuroImage, 277. https://doi.org/10.1016/J.NEUROIMAGE.2023.120264

      Gao, Y., Liu, Y., Zhao, S., Liu, Y., Zhang, C., Hui, S., Mikkelsen, M., Edden, R. A. E., Meng, X., Yu, B., & Xiao, L. (2024). MRS study on the correlation between frontal GABA+/Glx ratio and abnormal cognitive function in medication-naive patients with narcolepsy. Sleep Medicine, 119, 1–8. https://doi.org/10.1016/j.sleep.2024.04.004

      Haider, B., Duque, A., Hasenstaub, A. R., & McCormick, D. A. (2006). Neocortical network activity in vivo is generated through a dynamic balance of excitation and inhibition. Journal of Neuroscience. https://doi.org/10.1523/JNEUROSCI.5297-05.2006

      Hill, A. T., Clark, G. M., Bigelow, F. J., Lum, J. A. G., & Enticott, P. G. (2022). Periodic and aperiodic neural activity displays age-dependent changes across early-to-middle childhood. Developmental Cognitive Neuroscience, 54, 101076. https://doi.org/10.1016/J.DCN.2022.101076

      Hupfeld, K. E., Zöllner, H. J., Hui, S. C. N., Song, Y., Murali-Manohar, S., Yedavalli, V., Oeltzschner, G., Prisciandaro, J. J., & Edden, R. A. E. (2024). Impact of acquisition and modeling parameters on the test–retest reproducibility of edited GABA+. NMR in Biomedicine, 37(4), e5076. https://doi.org/10.1002/nbm.5076

      Hyvärinen, J., Carlson, S., & Hyvärinen, L. (1981). Early visual deprivation alters modality of neuronal responses in area 19 of monkey cortex. Neuroscience Letters, 26(3), 239–243. https://doi.org/10.1016/0304-3940(81)90139-7

      Juchem, C., & de Graaf, R. A. (2017). B0 magnetic field homogeneity and shimming for in vivo magnetic resonance spectroscopy. Analytical Biochemistry, 529, 17–29. https://doi.org/10.1016/j.ab.2016.06.003

      Keck, T., Hübener, M., & Bonhoeffer, T. (2017). Interactions between synaptic homeostatic mechanisms: An attempt to reconcile BCM theory, synaptic scaling, and changing excitation/inhibition balance. Current Opinion in Neurobiology, 43, 87–93. https://doi.org/10.1016/J.CONB.2017.02.003

      Kurcyus, K., Annac, E., Hanning, N. M., Harris, A. D., Oeltzschner, G., Edden, R., & Riedl, V. (2018). Opposite Dynamics of GABA and Glutamate Levels in the Occipital Cortex during Visual Processing. Journal of Neuroscience, 38(46), 9967–9976. https://doi.org/10.1523/JNEUROSCI.1214-18.2018

      Liu, B., Wang, G., Gao, D., Gao, F., Zhao, B., Qiao, M., Yang, H., Yu, Y., Ren, F., Yang, P., Chen, W., & Rae, C. D. (2015). Alterations of GABA and glutamate-glutamine levels in premenstrual dysphoric disorder: A 3T proton magnetic resonance spectroscopy study. Psychiatry Research - Neuroimaging, 231(1), 64–70. https://doi.org/10.1016/J.PSCYCHRESNS.2014.10.020

      Lunghi, C., Berchicci, M., Morrone, M. C., & Di Russo, F. (2015). Short-term monocular deprivation alters early components of visual evoked potentials. The Journal of Physiology, 593(19), 4361. https://doi.org/10.1113/JP270950

      Maier, S., Düppers, A. L., Runge, K., Dacko, M., Lange, T., Fangmeier, T., Riedel, A., Ebert, D., Endres, D., Domschke, K., Perlov, E., Nickel, K., & Tebartz van Elst, L. (2022). Increased prefrontal GABA concentrations in adults with autism spectrum disorders. Autism Research, 15(7), 1222–1236. https://doi.org/10.1002/aur.2740

      Manning, J. R., Jacobs, J., Fried, I., & Kahana, M. J. (2009). Broadband shifts in local field potential power spectra are correlated with single-neuron spiking in humans. Journal of Neuroscience, 29(43), 13613–13620. https://doi.org/10.1523/JNEUROSCI.2041-09.2009

      McSweeney, M., Morales, S., Valadez, E. A., Buzzell, G. A., Yoder, L., Fifer, W. P., Pini, N., Shuffrey, L. C., Elliott, A. J., Isler, J. R., & Fox, N. A. (2023). Age-related trends in aperiodic EEG activity and alpha oscillations during early- to middle-childhood. NeuroImage, 269, 119925. https://doi.org/10.1016/j.neuroimage.2023.119925

      Medel, V., Irani, M., Crossley, N., Ossandón, T., & Boncompte, G. (2023). Complexity and 1/f slope jointly reflect brain states. Scientific Reports, 13(1), 21700. https://doi.org/10.1038/s41598-023-47316-0

      Medel, V., Irani, M., Ossandón, T., & Boncompte, G. (2020). Complexity and 1/f slope jointly reflect cortical states across different E/I balances. bioRxiv, 2020.09.15.298497. https://doi.org/10.1101/2020.09.15.298497

      Molina, J. L., Voytek, B., Thomas, M. L., Joshi, Y. B., Bhakta, S. G., Talledo, J. A., Swerdlow, N. R., & Light, G. A. (2020). Memantine Effects on Electroencephalographic Measures of Putative Excitatory/Inhibitory Balance in Schizophrenia. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(6), 562–568. https://doi.org/10.1016/j.bpsc.2020.02.004

      Mukerji, A., Byrne, K. N., Yang, E., Levi, D. M., & Silver, M. A. (2022). Visual cortical γ−aminobutyric acid and perceptual suppression in amblyopia. Frontiers in Human Neuroscience, 16. https://doi.org/10.3389/fnhum.2022.949395

      Muthukumaraswamy, S. D., & Liley, D. T. (2018). 1/F electrophysiological spectra in resting and drug-induced states can be explained by the dynamics of multiple oscillatory relaxation processes. NeuroImage, 179, 582–595. https://doi.org/10.1016/j.neuroimage.2018.06.068

      Narayan, G. A., Hill, K. R., Wengler, K., He, X., Wang, J., Yang, J., Parsey, R. V., & DeLorenzo, C. (2022). Does the change in glutamate to GABA ratio correlate with change in depression severity? A randomized, double-blind clinical trial. Molecular Psychiatry, 27(9), 3833–3841. https://doi.org/10.1038/s41380-022-01730-4

      Nuijten, M. B., & Polanin, J. R. (2020). “statcheck”: Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods, 11(5), 574–579. https://doi.org/10.1002/jrsm.1408

      Ossandón, J. P., Stange, L., Gudi-Mindermann, H., Rimmele, J. M., Sourav, S., Bottari, D., Kekunnaya, R., & Röder, B. (2023). The development of oscillatory and aperiodic resting state activity is linked to a sensitive period in humans. NeuroImage, 275, 120171. https://doi.org/10.1016/J.NEUROIMAGE.2023.120171

      Ostlund, B. D., Alperin, B. R., Drew, T., & Karalunas, S. L. (2021). Behavioral and cognitive correlates of the aperiodic (1/f-like) exponent of the EEG power spectrum in adolescents with and without ADHD. Developmental Cognitive Neuroscience, 48, 100931. https://doi.org/10.1016/j.dcn.2021.100931

      Pant, R., Ossandón, J., Stange, L., Shareef, I., Kekunnaya, R., & Röder, B. (2023). Stimulus-evoked and resting-state alpha oscillations show a linked dependence on patterned visual experience for development. NeuroImage: Clinical, 103375. https://doi.org/10.1016/J.NICL.2023.103375

      Perica, M. I., Calabro, F. J., Larsen, B., Foran, W., Yushmanov, V. E., Hetherington, H., Tervo-Clemmens, B., Moon, C.-H., & Luna, B. (2022). Development of frontal GABA and glutamate supports excitation/inhibition balance from adolescence into adulthood. Progress in Neurobiology, 219, 102370. https://doi.org/10.1016/j.pneurobio.2022.102370

      Pitchaimuthu, K., Wu, Q. Z., Carter, O., Nguyen, B. N., Ahn, S., Egan, G. F., & McKendrick, A. M. (2017). Occipital GABA levels in older adults and their relationship to visual perceptual suppression. Scientific Reports, 7(1). https://doi.org/10.1038/S41598-017-14577-5

      Rideaux, R., Ehrhardt, S. E., Wards, Y., Filmer, H. L., Jin, J., Deelchand, D. K., Marjańska, M., Mattingley, J. B., & Dux, P. E. (2022). On the relationship between GABA+ and glutamate across the brain. NeuroImage, 257, 119273. https://doi.org/10.1016/J.NEUROIMAGE.2022.119273

      Schaworonkow, N., & Voytek, B. (2021). Longitudinal changes in aperiodic and periodic activity in electrophysiological recordings in the first seven months of life. Developmental Cognitive Neuroscience, 47. https://doi.org/10.1016/j.dcn.2020.100895

      Schwenk, J. C. B., VanRullen, R., & Bremmer, F. (2020). Dynamics of Visual Perceptual Echoes Following Short-Term Visual Deprivation. Cerebral Cortex Communications, 1(1). https://doi.org/10.1093/TEXCOM/TGAA012

      Sengpiel, F., Jirmann, K.-U., Vorobyov, V., & Eysel, U. T. (2006). Strabismic Suppression Is Mediated by Inhibitory Interactions in the Primary Visual Cortex. Cerebral Cortex, 16(12), 1750–1758. https://doi.org/10.1093/cercor/bhj110

      Steel, A., Mikkelsen, M., Edden, R. A. E., & Robertson, C. E. (2020). Regional balance between glutamate+glutamine and GABA+ in the resting human brain. NeuroImage, 220. https://doi.org/10.1016/J.NEUROIMAGE.2020.117112

      Takado, Y., Takuwa, H., Sampei, K., Urushihata, T., Takahashi, M., Shimojo, M., Uchida, S., Nitta, N., Shibata, S., Nagashima, K., Ochi, Y., Ono, M., Maeda, J., Tomita, Y., Sahara, N., Near, J., Aoki, I., Shibata, K., & Higuchi, M. (2022). MRS-measured glutamate versus GABA reflects excitatory versus inhibitory neural activities in awake mice. Journal of Cerebral Blood Flow & Metabolism, 42(1), 197. https://doi.org/10.1177/0271678X211045449

      Takei, Y., Fujihara, K., Tagawa, M., Hironaga, N., Near, J., Kasagi, M., Takahashi, Y., Motegi, T., Suzuki, Y., Aoyama, Y., Sakurai, N., Yamaguchi, M., Tobimatsu, S., Ujita, K., Tsushima, Y., Narita, K., & Fukuda, M. (2016). The inhibition/excitation ratio related to task-induced oscillatory modulations during a working memory task: A multimodal-imaging study using MEG and MRS. NeuroImage, 128, 302–315. https://doi.org/10.1016/J.NEUROIMAGE.2015.12.057

      Tao, H. W., & Poo, M. M. (2005). Activity-dependent matching of excitatory and inhibitory inputs during refinement of visual receptive fields. Neuron, 45(6), 829–836. https://doi.org/10.1016/J.NEURON.2005.01.046

      VanRullen, R., & MacDonald, J. S. P. (2012). Perceptual echoes at 10 Hz in the human brain. Current Biology. https://doi.org/10.1016/j.cub.2012.03.050

      Voytek, B., Kramer, M. A., Case, J., Lepage, K. Q., Tempesta, Z. R., Knight, R. T., & Gazzaley, A. (2015). Age-related changes in 1/f neural electrophysiological noise. Journal of Neuroscience, 35(38). https://doi.org/10.1523/JNEUROSCI.2332-14.2015

      Vreeswijk, C. V., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726. https://doi.org/10.1126/SCIENCE.274.5293.1724

      Waschke, L., Wöstmann, M., & Obleser, J. (2017). States and traits of neural irregularity in the age-varying human brain. Scientific Reports, 7(1), 1–12. https://doi.org/10.1038/s41598-017-17766-4

      Weaver, K. E., Richards, T. L., Saenz, M., Petropoulos, H., & Fine, I. (2013). Neurochemical changes within human early blind occipital cortex. Neuroscience. https://doi.org/10.1016/j.neuroscience.2013.08.004

      Wu, Y. K., Miehl, C., & Gjorgjieva, J. (2022). Regulation of circuit organization and function through inhibitory synaptic plasticity. Trends in Neurosciences, 45(12), 884–898. https://doi.org/10.1016/J.TINS.2022.10.006

    1. Author response:

      Reviewer #1 (Public review):

      (1) Legionella effectors are often activated by binding to eukaryote-specific host factors, including actin. The authors should test the following: a) whether Lfat1 can fatty acylate small G-proteins in vitro; b) whether this activity is dependent on actin binding; and c) whether expression of the Y240A mutant in mammalian cells affects the fatty acylation of Rac3 (Figure 6B), or other small G-proteins.

      We were not able to express and purify full-length recombinant Lfat1 to test fatty acylation of small GTPases in vitro. However, when overexpressed in cells, the Y240A mutant retained the ability to fatty acylate Rac3 and another small GTPase, RheB (see Author response image 1 below). We postulate that under infection conditions, actin binding might be required to fatty acylate certain GTPases because only a small amount of effector protein is secreted into the host cell.

      Author response image 1.

      (2) It should be demonstrated that lysine residues on small G-proteins are indeed targeted by Lfat1. Ideally, the functional consequences of these modifications should also be investigated. For example, does fatty acylation of G-proteins affect GTPase activity or binding to downstream effectors?

      We mutated K178 on RheB and showed that this mutation abolished its fatty acylation by Lfat1 (see Author response image 2 below). We were not able to test whether fatty acylation by Lfat1 affects downstream effector binding.

      Author response image 2.

      (3) Line 138: Can the authors clarify whether the Lfat1 ABD induces bundling of F-actin filaments or promotes actin oligomerization? Does the Lfat1 ABD form multimers that bring multiple filaments together? If Lfat1 induces actin oligomerization, this effect should be experimentally tested and reported. Additionally, the impact of Lfat1 binding on actin filament stability should be assessed. This is particularly important given the proposed use of the ABD as an actin probe.

      The ABD does not form oligomers, as evidenced by its gel filtration profile. However, we do see F-actin bundling in our in vitro F-actin polymerization experiment when both actin and the ABD are at high concentrations (data not shown). At low ABD concentrations, there is no aggregation or bundling of F-actin.

      (4) Line 180: I think it's too premature to refer to the interaction as having "high specificity and affinity." We really don't know what else it's binding to.

      We have revised the text and reworded the sentence by removing "high specificity and affinity."

      (5) The authors should reconsider the color scheme used in the structural figures, particularly in Figures 2D and S4.

      We are not sure what the reviewer's concern is regarding the color scheme of the structural figures.

      (6) In Figure 3E, the WT curve fits the data poorly, possibly because the actin concentration exceeds the Kd of the interaction. It might fit better to a quadratic.

      We have performed quadratic fitting and replaced Figure 3E.
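
      As a sketch of what a quadratic fit means here: when the concentration of the fixed binding partner is not small relative to Kd, depletion of the titrated partner cannot be ignored, and the bound fraction follows the quadratic solution of the 1:1 mass-action equation rather than a simple hyperbola. The function names, parameter names, and numbers below are illustrative assumptions and do not reproduce the authors' fitting procedure.

```python
import math

def fraction_bound_quadratic(fixed_total, titrant_total, kd):
    """Fraction of the fixed species in complex, from the exact 1:1 mass-action solution."""
    b = fixed_total + titrant_total + kd
    # Smaller root of C^2 - b*C + fixed_total*titrant_total = 0 is the physical one.
    complex_conc = (b - math.sqrt(b * b - 4.0 * fixed_total * titrant_total)) / 2.0
    return complex_conc / fixed_total

def fraction_bound_hyperbolic(titrant_total, kd):
    """Simple hyperbola; a good approximation only when fixed_total is much smaller than kd."""
    return titrant_total / (titrant_total + kd)

# Illustrative numbers only (arbitrary concentration units).
fixed, titrant, kd = 2.0, 3.0, 1.0
fb_quadratic = fraction_bound_quadratic(fixed, titrant, kd)
fb_hyperbolic = fraction_bound_hyperbolic(titrant, kd)
```

      With these illustrative values the quadratic model gives a bound fraction of about 0.63, whereas the hyperbola gives 0.75, which is why a hyperbolic fit can describe the data poorly when the fixed concentration exceeds Kd.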

      (7) The authors propose that the individual helices of the Lfat1 ABD could be expressed on separate proteins and used to target multi-component biological complexes to F-actin by genetically fusing each component to a split alpha-helix. This is an intriguing idea, but it should be tested as a proof of concept to support its feasibility and potential utility.

      It is a good suggestion. We plan to thoroughly test the feasibility of this idea as one of our future directions.

      (8) The plot in Figure S2D appears cropped on the X-axis or was generated from a ~2× binned map rather than the deposited one (pixel size ~0.83 Å, plot suggests ~1.6 Å). The reported pixel size is inconsistent between the Methods and Table 1; please clarify whether 0.83 Å refers to super-resolution.

      Yes, 0.83 Å is the super-resolution pixel size. We have updated the cryo-EM table accordingly.

      Reviewer #2 (Public review):

      Weaknesses:

      (1) The authors should use biochemical reactions to analyze the KFAT activity of Lfat1 on one or two small GTPases shown to be modified by this effector in cellulo. Such reactions may allow them to determine the role of actin binding in its biochemical activity. This notion is particularly relevant in light of recent studies showing that actin is a co-factor for the activity of LnaB and Ceg14 (PMID: 39009586; PMID: 38776962; PMID: 40394005). In addition, the study should be discussed in the context of these recent findings on the role of actin in the activity of L. pneumophila effectors.

      We have new data showing that actin binding does not affect Lfat1 enzymatic activity (see the figure in our response to Reviewer #1). We have added these data to the paper as Figure S7 and revised the discussion by adding the following paragraph.

      “The discovery of Lfat1 as an F-actin–binding lysine fatty acyl transferase raised the intriguing question of whether its enzymatic activity depends on F-actin binding. Recent studies have shown that other Legionella effectors, such as LnaB and Ceg14, use actin as a co-factor to regulate their activities. For instance, LnaB binds monomeric G-actin to enhance its phosphoryl-AMPylase activity toward phosphorylated residues, resulting in unique ADPylation modifications in host proteins (Fu et al, 2024; Wang et al, 2024). Similarly, Ceg14 is activated by host actin to convert ATP and dATP into adenosine and deoxyadenosine monophosphate, thereby modulating ATP levels in L. pneumophila–infected cells (He et al, 2025). However, this does not appear to be the case for Lfat1. We found that Lfat1 mutants defective in F-actin binding retained the ability to modify host small GTPases when expressed in cells (Figure S7). These findings suggest that, rather than serving as a co-factor, F-actin may serve to localize Lfat1 via its actin-binding domain (ABD), thereby confining its activity to regions enriched in F-actin and enabling spatial specificity in the modification of host targets.”

      (2) The development of the ABD of Lfat1 as an F-actin probe is a nice extension of the biochemical and structural experiments. The authors need to compare the new probe to commonly used ones, such as Lifeact, in labeling the actin cytoskeleton.

      We fully agree with the reviewer’s insightful suggestion. However, a direct comparison of the Lfat1 ABD domain with commonly used actin probes such as Lifeact, as well as evaluation of the split α-helix probe (as suggested by Reviewer #1), would require extensive and technically demanding experiments. These are important directions that we plan to pursue in future studies.

    1. R0:

      Reviewer #1: Manuscript as reviewed meets PLOS Global Public Health publication requirements, the author(s) clearly presented the study background, methods, results, discussions and conclusion. My comments and revision request are minor formatting and suggested input. No ethics concerns at this time.

      Reviewer #2: This is a well-written paper with clear methodology. From the perspective of data science applied to public health, this manuscript does a great job of clearly discussing and defining its methodology, which are all the current best practices. Correcting for class imbalance was a good choice, given the low prevalence of EC in the survey population. The use of SMOTE on the training set only ensured minimal data leakage, and is the current best practice. Using such a large variety of machine learning models creates a challenge in describing each model well enough within one manuscript, and the author did a good job of balancing that challenge.

      I only have a few minor suggestions to clarify the methodology of the manuscript:

      Please specify upfront how many observations were used in training and testing, and specify how many positive EC outcomes were included in the testing set. With such a low prevalence of a positive outcome in a relatively small set of observations, it is worth mentioning that there are perhaps only 10-20 positive outcomes being predicted in the test set. In the absence of weighting, it may be that characteristics of those few positive outcomes in test set are biasing the predictors, and this is worth mentioning.

      Please discuss how the initial 38 variables were selected from the survey. If there was an initial expert judgment on inclusion into the variable set for feature selection, that should be mentioned.

      Cluster design was mentioned in the PMA survey. This indicates that the survey includes survey weights of some kind. Please discuss whether those weights were addressed in the machine learning methods, or defend why they were not included in the model design. Survey weights can be included in machine learning models to make the predictors more representative of the population of interest.

      In the discussion, please discuss the impact of low precision, where there were many false positives compared to true positives. While it is mentioned, there are consequences (e.g., loss of trust) for low precision prediction models in public health, and this characteristic of the findings could be discussed more.

      Consider including a SHAP dependence plot, because potential interactions are discussed (e.g., knowledge and ad exposure) without showing evidence. A SHAP dependence plot could address this.

      Consider explicitly discussing the limitation of cross-sectional survey data used for prediction, where proxies were used in place of quantitative evidence (e.g., exposure to ads to proxy perceptions).

      Overall, great work, timely, and well constructed.

      Reviewer #3: See the attached Word document with a clear table.

      Manuscript Number: PGPH-D-25-01837 Review report

      This manuscript demonstrates a significant strength in its application of advanced machine learning and Explainable AI (XAI) to address the critical public health challenge of low emergency contraceptive (EC) use in Ethiopia. By rigorously testing multiple models and using SMOTE to handle severe class imbalance, it identifies key modifiable predictors, primarily EC awareness and media exposure, rather than static socioeconomic factors. The use of SHAP values transforms complex model outputs into actionable insights, revealing that knowledge gaps are the primary barrier. This approach provides a powerful, data-driven blueprint for designing targeted interventions, such as tailored media campaigns and improved health counselling, to effectively increase EC uptake and reduce unintended pregnancies. However, the following points may need to be considered in order to improve the quality of the paper.

      | Topic/subtopic | Issue | Suggestions |
      | --- | --- | --- |
      | Title | "Predicting Utilization of Emergency Contraceptive Usage in Ethiopia and Identifying Its Predictors Using Machine Learning" – Redundancy: "Utilization" and "Usage" mean the same thing. | Predicting the Utilization of Emergency Contraception in Ethiopia and Identifying Its Predictors Using Machine Learning. |
      | Affiliation | Inconsistent institution name: page 1 says "College of Medicine Health Science" while the first page of the manuscript says "College of Health Science". | Use a consistent affiliation name. |
      | Abstract | "Traditional analyses have struggled to identify complex predictors." For flow, consider: | Traditional statistical analyses have struggled to… |
      | Abstract | "with SMOTE used to address class imbalance" – Grammar: This is a dependent clause. It should be connected to the previous sentence. | ..., and the SMOTE was used to address class imbalance. |
      | Abstract | "Findings highlight that knowledge gaps, not poverty or access, are key barriers to EC use." – Clarity: "access" is vague. Be more specific. | ...not poverty or physical access barriers, are key. |
      | Introduction | Page 3: "moderate’s" | Change to "moderates" ("the way the education level moderate’s religion-based stigma"). |
      | Introduction | "drives excessive maternal mortality rates of over 500 deaths per 100,000 live births, drives poverty cycles, constrains girls' and women's educational and economic opportunities, and overwhelms poor healthcare infrastructures." – The word "drives" is used twice in close succession. | ...contributes to high maternal mortality rates of over 500 deaths per 100,000 live births, perpetuates cycles of poverty, constrains... |
      | Introduction | "is a central preventive intervention" | is a crucial preventive intervention |
      | Introduction | "the use of EC remains embarrassingly low" – "Embarrassingly" is subjective and informal. | ...remains critically low. |
      | Introduction | "tempts women to shun services" – Poor word choice. | ...pressures women to shun services. |
      | Introduction | "woefully underserved" – Informal. | ...significantly underserved. |
      | Introduction | "yield the predictive resolution necessary" – "Resolution" is unusual in this context. | ...yield the predictive accuracy necessary |
      | Introduction | "vastness tests for fairness" – Phrase is unclear and likely an error. | Correct the phrase for clarity. |
      | Methods | Data Source & Inclusion Criteria: The criteria for selecting the 2,334 women from the larger PMA sample of 8,943 are not explicitly stated. Was it a complete case analysis? This needs clarification as it affects the generalizability of the findings. | Clarify whether sampling was done or whether it was a complete case analysis. |
      | Methods | "The dataset demonstrates low overall missing data prevalence" – "Prevalence" is for disease outbreaks. | The missing data were minimal overall; |
      | Methods | "offering robust classifier building while preserving real performance measurement." | ...facilitating the development of robust classifiers while preserving a realistic assessment of performance. |
      | Results | "nailing 17 true positives" – Informal word choice. | ...correctly identifying 17 true positives... |
      | Results | "It manages this recall strength at the expense of precision, though, which sits at approximately 11%." – "Sits at" is informal. | It achieves this high recall at the expense of precision, which was approximately 11%. |
      | Results | "The most influential positive feature was “heard_emergency”, indicating awareness of emergency services has the greatest influence..." – Add "which". | The most influential positive feature was “heard_emergency”, which indicates that awareness of emergency contraception has the greatest influence... |
      | Results | "This resonates with core assumptions of health behavior theories like the Health Belief Model, which posit perceived knowledge as a harbinger of action." – "Harbinger" is misused. | ...which posit knowledge as a prerequisite for action. |
      | Results | Page 18: "radio-implemented" | Change to "radio-delivered" or "radio-based". |
      | Results | "Even positive, this reflects continued systemic disincentives documented elsewhere" – Unclear; "Even" is not the correct word. | Although positively associated, this factor reflects... |
      | Results | "all the sources of blunting the effect of being in contact with the health system." – Grammatically incorrect and unclear. | ...all of which blunt the effect of health system contact. |
      | Results | "One of the thoughtful discoveries of SHAP values was the sizeable negative impact" – "Thoughtful" is incorrect. | A notable discovery from the SHAP analysis was... |
      | Results | "Isolated use of SMOTE in the training set" – "Isolated" is the wrong word. | Applying SMOTE exclusively to the training set |
      | Results | "It shifted the ML model from being a prediction device to an analysis tool, not just deciding which features were significant, but the size and sign of their effects, and significantly, potential interactions" – Not clear because of the non-parallel verbs. | It transformed the ML model from a prediction device into an analytical tool, revealing not only which features were significant but also the magnitude and direction of their effects, as well as potential interactions. |
      | Results | "Simulation by counterfactual SHAP analysis suggests a hypothetical 30% increase in EC knowledge might boost utilization by approximately 12.7%, a valuable public health gain." – The sentence needs clearer explanation. | Counterfactual simulation using SHAP values (e.g., calculating the mean impact of increasing the "heard_emergency" feature value) suggested that a 30% increase in EC knowledge could potentially increase utilization by approximately 12.7%, representing a valuable public health gain. |
      | Results | "Geographic ML modeling over the geographic data would also potentially be able to further optimize resource deployment" – Repetition: "Geographic" used twice. | Rewrite the sentence for clarity. |
      | Results | "the implied vulnerability evidenced by the 'forced pregnancy' variable (despite missing data concerns) underscore" – Subject-verb disagreement. | Use "underscores". |
      | Methods | Model Selection Justification: The list of eight algorithms is comprehensive, but justification for simpler models like Naive Bayes is weak. | Justify the inclusion of Naïve Bayes. Was it possibly included as a benchmark? |
      | Methods | Evaluation Metrics: AUC-ROC emphasized, but for imbalanced problems F1-Score or Precision-Recall AUC may be better. | Also consider using F1-Score or Precision, as the data are not balanced, or justify the use of AUC-ROC. |
      | Methods | Model Performance Presentation: Logistic Regression focus unclear since Gradient Boosting achieved higher AUC-ROC (0.85). | Consider Gradient Boosting, as it achieved AUC-ROC 0.85, or explain the rationale (e.g., performance vs. interpretability). |
      | Results | Confusion Matrix Analysis (Figure 3): The analysis states precision is "approximately 11%." Based on the described confusion matrix (TP=17, FP=138), precision is 17 / (17+138) = 11.0%. This is a critical weakness of the model that deserves more emphasis. It means ~89% of the people predicted to be EC users were actually non-users. This has huge implications for the cost and efficiency of any intervention based on this model. | Discuss this trade-off explicitly: "The model's high recall (85%) comes at the cost of low precision (11%), resulting in a high false positive rate. This suggests the model is well-suited as a screening tool where identifying most true cases is prioritized over resource efficiency, but would require secondary screening or low-cost interventions to target the large number of false positives." |
      | Discussion | Addressing Limitations More Forcefully: Underreporting of EC likely a major issue. | Add: "A key limitation is the potential for significant underreporting of EC use due to social desirability bias and stigma..." |
      | Conclusion | "myth-busting" – Word choice is informal. | myth-dispelling |
      | Conclusion | "stock guarantees of EC" – Not clear. | Consider writing as "guaranteed EC stock availability". |
      | Conclusion | "This research provides an ethical and evidence-based blueprint to accelerate gains in reducing maternal mortality and advancing reproductive autonomy in Ethiopia and similar settings." – Awkward phrasing. | Consider rephrasing as "...blueprint to reduce maternal mortality and advance..." |

      Reviewer #4: This manuscript applies machine learning (ML) and explainable AI (XAI) methods to predict emergency contraceptive (EC) use among women in Ethiopia, using data from the 2023 PMA survey. The authors compare eight algorithms, address severe class imbalance with SMOTE, and use SHAP values to interpret predictors. They find that awareness of EC is the strongest predictor, followed by media exposure and health facility discussions, while demographic variables show limited predictive value.

      However, the results as currently presented are unreliable. Major inconsistencies in reported performance metrics (e.g., contradictory precision values, implausible Naive Bayes results, inflated accuracy) call into question the validity of the analyses. In addition, the small number of EC users makes the modeling unstable, and subgroup analyses are not feasible with this dataset. These issues, combined with over-interpretation of SHAP as causal, limit both the methodological credibility and substantive contribution of the paper.

      Contradictory precision results The performance metrics are inconsistent. Table 4 shows Logistic Regression with SMOTE achieving precision = 0.72 and recall = 0.85, yet the confusion matrix description reports precision at only ~11%. These cannot both be correct. This discrepancy raises questions about the accuracy of the reported results and must be clarified.

      Inflated accuracy The reported accuracy of 0.95 for Logistic Regression with SMOTE appears implausibly high given the extreme class imbalance (4.4% EC use). Accuracy is not an informative measure in this context, and such values raise concerns about potential data leakage or overly optimistic validation. The authors should confirm that the outcome variable or proxy features were not inadvertently included in the predictors.

      Over-interpretation of SHAP The SHAP analysis is framed in causal terms (e.g., a 30% increase in knowledge leading to a 12.7% increase in use). SHAP values describe associations within the model, not causal effects. The manuscript should temper these statements and present SHAP findings as indicators of relative predictive importance, not intervention outcomes.

      Implausible Naive Bayes results Naive Bayes is reported as having accuracy of only 0.06 pre-SMOTE. Given that 95% of the sample did not use EC, even a trivial majority-class classifier would achieve ~95% accuracy. Such a result suggests an error in coding or reporting that must be checked.
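      The baseline invoked in this comment can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 4.4% EC-use prevalence quoted in these reviews, and the function name `majority_baseline_accuracy` is hypothetical, not taken from the manuscript.

```python
def majority_baseline_accuracy(prevalence: float) -> float:
    """Accuracy of a trivial classifier that always predicts the more common class."""
    return max(prevalence, 1.0 - prevalence)


# Assumption: 4.4% of respondents used EC, as quoted in the reviews.
ec_prevalence = 0.044
baseline = majority_baseline_accuracy(ec_prevalence)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

      Read against this reference point, a reported accuracy of 0.06 sits far below what a trivial classifier would achieve, and a reported accuracy of about 0.95 is roughly on par with it.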

      Small minority class vs. model complexity Only 103 EC users were present in the dataset. Training and tuning eight algorithms with hyperparameter searches on such a small minority class risks overfitting and unstable results, even with SMOTE. This limitation should be acknowledged explicitly, with emphasis on the need for validation on independent samples.
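      To give a rough sense of how unstable estimates from so few positives can be, the sketch below derives the approximate number of EC users from the sample size (2,334) and prevalence (4.4%) quoted elsewhere in these reviews, then computes a Wilson score interval for recall. The 80/20 split and the count of 18 detected cases out of 21 are assumptions chosen to match a recall near 0.85; the helper `wilson_interval` is a hypothetical name, not the authors' code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    p = successes / n
    denom = 1.0 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n_total = 2334                         # sample size quoted in the reviews
prevalence = 0.044                     # EC-use prevalence quoted in the reviews
n_pos = round(n_total * prevalence)    # about 103 EC users overall
n_pos_test = round(n_pos * 0.20)       # about 21 EC users in a 20% test split

# Assumption: 18 of the 21 test positives are detected (recall near 0.85).
low, high = wilson_interval(18, n_pos_test)
print(f"EC users: {n_pos}, in test split: {n_pos_test}")
print(f"Approximate 95% interval for recall: {low:.2f} to {high:.2f}")
```

      Under these assumptions the interval spans roughly 0.65 to 0.95, which illustrates why a single train-test split on this dataset gives only a loose estimate of recall.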

      Subgroup analysis claims The manuscript claims fairness testing across subgroups (rural/urban, religion, age), but no results are presented. With so few EC users, subgroup analyses would be underpowered and unreliable. It would be more appropriate to note this limitation rather than imply subgroup robustness.

      Causality Issue The manuscript repeatedly interprets predictive associations as though they were causal effects. For example, SHAP values are used to suggest that increasing knowledge by 30% would increase EC use by 12.7%. Since the data are cross-sectional and observational, such statements are not justified. Machine learning models in this setting can identify predictive patterns, but they cannot establish causal relationships between predictors and outcomes. This overreach is particularly concerning because it could mislead policymakers or practitioners into believing the study provides evidence of causal effects.

      Reviewer #5: Summary

      This study investigates the underuse of emergency contraception in Ethiopia using a machine learning framework. Strengths include the application of multiple algorithms, careful handling of class imbalance, and the use of Explainable AI to interpret model outputs. The paper is generally well-structured, and the methodological workflow is presented clearly. At the same time, the results are presented in a way that overstates the model’s practical utility while giving insufficient attention to the precision–recall trade-off. The manuscript should be revised to consistently acknowledge the low precision across the abstract, results, and discussion, and to provide a clear justification for the relevance of a high-recall, low-precision model in this public health context. The limitation posed by the small number of positive cases in the validation set should also be explicitly discussed. Addressing these points is necessary to strengthen the scientific validity of the work.

      Specific comments

      1. Title: It should be shortened to remove redundancy, since "Utilization" and "Usage" mean the same thing.

      2. Abstract: I think something key was missed. The authors state a recall of 0.85 without mentioning the precision. Figure 3 (page 20) shows that the precision is approximately 11%. My understanding of this is that for every 100 women the model flags as likely EC non-users who need intervention, 89 of them are false alarms. An abstract must present a balanced view of performance.

      3. Methods (About the data): A sample size of 2,334 with a 4.4% prevalence means you only have ~103 positive cases (EC users). After an 80/20 train-test split, your test set contains only ~21 positive cases. This number is critically small and raises serious questions about the stability and generalizability of your reported performance metrics. A different random split could yield vastly different results. I suggest that such a major limitation be addressed upfront in the limitations section and acknowledged in the methods section.

      4. Data balancing: I like the write-up of this section.

      5. Evaluation Metrics: The text states the test set has 18.7% EC users, but the abstract and data balancing section state the overall prevalence is 4.4%. Please clarify this discrepancy. Is 18.7% a typo? Or did the stratified split result in a test set with a much higher prevalence than the overall dataset? This needs to be consistent. Could you also add the precision-recall plots, since you state that they were tracked?

      6. Results: In Table 4, the columns are "F1" and "Score". This seems like a typo. It should likely be a single column: "F1 Score". Please correct. Lastly, I think it would be good to acknowledge the weaknesses of SMOTE.

      Reviewer #6: The title of the article is: Predicting Utilization of Emergency Contraceptive Usage in Ethiopia and Identifying Its Predictors Using Machine Learning. The author explains that traditional analyses have struggled to identify complex predictors and therefore they used machine learning (ML) and Explainable AI (XAI) to improve the prediction and interpretability of Emergency Contraceptive (EC) use. The paper can be published with the following corrections, some of which are extremely important, in particular from a methodological perspective.
      Category / Authors' Contribution / Comments

      Objectives The primary objectives are twofold:

      one, to predict the likelihood of EC use with far greater accuracy than conventional regression techniques;

      two, to identify the key modifiable socio-behavioural predictors e.g., self-efficacy, mass media exposure, provider perception, and women's autonomy through XAI methods like SHAP values to yield interpretability and actionable insights.

      Comment: The first objective can be modified. "Far greater" is a vague statement. Accuracy is a criterion for choosing between models; the objective should instead focus on why conventional regression techniques are a problem in this study.

      The second objective reads like the motivation for the study and should be written as a clear sentence. Identifying predictors "to yield interpretability and actionable insights" is subjective. These objectives seem ambiguous.

      Methodological view Page 5: Methodologically, it represents a new contribution by rigorously testing the performance of eight alternative ML classifiers and developing an optimized analytical pipeline specifically designed to handle skewed healthcare datasets prevalent in rare outcomes like EC use

      Theoretically, it applies the Socio-Ecological Model (SEM) framework to hierarchically analyze predictors at levels of individual (knowledge, attitudes), interpersonal (partner communication, family influence), community (stigma norms, access), and policy (health system factors) providing an integrated explanation for the interrelating influences on EC behavior.

      Comment: This is not a methodological contribution.

      Moreover, the author mentions a theoretical contribution. However, it is just an exploration of the data.

      Methodology Page 4: In contrast to conventional statistical approaches, ML algorithms, such as random forests, gradient boosting machines (e.g., XGBoost), and neural networks, can particularly identify complex, high-dimensional patterns within diverse data sets, properly manage missing data, and produce personalized risk predictions with improved accuracy.

      Comment: The author mentions conventional statistical techniques several times. However, the report presents only the performance of the ML models. My suggestion is to first run the analysis using traditional or conventional methods and then compare the results with the ML techniques. This is very important.

      Outcome Variable Page 8: The outcome of interest is EC Usage, a binary measure of whether emergency contraception was used in the last 12 months. This is the dependent variable for analysis.

      Comment: Redundant, as the outcome of interest was already stated at the beginning.

      Missing data For handling missingness in our data, a stratified approach based on missingness mechanisms and rates was followed and so on...

      Comment: The author used many approaches and it is difficult to keep track of them. It would be better to explain the process step by step, with the pros and cons of each step, and to explain why this approach is best for this study.

      Variables Page 12

      Comment: There are many categories under a single variable, and some categories have very few observations. Justify the necessity of this. Cross-tabulation results with p-values could also be reported.

      Research Gap Page 19: The research goes beyond the correlational limitations of previous studies by utilizing predictive analytics to identify the modifiable factors and approximate their hypothetical effects.

      Comment: What do you mean by "correlational limitations"? Moreover, throughout the report, previous studies are not compared with the authors' current approach. So add some recent references and explain the research gap. Machine learning techniques are not new, so it is necessary to explain how they contribute novelty to your study. Throughout the report, sentences lack synchronization and coherence. Moreover, spacing is not maintained in the references, table titles, etc.

      Abstract 1. SMOTE and SHAP 2. Conversely, recent reproductive events such as unintended pregnancy were linked to non-use. Static demographic factors showed poor predictive value. Findings highlight that knowledge gaps, not poverty or access, are key barriers to EC use. Tailored media campaigns and routine health counseling could enhance EC uptake. ML and XAI offer powerful tools for guiding targeted reproductive health interventions.

      Comment: 1. The abstract does not explain what these are. 2. The message of these sentences is not coherent. I think the author should have the whole paper checked by a native English-speaking reviewer.

      R1:

      Reviewer #4: I appreciate the authors' thoughtful revisions and detailed responses. Several of my earlier comments were addressed—specifically, the correction of Naive Bayes reporting errors, improved acknowledgment of sample size limitations, and removal of unsupported subgroup analyses. These are welcome improvements. However, key concerns about the internal consistency of results, causal interpretation of SHAP analyses, and overextension of policy recommendations remain unresolved.

      First, while the outdated "11% precision" text has been removed, the confusion matrix values (TP=102, FP=180, FN=18) still do not correspond to the reported performance metrics. With these numbers, precision would equal roughly 0.36, not the 0.72 cited in Table 4. This suggests an ongoing internal inconsistency between the descriptive counts and the summary metrics. The lack of alignment raises continuing doubts about the reliability of the reported model performance.
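      The mismatch described here can be reproduced directly from the quoted counts. The following is a minimal sketch using only the confusion-matrix values cited in this comment (TP=102, FP=180, FN=18); the function name `precision_recall_f1` is illustrative and not part of the manuscript.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are true positives
    recall = tp / (tp + fn)      # share of actual positives that are detected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts as quoted in the reviewer's comment.
precision, recall, f1 = precision_recall_f1(tp=102, fp=180, fn=18)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.36 recall=0.85 f1=0.51
```

      The recall agrees with the 0.85 reported for the model, while the precision implied by the counts is about half of the 0.72 cited from Table 4.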

      Second, the manuscript still places heavy emphasis on accuracy values approaching 0.92–0.95 despite a highly imbalanced outcome (4.4% EC use). Although the authors state that AUC-ROC and recall were prioritized, the presentation continues to foreground accuracy, which is misleading in this context. No calibration or uncertainty measures (e.g., Brier score, calibration curve) have been added, leaving the reader without a sense of how well the predicted probabilities reflect actual risk.
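      As an illustration of the kind of calibration measure requested here, the sketch below computes a Brier score by hand and compares it with the score of a constant forecast equal to the base rate. The data are synthetic (1,000 records with 44 positives, mirroring the 4.4% prevalence), and `brier_score` is a hypothetical helper rather than the authors' pipeline.

```python
def brier_score(y_true: list[int], y_prob: list[float]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Synthetic labels: 44 positives among 1,000 records (4.4% prevalence).
labels = [1] * 44 + [0] * 956

# Reference forecast: always predict the base rate.
base_rate = sum(labels) / len(labels)
reference = brier_score(labels, [base_rate] * len(labels))
print(f"Brier score of the base-rate forecast: {reference:.4f}")
```

      A constant base-rate forecast scores p(1 - p), here about 0.042, so a model's Brier score would need to fall below that value to show that its predicted probabilities carry information beyond the prevalence alone.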

      Third, although the authors softened their language, the interpretation of SHAP values remains quasi-causal. The new statement—"counterfactual simulation using SHAP values … suggested that a 30% increase in EC knowledge could potentially increase utilization by approximately 12.7%", still presents SHAP outputs as if they represent real-world intervention effects. SHAP analysis identifies predictive associations within a model; it does not estimate the causal impact of changing a feature in the population. Likewise, subsequent phrases such as “integrating a predictive risk-scoring tool can help identify women at high risk” and “geographic machine learning modeling can optimize resource deployment” continue to frame the model as a validated operational tool. These remain prescriptive policy claims that move beyond what a cross-sectional, unvalidated predictive study can substantiate.

      Finally, while the tone of the manuscript has improved, the discussion still reads as policy advocacy rather than analytical interpretation. Phrases like "representing a valuable public health gain" and "can help optimize resource deployment" give the impression of proven effectiveness rather than exploratory modeling. A clearer distinction between predictive insights and causal or operational evidence is necessary for the study to maintain methodological integrity.

    1. Thank you for submitting this paper. I think the paper requires substantial, major revisions to be published. Throughout the paper I noted many instances where references or examples would help make the intent clear. I also think the message of the paper would benefit from several figures to demonstrate workflows or ideas. The figures presented are essentially tables, and I think the message could be made clearer for the reader if they were presented as flow charts or at least with clear numbering to hook the ideas to the reader - e.g., Figures 1 & 2 would benefit from having numbers on the key ideas.

      The paper is lacking many instances of citation, and at times reads as though it is an essay delivering an opinion. I'm not sure if this is the type of article that the journal would like, but two examples of sentences missing citations are:

      1. "Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection." (Introduction, page 2)

      2. "A large number of examples cited in this context involves faulty software or inappropriate use of software" (Introduction, page 3)

      Two examples of sentences missing examples are:

      1. Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete (in Mature vs. experimental software, page 7). Could the author provide more examples of what "experimental software" is? There is also consistent use of universal terms like "...is rarely up to date or complete", which would be better phrased as "is often not up to date or complete"

      2. There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification.

      Overall the paper introduces many new concepts, and I think it would greatly benefit from being made shorter and more concise, and from adding some key figures that the reader can refer back to in order to understand these new ideas. The paper is well written, and it is clear the author is a great writer who has put a lot of thought into the ideas. However, it is my opinion that because these ideas are so big and require so much unpacking, they are also harder to understand. The reader would benefit from having more guidance to come back to when working through these ideas.

      I hope this review is helpful to the author.

      Review comments

      Introduction

      Highlight [page 2]: Ever since the beginnings of organized science in the 17th century, researchers are expected to put all facts supporting their conclusions on the table, and allow their peers to inspect them for accuracy, pertinence, completeness, and bias. Since the 1950s, critical inspection has become an integral part of the publication process in the form of peer review, which is still widely regarded as a key criterion for trustworthy results.

      • and Note [page 2]: Both of these statements feel like they should be supported by a peer-reviewed source or other reference, I believe. What were the beginnings of organised science in the 1600s? Why since the 1950s? Why not sooner? What happened then?

      Highlight [page 2]: Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection.

      Highlight [page 2]: In the quantitative sciences, almost all of today’s research critically relies on computational techniques, even when they are not the primary tool for investigation

      • and Note [page 2]: Again, it does feel like it would be great to acknowledge research in this space.

      Highlight [page 2]: But then, scientists mostly abandoned doubting.

      • and Note [page 2]: This feels like an essay. Can you show me the evidence that lets you say something like this?

      Highlight [page 2]: Automation bias

      • and Note [page 2]: What is automation bias?

      Highlight [page 3]: A large number of examples cited in this context involves faulty software or inappropriate use of software

      • and Note [page 3]: Can you provide some examples of the examples cited that you are referring to here?

      Highlight [page 3]: A particularly frequent issue is the inappropriate use of statistical inference techniques.

      • and Note [page 3]: Please provide citations to these frequent issues.

      Highlight [page 3]: The Open Science movement has made a first step towards dealing with automated reasoning in insisting on the necessity to publish scientific software, and ideally making the full development process transparent by the adoption of Open Source practices

      • and Note [page 3]: Could you provide an example of one of these Open Science movements?

      Highlight [page 3]: Almost no scientific software is subjected to independent review today.

      • and Note [page 3]: How can you justify this claim?

      Highlight [page 3]: In fact, we do not even have established processes for performing such reviews

      Highlight [page 3]: as I will show

      • and Note [page 3]: How will you show this?

      Highlight [page 3]: is as much a source of mistakes as defects in the software itself

      • and Note [page 3]: Again, this feels like a statement of fact without evidence or citation.

      Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.

      • and Note [page 3]: The same can be said of assumptions for equations and mathematics - the problem here is dealing with abstraction of complexity and the potential unintended consequences.

      Highlight [page 4]: the preservation of epistemic diversity

      • and Note [page 4]: Please define epistemic diversity
      Reviewability of automated reasoning systems

      Highlight [page 5]: The five dimensions of scientific software that influence its reviewability.

      • and Note [page 5]: It might be clearer to number these in the figure, and also I might suggest changing the “convivial” - it’s a pretty unusual word?
      Wide-spectrum vs. situated software

      Highlight [page 6]: In between these extremes, we have in particular domain libraries and tools, which play a very important role in computational science, i.e. in studies where computational techniques are the principal means of investigation

      • and Note [page 6]: I’m not very clear on this example - can you provide an example of a “domain library” or “domain tool” ?

      Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.

      • and Note [page 6]: I’m not sure I agree it is always smaller and simpler - the custom code for a new method could be incredibly complicated.

      Highlight [page 6]: Domain tools and libraries

      • and Note [page 6]: Can you give an example of this?
      Mature vs. experimental software

      Highlight [page 7]: Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete

      • and Note [page 7]: Could the author provide more examples of what “experimental software” is? There is also consistent use of universal terms like “…is rarely up to date or complete”, which would be better phrased as “is often not up to date or complete”

      Highlight [page 7]: An extreme case of experimental software is machine learning models that are constantly updated with new training data.

      • and Note [page 7]: Such as…

      Highlight [page 7]: interlocutor

      • and Note [page 7]: suggest “middle man” or “mediator”, ‘interlocutor’ isn’t a very common word

      Highlight [page 7]: A grey zone

      • and Note [page 7]: I think it would be helpful to discuss black and white zones before this.

      Highlight [page 7]: The libraries of the scientific Python ecosystem

      • and Note [page 7]: Do you mean SciPy? https://scipy.org/. Can you provide an example of the frequent changes that break backward compatibility?

      Highlight [page 7]: too late that some of their critical dependencies are not as mature as they seemed to be

      • and Note [page 7]: Again, can you provide some evidence for this?

      Highlight [page 7]: The main difference in practice is the widespread use of experimental software by unsuspecting scientists who believe it to be mature, whereas users of instrument prototypes are usually well aware of the experimental status of their equipment.

      • and Note [page 7]: Again this feels like an assertion without evidence. Is this an essay, or a research paper?
      Convivial vs. proprietary software

      Highlight [page 8]: Convivial software [Kell 2020], named in reference to Ivan Illich’s book “Tools for conviviality” [Illich 1973], is software that aims at augmenting its users’ agency over their computation

      • and Note [page 8]: It would be really helpful if the author would define the word “convivial” here. It would also be very useful if they went on to give an example of what they meant by: “…software that aims at augmenting its users’ agency over their computation.” How does it augment the users’ agency?

      Highlight [page 8]: Shaw recently proposed the less pejorative term vernacular developers [Shaw 2022]

      • and Note [page 8]: Could you provide an example of what makes “vernacular developers” different, or just what they mean by this term?

      Highlight [page 8]: which Illich has described in detail

      • and Note [page 8]: Should this have a citation to Illich then in this sentence?

      Highlight [page 8]: what has happened with computing technology for the general public

      • and Note [page 8]: Can you give an example of this. Do you mean the rise of Apple and Windows? MS Word? Facebook? A couple of examples would be really useful to make this point clear.

      Highlight [page 8]: tech corporations

      • and Note [page 8]: Suggest “tech corporations” be “technology corporations”.

      Highlight [page 8]: Some research communities have fallen into this trap as well, by adopting proprietary tools such as MATLAB as a foundation for their computational tools and models.

      • and Note [page 8]: Can you provide an example of the alternative here, what would be the way to avoid this trap - use software such as Octave, or?

      Highlight [page 8]: Historically, the Free Software movement was born in a universe of convivial technology.

      • and Note [page 8]: If it is historic, can you please provide a reference to this?

      Highlight [page 8]: most of the software they produced and used was placed in the public domain

      • and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.

      Highlight [page 8]: as they saw legal constraints as the main obstacle to preserving conviviality

      • and Note [page 8]: Again, these are conjectures that are lacking a reference or example, can you provide some examples of references of this?

      Highlight [page 9]: Software complexity has led to a creeping loss of user agency, to the point that even building and installing Open Source software from its source code is often no longer accessible to non-experts, making them dependent not only on the development communities, but also on packaging experts. An experience report on building the popular machine learning library PyTorch from source code nicely illustrates this point [Courtès 2021].

      • and Note [page 9]: Can you summarise what makes it difficult to install Open Source Software? Again, this statement feels like it is making a strong generalisation without clear evidence to support this. The article by Courtès (https://hpc.guix.info/blog/2021/09/whats-in-a-package/), actually notes that it’s straightforward to install PyTorch via pip, but using an alternative package manager causes difficulty. The point you are making here seems to be that building and installing most open source software is almost prohibitive, but I don’t think you’ve given strong evidence for this claim, and I don’t understand how this builds into your overall argument.

      Highlight [page 9]: It survives mainly in communities whose technology has its roots in the 1980s, such as programming systems inheriting from Smalltalk (e.g. Squeak, Pharo, and Cuis), or the programmable text editor GNU Emacs.

      • and Note [page 9]: Can you give an example of how it survives in these communities?

      Highlight [page 9]: FLOSS has been rapidly gaining in popularity, and receives strong support from the Open Science movement

      • and Note [page 9]: Can you provide some evidence to back this statement up?

      Highlight [page 9]: the traditional values of scientific research.

      • and Note [page 9]: Can you state what you mean by “traditional values of scientific research”?

      Highlight [page 9]: always been convivial

      • and Note [page 9]: Can you provide a further explanation of what makes them convivial?
      Transparent vs. opaque software

      Highlight [page 9]: Transparent software

      • and Note [page 9]: It might be useful to explain a distinction between transparent and open software - or to perhaps open with a statement for why we are talking about transparent and opaque software.

      Highlight [page 9]: Large language models are an extreme example.

      • and Note [page 9]: Based on your definition of transparent software - every action produces a visible result. If I type something into an LLM and get an immediate and visible result, how is this different? It is possible you are stating that the behaviour is able to be easily interpreted, or perhaps the behaviour is easy to understand?

      Highlight [page 10]: Even highly interactive software, for example in data analysis, performs nonobvious computations, yielding output that an experienced user can perhaps judge for plausibility, but not for correctness.

      • and Note [page 10]: Could you give a small example of this?

      Highlight [page 10]: It is much easier to develop trust in transparent than in opaque software.

      • and Note [page 10]: Can you state why it is easier to develop this trust?

      Highlight [page 10]: but also less important

      • and Note [page 10]: Can you state why it is less important?

      Highlight [page 10]: even a very weak trustworthiness indicator such as popularity becomes sufficient

      • and Note [page 10]: becomes sufficient for what? Reviewing? Why does it become sufficient?

      Highlight [page 10]: This is currently a much discussed issue with machine learning models,

      • and Note [page 10]: Given it is currently much discussed, could you link to at least 2 research articles discussing this point?

      Highlight [page 10]: treated extensively in the philosophy of science.

      • and Note [page 10]: Given that it has been treated extensively, can you please provide some key references after this statement? You do go on to cite one paper, but it would be helpful to mention at least a few key articles.
      Size of the minimal execution environment

      Highlight [page 11]: The importance of this execution environment is not sufficiently appreciated by most researchers today, who tend to consider it a technical detail

      • and Note [page 11]: This statement is a bit of a sweeping generalisation - why is it not sufficiently appreciated? What evidence do you have of this?

      Highlight [page 11]: Software environments have only recently been recognized as highly relevant for automated reasoning in science and beyond

      • and Note [page 11]: Where have they been only recently recognised?

      Highlight [page 11]: However, they have not yet found their way into mainstream computational science.

      • and Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements using reproducible environments for R with NIX. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?
      Analogies in experimental and theoretical science

      Highlight [page 12]: Non-industrial components are occasionally made for special needs, but this is discouraged by their high manufacturing cost

      • and Note [page 12]: Can you provide an example of this?

      Highlight [page 12]: cables

      • and Note [page 12]: What do you mean by a cable? As in a computer cable? An electricity cable?

      Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.

      • and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of a domain expert or professional programmer/computer scientist identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.

      Highlight [page 13]: where “traditional” means not relying on any form of automated reasoning.

      • and Note [page 13]: Can you give an example of what a “traditional” scientific model or theory
      Improving the reviewability of automated reasoning systems

      Highlight [page 14]: Figure 2: Four measures that can be taken to make scientific software more trustworthy.

      • and Note [page 14]: Could the author perhaps instead call these “four measures” or perhaps give them a better name, and number them?
      Review the reviewable

      Highlight [page 14]: mature wide-spectrum software

      • and Note [page 14]: Can you give an example of what “mature wide-spectrum software” is?

      Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.

      Science vs. the software industry

      Highlight [page 15]: Many computers, operating systems, and compilers were designed specifically for the needs of scientists.

      • and Note [page 15]: Could you give an example of this? E.g., FORTRAN? COBOL?

      Highlight [page 15]: Today, scientists use mostly commodity hardware

      • and Note [page 15]: Can you explain what you mean by “commodity hardware”, and give an example?

      Highlight [page 15]: even considered advantageous if it also creates a barrier to reverse-engineering of the software by competitors

      • and Note [page 15]: Can you give an example of this?

      Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for

      • and Note [page 15]: What about software like SPSS/STATA/SAS - surely many many industries, and also researchers will pay for software like this that is considered mature?
      Emphasize situated and convivial software

      Highlight [page 16]: a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.

      • and Note [page 16]: Could you give an example of what this might look like practically? Are you saying things like SciPy would be restructured into many separate modules, or?

      Highlight [page 16]: In terms of FLOSS jargon, users make a partial fork of the project. Version control systems ensure provenance tracking and support the discovery of other forks. Keeping up to date with relevant forks of one’s software, and with the motivations for them, is part of everyday research work at the same level as keeping up to date with publications in one’s wider community. In fact, another way to describe this approach is full integration of scientific software development into established research practices, rather than keeping it a distinct activity governed by different rules.

      • and Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?

      Highlight [page 17]: a universe is very

      • and Note [page 17]: Perhaps this could be “would be very different” - since this doesn’t yet exist, right?

      Highlight [page 17]: Improvement thus happens by small-step evolution rather than by large-scale design. While this may look strange to anyone used to today’s software development practices, it is very similar to how scientific models and theories have evolved in the pre-digital era.

      • and Note [page 17]: I think some kind of schematic or workflow to compare existing practices to this new practice would be really useful to articulate these points. I also think this new method of development you are proposing should have a concrete name.

      Highlight [page 17]: Existing code refactoring tools can probably be adapted to support application-specific forks, for example via code specialization. But tools for working with the forks, i.e. discovering, exploring, and comparing code from multiple forks, are so far lacking. The ideal toolbox should support both forking and merging, where merging refers to creating consensual code versions from multiple forks. Such maintenance by consensus would probably be much slower than maintenance performed by a coordinated team.

      • and Note [page 17]: Perhaps an example or screenshot of a diff could be used to demonstrate that we can compare changes between two branches/commits, but that comparing multiple is challenging?
      Make scientific software explainable

      Highlight [page 18]: An interesting line of research in software engineering is exploring possibilities to make complete software systems explainable [Nierstrasz and Girba 2022]. Although motivated by situated business applications, the basic ideas should be transferable to scientific computing

      • and Note [page 18]: Is this similar to concepts such as “X-AI” or “X-ML” - that is, “Explainable” Artificial Intelligence or Machine Learning?

      Highlight [page 18]: Unlike traditional notebooks, Glamorous Toolkit [feenk.com 2023],

      • and Note [page 18]: It appears that you have introduced “Glamorous Toolkit” as an example of these three principles? It feels like it should be introduced earlier in this paragraph?

      Highlight [page 18]: In Glamorous Toolkit, whenever you look at some code, you can access corresponding examples (and also other references to the code) with a few mouse clicks

      • and Note [page 18]: I think it would be very beneficial to show screenshots of what the author means - while I can follow the link to Glamorous Toolkit, bitrot is a thing, and that might go away, so it would be good to see exactly what the author means when they discuss these examples.
      Use Digital Scientific Notations

      Highlight [page 18]: There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification

      • and Note [page 18]: Can you give an example of these techniques?

      Highlight [page 18]: The use of these tools is, for now, reserved to software that is critical for safety or security,

      • and Note [page 18]: Again, could you give an example of this point? Which tools, and which software is critical for safety or security?

      Highlight [page 19]: formal specifications

      • and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.

      Highlight [page 19]: All of them are much more elaborate than the specification of the result they produce. They are also rather opaque.

      • and Note [page 19]: It isn’t clear to me how these are opaque - if the algorithm is defined, it can be understood, how is it opaque?

      Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]

      • and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.

      Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.

      • and Note [page 19]: Is an example of this test-driven development?

      Highlight [page 19]: A formal specification has to evolve in the same way, and is best seen as the formalization of the scientific knowledge. Change can flow from specification to software, but also in the opposite direction.

      • and Note [page 19]: Again, I think a good figure here would be very helpful in articulating this clearly.

      Highlight [page 19]: My own experimental Digital Scientific Notation, Leibniz [Hinsen 2024], is intended to resemble traditional mathematical notation as used e.g. in physics. Its statements are embeddable into a narrative, such as a journal article, and it intentionally lacks typical programming language features such as scopes that do not exist in natural language, nor in mathematical notation.

      • and Note [page 19]: Could we see an example of what this might look like?
      Conclusion

      Highlight [page 20]: Situated software is easy to recognize.

      • and Note [page 20]: Could you provide some examples?

      Highlight [page 20]: Examples from the reproducibility crisis support this view

      • and Note [page 20]: Can you provide some example papers that you mention here?

      Highlight [page 21]: The ideal structure for a reliable scientific software stack would thus consist of a foundation of mature software, on top of which a transparent layer of situated software, such as a script, a notebook, or a workflow, orchestrates the computations that together answer a specific scientific question. Both layers of such a stack are reviewable, as I have explained in section 3.1, but adequate reviewing processes remain to be enacted.

      • and Note [page 21]: Again, I think it would be very insightful for the reader to have a clear figure to rest these ideas upon.

      Highlight [page 21]: has been neglected by research institutions all around the world

      • and Note [page 21]: I do not think this is true - could you instead say “neglected by most/many” perhaps?
    2. Dear editors and reviewers, Thank you for your careful reading of my manuscript and the detailed and insightful feedback. It has contributed significantly to the improvements in the revised version. Please find my detailed responses below.

      1 Reviewer 1

      Thank you for this helpful review, and in particular for pointing out the need for more references, illustrations, and examples in various places of my manuscript. In the case of the section on experimental software, the search for examples made clear to me that the label was in fact badly chosen. I have relabeled the dimension as “stable vs. evolving software”, and rewritten the section almost entirely. Another major change motivated by your feedback is the addition of a figure showing the structure of a typical scientific software stack (Fig. 2), and of three case studies (section 2.7) in which I evaluate scientific software packages according to my five dimensions of reviewability. The discussion of conviviality (section 2.4), a concept that is indeed not widely known yet, has been much expanded. I have followed the advice to add references in many places. I have been more hesitant to follow the requests for additional examples and illustrations, because of the inevitable conflict with the equally understandable request to make the paper more compact. In many cases, I have preferred to refer to examples discussed in the literature. A few comments deserve a more detailed reply:

      Introduction

      Highlight [page 3]: In fact, we do not even have established processes for performing such reviews

      and Note [page 3]: I disagree, there is the Journal of Open Source Software: https://joss.theoj.org/, rOpenSci has a guide for development of peer review of statistical software: https://github.com/ropensci/statistical-software-review-book, and also maintains a very clear process of software review: https://ropensci.org/software-review/

      As I say in the section “Review the reviewable”, these reviews are not an independent critical examination of the software as I define it. Reviewers are not asked to evaluate the software’s correctness or appropriateness for any specific purpose. They are expected to comment only on formal characteristics of the software publication process (e.g. “is there a license?”), and on a few software engineering quality indicators (“is there a test suite?”).

      Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.

      and Note [page 3]: The same can be said of assumptions for equations and mathematics - the problem here is dealing with abstraction of complexity and the potential unintended consequences.

      Indeed. That’s why we need someone other than the authors to go through mathematical reasoning and verify it. Which we do.

      Reviewability of automated reasoning systems

      Wide-spectrum vs. situated software

      Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.

      and Note [page 6]: I’m not sure I agree it is always smaller and simpler - the custom code for a new method could be incredibly complicated.

      The comparison is between situated software and more generic software performing the same operation. For example, a script reading one specific CSV file compared to a subroutine reading arbitrary CSV files. I have yet to see a case in which abstraction from a concrete to a generic function makes code smaller or simpler.
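      The contrast between a situated reader and a more generic one can be sketched in Python. This is an illustration only and is not taken from the manuscript: the data, the function names `read_my_measurements` and `read_csv_generic`, and the chosen parameters are all hypothetical. The point is merely that each assumption baked into the situated version reappears in the generic version as a parameter or a branch.

```python
import csv
import io

# Hypothetical data standing in for "one specific CSV file".
DATA = "time,temperature\n0,20.5\n1,21.0\n2,21.7\n"


def read_my_measurements(text):
    """Situated reader: assumes one header line, commas, and float-only columns."""
    rows = text.strip().split("\n")[1:]  # drop the header line
    return [tuple(float(x) for x in row.split(",")) for row in rows]


def read_csv_generic(text, delimiter=",", has_header=True,
                     converters=None, skip_blank=True):
    """More generic reader: each assumption above becomes a parameter or branch."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # csv.reader yields an empty list for a blank line
    rows = [r for r in reader if r or not skip_blank]
    header = rows[0] if has_header else None
    body = rows[1:] if has_header else rows
    if converters is None:
        # default: keep every field as a string
        converters = [str] * (len(body[0]) if body else 0)
    records = [tuple(conv(v) for conv, v in zip(converters, row)) for row in body]
    return header, records


situated = read_my_measurements(DATA)
header, generic = read_csv_generic(DATA, converters=[float, float])
```

      Both calls return the same records for this particular input, but a reviewer of the generic version also has to consider every combination of options that this use case never exercises.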

      Convivial vs. proprietary software

      Highlight [page 8]: most of the software they produced and used was placed in the public domain

      and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.

      Software distribution in science was well organized long before the Internet, it was just slower and more expensive. Both decks of punched cards and magnetic tapes were routinely sent by mail. The earliest organized software distribution for science I am aware of was the DECUS Software Library in the early 1960s.

      Size of the minimal execution environment

      Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements using reproducible environments for R with NIX. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?

      I have looked for quantitative studies on software use in science that would allow me to give a precise meaning to “mainstream”, but I have not been able to find any. Based on my personal experience, mostly with teaching MOOCs on computational science in which students are asked about the software they use, the most widely used platform is Microsoft Windows. Linux is already a minority platform (though overrepresented in computer science), and Nix users are again a small minority among Linux users.

      Analogies in experimental and theoretical science

      Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.

      and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of a domain expert or professional programmer/computer scientist identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.

      There are probably cases of microscopists not noticing defects, but my point is that if you ask them to look for defects, they know what to do (and I have made this clearer in my text). For contrast, take GROMACS (one of my case studies in the revised manuscript) and ask either an expert programmer or an experienced computational biophysicist if it correctly implements, say, the AMBER force field. They wouldn’t know what to do to answer that question, both because it is ill-defined (there is no precise definition of the AMBER force field) and because the number of possible mistakes and symptoms of mistakes is enormous. I have seen a protein simulation program fail for proteins whose number of atoms was in a narrow interval, defined by the size that a compiler attributed to a specific data structure. I was able to catch and track down this failure only because a result was obviously wrong for my use case. I have never heard of similar issues with microscopes.

      Improving the reviewability of automated reasoning systems

      Review the reviewable

      Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.

      and Note [page 15]: I disagree. Monash provides an example here where they view software as a first class research output: https://robjhyndman.com/files/EBS_research_software.pdf

      This example is about superficial reviews in the context of career evaluation. Other institutions have similar processes. As far as I know, none of them ask reviewers to look at the actual code and comment on its correctness or its suitability for some specific purpose.

      Science vs. the software industry

      Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for

      and Note [page 15]: What about software like SPSS/STATA/SAS - surely many many industries, and also researchers will pay for software like this that is considered mature?

      I could indeed extend the list of examples to include various industries. Compared to the huge number of individuals using PCs and smartphones, that’s still few customers.

      Emphasize situated and convivial software

      Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?

      I have decided to do the contrary: I have significantly shortened this section, removing all speculation about how the ideas could be turned into concrete technology. The reason is that I have been working on this topic since I wrote the reviewed version of this manuscript, and I have a lot more to say about it than would be reasonable to include in this work. This will become a separate article.

      Make scientific software explainable

      Note [page 18]: I think it would be very beneficial to show screenshots of what the author means - while I can follow the link to Glamorous Toolkit, bitrot is a thing, and that might go away, so it would be good to see exactly what the author means when they discuss these examples.

      Unfortunately, static screenshots can only convey a limited impression of Glamorous Toolkit, but I agree that they are a more stable support than the software itself. Rather than adding my own screenshots, I refer to a recent paper by the authors of Glamorous Toolkit that includes many screenshots for illustration.

      Use Digital Scientific Notations

      Highlight [page 19]: formal specifications

      and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.

      Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]

      and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.

      I do give an example: sorting a list. To write down an actual formalized version, I’d have to introduce a formal specification language and explain it, which I think goes well beyond the scope of this article. Illustrating modularity requires an even larger example. This is, however, an interesting challenge which I’d be happy to take up in a future article.

      Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.

      and Note [page 19]: Is an example of this test-driven development?

      Not exactly, though the underlying idea is similar: provide a condition that a result must satisfy as evidence for being correct. With testing, the condition is spelt out for one specific input. In a formal specification, the condition is written down for all possible inputs.
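      The difference between the two kinds of condition can be sketched for the list-sorting example. The Python code below is an illustration under stated assumptions and is not taken from the manuscript: `my_sort`, `satisfies_sort_spec`, and `check_spec_on_small_domain` are hypothetical names, and the exhaustive loop over a small finite domain only approximates what a formal verification tool would aim to establish for all inputs.

```python
from collections import Counter
from itertools import product


def my_sort(items):
    """Implementation under examination (here simply Python's built-in sort)."""
    return sorted(items)


def satisfies_sort_spec(inp, out):
    """Specification of sorting: the output is ordered and is a
    permutation (same elements, same multiplicities) of the input."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_elements = Counter(inp) == Counter(out)
    return ordered and same_elements


def test_one_input():
    """Test: the expected result is spelt out for one specific input."""
    return my_sort([3, 1, 2]) == [1, 2, 3]


def check_spec_on_small_domain(max_len=4, values=(0, 1, 2)):
    """Specification check: the same condition is applied to every input
    in a small finite domain (all lists up to max_len over `values`)."""
    for n in range(max_len + 1):
        for inp in product(values, repeat=n):
            if not satisfies_sort_spec(list(inp), my_sort(list(inp))):
                return False
    return True
```

      The test names its expected output explicitly, whereas the specification never mentions a concrete list: it only states the relation that must hold between any input and its output.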

      2 Reviewer 2

      First of all, I would like to thank the reviewer for this thoughtful review. It addresses many points that require clarifications in my article, which I hope to have done adequately in the revised version.

      One such point is the role and form of reviewing processes for software. I have made it clearer that I take “review” to mean “critical independent inspection”. It could be performed by the user of a piece of software, but the standard case should be a review performed by experts at the request of some institution that then publishes the reviewer’s findings. There is no notion of gatekeeping attached to such reviews. Users are free to ignore them. Given that today, we publish and use scientific software without any review at all, the risk of shifting to the opposite extreme of having reviewers become gatekeepers seems unlikely to me.

      Your comment on users being software developers addresses another important point that I had failed to make clear: conviviality is all about diminishing the distinction between developers and users. Users gain agency over their computations at the price of taking on more of a developer role. This is now stated explicitly in the revised article. Your hypothesis that I want scientific software to be convivial is only partially true. I want convivially structured software to be an option for scientists, with adequate infrastructure and tooling support, but I do not consider it to be the best approach for all scientific software.

      The paragraph on the relevance and importance of reviewing in your comment is a valid point of view but, unsurprisingly, not mine. In the grand scheme of science, no specific quality assurance measure is strictly necessary. There is always another layer above that will catch mistakes that weren’t detected in the layer below. It is thus unlikely that unreliable software will cause all of science to crumble. But from many perspectives, including overall efficiency, personal satisfaction of practitioners, and insight derived from the process, it is preferable to catch mistakes as closely as possible to their source. Pre-digital theoreticians have always double-checked their manual calculations before submitting their papers, rather than sending off unchecked results and counting on confrontation with experiment for finding mistakes. I believe that we should follow this same approach with software. The cost of mistakes can be quite high. Consider the story of the five retracted protein structures that I cite in my article (Miller, 2006, 10.1126/science.314.5807.1856). The five publications that were retracted involved years of work by researchers, reviewers, and editors. In between their publication and their retraction, other protein crystallographers saw their work rejected because it was in contradiction with the high-profile articles that later turned out to be wrong. The whole story has probably involved a few ruined careers in addition to its monetary cost. In contrast, independent critical examination of the software and the research processes in which it was used would likely have spotted the problem rather quickly (Matthews, 2007).

      You point out that reviewability is also a criterion in choosing software to build on, and I agree. Building on other people’s software requires trusting it. Incorporating it into one’s own work (the core principle of convivial software) requires understanding it. This is in fact what motivated my reflections on this topic. I am not much interested in neatly separating epistemic and practical issues. I am a practitioner, my interest in epistemology comes from a desire for improving practices.

      Review holism is something I have not thought about before. I consider it both impossible to apply in practice and of little practical value. What I am suggesting, and I hope to have made this clearer in my revision, is that reviewing must take into account the dependency graph. Reviewing software X requires a prior review of its dependencies (possibly already done by someone else), and a consideration of how each dependency influences the software under consideration. However, I do not consider Donoho’s “frictionless reproducibility” a sufficient basis for trust. It has the same problem as the widespread practice of tacitly assuming a piece of software to be correct because it is widely used. This reasoning is valid only if mistakes have a high chance of being noticed, and that’s in my experience not true for many kinds of research software. “It works”, when pronounced by a computational scientist, really means “There is no evidence that it doesn’t work”.
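      The dependency-aware reviewing order mentioned above can be sketched with Python's standard `graphlib` module. This is a minimal illustration, not a proposal from the manuscript: the package names in `DEPENDENCIES` and the helpers `review_order` and `reviewable_now` are hypothetical, and a real process would also have to weigh how each dependency influences the software under review, which no ordering captures.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each package maps to the set of
# packages it depends on directly.
DEPENDENCIES = {
    "analysis_script": {"simulation_lib", "plotting_lib"},
    "simulation_lib": {"array_lib"},
    "plotting_lib": {"array_lib"},
    "array_lib": set(),
}


def review_order(dependencies):
    """Return an order in which every package comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def reviewable_now(package, dependencies, already_reviewed):
    """A package can be reviewed once all its direct dependencies have been."""
    return dependencies[package] <= already_reviewed


order = review_order(DEPENDENCIES)
```

      Reviews already done by someone else simply enter as members of `already_reviewed`, so the work on a shared foundation such as the hypothetical `array_lib` is done once and reused.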

      This is also why I point out the chaotic nature of computation. It is not about Humphreys’ “strange errors”, for which I have no solution to offer. It is about the fact that looking for mistakes requires some prior idea of what the symptoms of a mistake might be. Experienced researchers do have such prior ideas for scientific instruments, and also e.g. for numerical algorithms. They come from an understanding of the instruments and their use, including in particular a knowledge of how they can go wrong. But once your substrate is a Turing-complete language, no such understanding is possible any more. Every programmer has had the experience of chasing down some bug that at first sight seems impossible. My long-term hope is that scientific computing will move towards domain-specific languages that are explicitly not Turing-complete, and offer useful guarantees in exchange. Unfortunately, I am not aware of any research in this space.

      I fully agree with you that internalist justifications are preferable to reliabilistic ones. But being fundamentally a pragmatist, I don’t care much about that distinction. Indisputable justification doesn’t really exist anywhere in science. I am fine with trust that has a solid basis, even if there remains a chance of failure. I’d already be happy if every researcher could answer the question “why do you trust your computational results?” in a way that shows signs of critical reflection.

      What I care about ultimately is improving practices in computational science. Over the last 30 years, I have seen numerous mistakes being discovered by chance, often leading to abandoned research projects. Some of these mistakes were due to software bugs, but the most common cause was an incorrect mental model of what the software does. I believe that the best technique we have found so far to spot mistakes in science is critical independent inspection. That’s why I am hoping to see it applied more widely to computation.

      2.1 References

      Miller, G. (2006) A Scientist’s Nightmare: Software Problem Leads to Five Retractions. Science 314, 1856. https://doi.org/10.1126/science.314.5807.1856

      Matthews, B.W. (2007) Five retracted structure reports: Inverted or incorrect? Protein Science 16, 1013. https://doi.org/10.1110/ps.072888607

      3 Editor

      Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum, which is usually from a badly specified model...

      That is an interesting observation I haven’t seen mentioned before. I agree that Bayesian inference is particularly amenable to inspection. One more reason to normalize inspection and inspectability in computational science.

      Some reflection on the growing use of AI to write software may be worthwhile.

      The use of AI in writing and reviewing software is a topic I have considered for this review, since the technology has evolved enormously since I wrote the current version of the manuscript. However, in view of reviewer 1’s constant admonition to back up statements with citations, I refrained from delving into this topic. We all know it’s happening, but it’s too early to observe a clear impact on research software. I have therefore limited myself to a short comment in the Conclusion section.

      I wondered if highly-used software should get more scrutiny.

      This is an interesting suggestion. If and when we get serious about reviewing code, resource allocation will become an important topic. For getting started, it’s probably more productive to review newly published code than heavily used code, because there is a better chance that authors actually act on the feedback and improve their code before it has many users. That in turn will help improve the reviewing process, which is what matters most right now, in my opinion.

      “supercomputers are rare”, should this be “relatively rare” or am I speaking from a privileged university where I’ve always had access to supercomputers.

      If you have easy access to a supercomputer, you should indeed consider yourself privileged. But did you ever use supercomputer time for reviewing someone else’s work? I have relatively easy access to supercomputers as well, but I do have to make a request and promise to do innovative research with the allocated resources.

      I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/)

      I hadn’t seen “testthat” before, not being much of a user of R. It looks interesting, and reminds me of similar test support features in Smalltalk which I found very helpful. Improving testing culture is definitely a valuable contribution to improving computational practices.

      Can badges on github about downloads and maturity help (page 7)?

      Badges can help, on GitHub or elsewhere, e.g. in scientific software catalogs. I see them as a coarse-grained output of reviewing. The right balance to find is between the visibility of a badge and the precision of a carefully written review report. One risk with badges is the temptation to automate the evaluation that leads to it. This is fine for quantitative measures such as test coverage, but what we mostly lack today is human expert judgement on software.

    1. Reviewer #1 (Public review):

      This paper describes a number of patterns of epistasis in a large fitness landscape dataset recently published by Papkou et al. The paper is motivated by an important goal in the field of evolutionary biology to understand the statistical structure of epistasis in protein fitness landscapes, and it capitalizes on the unique opportunities presented by this new dataset to address this problem.

      The paper reports some interesting previously unobserved patterns that may have implications for our understanding of fitness landscapes and protein evolution. In particular, Figure 5 is very intriguing. However, I have two major concerns detailed below. First, I found the paper rather descriptive (it makes little attempt to gain deeper insights into the origins of the observed patterns) and unfocused (it reports what appears to be a disjointed collection of various statistics without a clear narrative). Second, I have concerns with the statistical rigor of the work.

      (1) I think Figures 5 and 7 are the main, most interesting, and novel results of the paper. However, I don't think that the statement "Only a small fraction of mutations exhibit global epistasis" accurately describes what we see in Figure 5. To me, the most striking feature of this figure is that the effects of most mutations at all sites appear to be a mixture of three patterns. The most interesting pattern noted by the authors is of course the "strong" global epistasis, i.e., when the effect of a mutation is highly negatively correlated with the fitness of the background genotype. The second pattern is a "weak" global epistasis, where the correlation with background fitness is much weaker or non-existent. The third pattern is the vertically spread-out cluster at low-fitness backgrounds, i.e., a mutation has a wide range of mostly positive effects that are clearly not correlated with fitness. What is very interesting to me is that all background genotypes fall into these three groups with respect to almost every mutation, but the proportions of the three groups are different for different mutations. In contrast to the authors' statement, it seems to me that almost all mutations display strong global epistasis in at least a subset of backgrounds. A clear example is C>A mutation at site 3.

      1a. I think the authors ought to try to dissect these patterns and investigate them separately rather than lumping them all together and declaring that global epistasis is rare. For example, I would like to know whether those backgrounds in which mutations exhibit strong global epistasis are the same for all mutations or whether they are mutation- or perhaps position-specific. Both answers could be potentially very interesting, either pointing to some specific site-site interactions or, alternatively, suggesting that the statistical patterns are conserved despite variation in the underlying interactions.

      1b. Another rather remarkable feature of this plot is that the slopes of the strong global epistasis patterns seem to be very similar across mutations. Is this the case? Is there anything special about this slope? For example, does this slope simply reflect the fact that a given mutation becomes essentially lethal (i.e., produces the same minimal fitness) in a certain set of background genotypes?

      1c. Finally, how consistent are these patterns with some null expectations? Specifically, would one expect the same distribution of global epistasis slopes on an uncorrelated landscape? Are the pivot points unusually clustered relative to an expectation on an uncorrelated landscape?

      1d. The shapes of the DFE shown in Figure 7 are also quite interesting, particularly the bimodal nature of the DFE in high-fitness (HF) backgrounds. I think this bimodality must be a reflection of the clustering of mutation-background combinations mentioned above. I think the authors ought to draw this connection explicitly. Do all HF backgrounds have a bimodal DFE? What mutations occupy the "moving" peak?

      1e. In several figures, the authors compare the patterns for HF and low-fitness (LF) genotypes. In some cases, there are some stark differences between these two groups, most notably in the shape of the DFE (Figure 7B, C). But there is no discussion about what could underlie these differences. Why are the statistics of epistasis different for HF and LF genotypes? Can the authors at least speculate about possible reasons? Why do HF and LF genotypes have qualitatively different DFEs? I actually don't quite understand why the transition between bimodal DFE in Figure 7B and unimodal DFE in Figure 7C is so abrupt. Is there something biologically special about the threshold that separates LF and HF genotypes? My understanding was that this was just a statistical cutoff. Perhaps the authors can plot the DFEs for all backgrounds on the same plot and just draw a line that separates HF and LF backgrounds so that the reader can better see whether the DFE shape changes gradually or abruptly.

      1f. The analysis of the synonymous mutations is also interesting. However, I think a few additional analyses are necessary to clarify what is happening here. I would like to know the extent to which synonymous mutations are more often neutral compared to non-synonymous ones. Then, do synonymous pairs interact in the same way as non-synonymous pairs (i.e., plot Figure 1 for synonymous pairs)? Do synonymous or non-synonymous mutations that are neutral exhibit less epistasis than non-neutral ones? Finally, do non-synonymous mutations alter epistasis among other mutations more often than synonymous mutations do? What about synonymous-neutral versus synonymous-non-neutral? Basically, I'd like to understand the extent to which a mutation that is neutral in a given background is more or less likely to alter epistasis between other mutations than a non-neutral mutation in the same background.

      (2) I have two related methodological concerns. First, in several analyses, the authors employ thresholds that appear to be arbitrary. And second, I did not see any account of measurement errors. For example, the authors chose the 0.05 threshold to distinguish between epistasis and no epistasis, but why this particular threshold was chosen is not justified. Another example: whether the product s12 × (s1 + s2) is greater or smaller than zero for any given mutation is uncertain due to measurement errors. Presumably, how to classify each pair of mutations should depend on the precision with which the fitness of mutants is measured. These thresholds could well be different across mutants. We know, for example, that low-fitness mutants typically have noisier fitness estimates than high-fitness mutants. I think the authors should use a statistically rigorous procedure to categorize mutations and their epistatic interactions. I think it is very important to address this issue. I got very concerned about it when I saw on LL 383-388 that synonymous stop codon mutations appear to modulate epistasis among other mutations. This seems very strange to me and makes me quite worried that this is a result of noise in LF genotypes.
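The concern about fixed thresholds can be illustrated with a minimal sketch. It assumes, purely for illustration, that epistasis is defined additively as ε = f12 − f1 − f2 + f0 on the fitness scale and that each fitness estimate carries an independent standard error. The function names and the numbers are hypothetical and do not come from the manuscript under review. Rather than comparing |ε| to a fixed cutoff such as 0.05, the call is made relative to the propagated measurement uncertainty.

```python
import math


def epistasis_with_error(f0, f1, f2, f12, se0, se1, se2, se12):
    """Return (epsilon, standard error) for one pair of mutations.

    Assumes additive epistasis on the fitness scale and independent
    measurement errors, so the four variances simply add.
    """
    eps = f12 - f1 - f2 + f0
    se = math.sqrt(se0**2 + se1**2 + se2**2 + se12**2)
    return eps, se


def classify(eps, se, z=1.96):
    """Call epistasis only when |eps| exceeds z standard errors."""
    if abs(eps) <= z * se:
        return "no detectable epistasis"
    return "positive epistasis" if eps > 0 else "negative epistasis"


# Hypothetical fitness values: the same epsilon, measured precisely or noisily.
precise = epistasis_with_error(0.0, -0.10, -0.20, -0.22, 0.01, 0.01, 0.01, 0.01)
noisy = epistasis_with_error(0.0, -0.10, -0.20, -0.22, 0.05, 0.05, 0.05, 0.05)
print(classify(*precise))
print(classify(*noisy))
```

With these made-up inputs the same ε of 0.08 is called epistatic when the fitness estimates are precise and undetectable when they are noisy, which is the sense in which a single cutoff cannot serve both high-fitness and low-fitness genotypes.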

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary: 

      The idea is appealing, but the authors have not sufficiently demonstrated the utility of this approach.

      Strengths: 

      Novelty of the approach, potential implications for discovering novel interactions

      Weaknesses:

      The Duong lab introduced their highly elegant peptidisc approach several years ago. In this present work, they combine it with thermal proteome profiling (TPP) and attempt to demonstrate the utility of this combination for identifying novel membrane protein-ligand interactions.

      While I find this idea intriguing, and the approach potentially useful, I do not feel that the authors had sufficiently demonstrated the utility of this approach. My main concern is that no novel interactions are identified and validated. For the presentation of any new methodology, I think this is quite necessary. In addition, except for MsbA, no orthogonal methods are used to support the conclusions, and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.

      We thank the reviewer for their thoughtful comments. In this revision, we have experimentally addressed the reviewer’s concerns in three ways:

      (1) To demonstrate the utility of our MM-TPP method over the detergent-based TPP workflow (termed DB-TPP), we performed a side-by-side comparison using ATP–VO₄ at 51 °C (Figure 3B and Figure 4A). From the DB-TPP dataset, 7.4% of all identified proteins were annotated as ATP-binding, while 6.4% of proteins differentially stabilized were annotated as ATP-binding. In contrast, in the MM-TPP dataset, 9.3% of all identified proteins were annotated as ATP-binding proteins, while 17% of proteins differentially stabilized were annotated as ATP-binding. The lack of enrichment in the detergent-based approach indicates that the observed differences are likely stochastic, rather than a result of specific ATP–VO₄-mediated stabilization as found with MM-TPP. For instance, several key proteins (BCS1, P2RY6, SLC27A2, ABCB1, ABCC2, and ABCC9) found differentially stabilized using the MM-TPP method showed no such pattern in the DB-TPP dataset. This divergence strongly supports the specificity and utility of our Peptidisc approach.

      (2) To demonstrate that MM-TPP can resolve not only the broader effects of ATP–VO₄ but also specific ligand–protein interactions, we employed 2-methylthio-ADP (2-MeS-ADP), a selective agonist of the P2RY12 receptor [PMID: 24784220]. In that case, we observed clear thermal stabilization of P2RY12, with a more than 6-fold increase in stability at both 51 °C and 57 °C (–log₁₀ p > 5.97; Figure 4B and Figure S4). Notably, no other protein, including the structurally related but non-responsive P2RY6 receptor, showed a comparable stabilization fold change at these temperatures.

      (3) To further probe the reproducibility of the method, we performed an independent MM-TPP evaluation with ATP–VO₄ at 51 °C using data-independent acquisition (DIA), in contrast to the data-dependent acquisition (DDA) approach used in the initial study (Figure S5). Overall, 7.8% of all identified proteins were annotated as ATP-binding, and as before, this proportion increased to 17% among proteins with log₂ fold changes greater than 0.5. Specifically, BCS1 and SLC27A2 exhibited strong stabilization (log₂ fold change > 1), while P2RY6, ABCB11, ABCC2, and ABCG2 showed moderate stabilization (log₂ fold changes between 0.5 and 1), and consistent with previous results, P2RX4 was destabilized, with a log₂ fold change below –1. These findings support the consistency and reproducibility of the method across distinct data acquisition methods.
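The enrichment argument in points (1) and (3) reduces to a ratio of two percentages, which can be made explicit with a short sketch. The percentages are the ones quoted above; the function name and the dictionary layout are illustrative assumptions. Because the response reports percentages rather than protein counts, no significance test (for example a hypergeometric test) is attempted here.

```python
def fold_enrichment(pct_in_hits, pct_in_background):
    """Ratio of the ATP-binding fraction among hits to that among all proteins."""
    return pct_in_hits / pct_in_background


# (percent ATP-binding among differentially stabilized proteins,
#  percent ATP-binding among all identified proteins), as quoted above.
# For the DIA run, "hits" are proteins with log2 fold change > 0.5.
datasets = {
    "DB-TPP (DDA)": (6.4, 7.4),
    "MM-TPP (DDA)": (17.0, 9.3),
    "MM-TPP (DIA)": (17.0, 7.8),
}

for name, (hits, background) in datasets.items():
    print(f"{name}: {fold_enrichment(hits, background):.2f}x")
```

On the quoted figures this gives roughly 0.9-fold for the detergent-based workflow and roughly 1.8-fold and 2.2-fold for the two Peptidisc runs, which is the contrast the response relies on.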

      My main concern is that no novel interactions are identified and validated. For the presentation of any new methodology, I think this is quite necessary.  

      The primary objective of our study is to establish and benchmark the MM-TPP workflow using known targets, rather than to discover novel ligand–protein interactions. Identifying new binders requires extensive screening and downstream validations, which we believe is beyond the scope of this methodological report. Instead, our study highlights the sensitivity and reliability of the MM-TPP approach by demonstrating consistent and reproducible results with well-characterized interactions.

      We respectfully disagree with the notion that introducing a new methodology must necessarily include the discovery of novel interactions. For instance, Martinez Molina et al. [PMID: 23828940] introduced the cellular thermal shift assay (CETSA) by validating established targets such as MetAP2 with TNP-470 and CDK2 with AZD-5438, without identifying novel protein–ligand pairs. Similarly, Kalxdorf et al. [PMID: 33398190] published their cell-surface thermal proteome profiling (CS-TPP) using Ouabain to stabilize the Na⁺/K⁺-ATPase pump in K562 cells, and SB431542 to stabilize its canonical target JAG1. In fact, when these methods revealed additional stabilizations, these were not validated but instead interpreted through reasoning grounded in the literature. For instance, they attributed the SB431542-induced stabilization of MCT1 to its reported role in cell migration and tumor invasiveness, and explained that SLC1A2 stabilization is related to the disruption of Na⁺/K⁺-ATPase activity by Ouabain. In the same way, our interpretation of ATP-VO₄–mediated stabilization of Mao-B is supported by AlphaFold 3 structure predictions rather than direct orthogonal assays, which are beyond the scope of our methodological presentation.

      Collectively, the influential studies cited above have set methodological precedents by prioritizing validation and proof-of-concept over merely finding uncharacterized binders. In the same spirit, our work is centred on establishing MM-TPP as a robust platform for probing membrane protein–ligand interactions in a water-soluble format. The discovery of novel binders remains an exciting future direction—one that will build upon the methodological foundation laid by the present study.

      In addition, except for MsbA, no orthogonal methods are used to support the conclusions, and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.

      We deliberately began this study with our model protein, MsbA, examined under both native and overexpressed conditions, to establish agreement between MM-TPP (Figure 2D) and biochemical stability assays (Figure 2A). This validation has provided us with the foundation to confidently extend MM-TPP to the mouse organ proteome. To demonstrate the validity of our workflow, we have used ATP-VO₄ because it has expected targets.

      We note that orthogonal validation often requires overproduction and purification of the candidate proteins, including suitable antibodies, which is a true challenge for membrane proteins. Here, we demonstrate that MM-TPP can detect ligand-induced thermal shifts directly in native membrane preparations, without requiring protein overproduction or purification. We also emphasize several influential studies in TPP, including Martinez Molina et al. (PMID: 23828940) and Fang et al. (PMID: 34188175), which focused primarily on establishing and benchmarking the methodology, rather than on extensive orthogonal validation. In the same spirit, our study prioritizes methodological development, and accordingly, several orthogonal validations are now included in this revision.

      [...] and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.

      To clarify, all analyses on ligand-induced stabilization or destabilization were carried out using LFQ values. The sole exception is on Figure 2B, where we used iBAQ values to depict the relative abundance of proteins within a single sample; this to show MsbA's relative level within the E. coli peptidisc library.

      Respectfully, we disagree with the assertion that we are “quantifying rather small differences in abundances using either iBAQ or LFQ.” We were able to clearly distinguish between stabilizations driven by specific ligands binding to their targets versus those caused by non-specific ligands with broader activity. This is further confirmed by comparing 2-MeS-ADP, a selective ligand for P2RY12, with ATP-VO₄, a highly promiscuous ligand, and AMP-PNP, which exhibits intermediate breadth. When tested in triplicate at 51 °C, 2-MeS-ADP significantly altered the thermal stability of 27 proteins, AMP-PNP 44 proteins, and ATP-VO₄ 230 proteins, consistent with the expectation that broader ligands stabilize more proteins nonspecifically. Importantly, 2-MeS-ADP produced markedly stronger stabilization of its intended target, P2RY12 (–log₁₀ p = 9.32), than the top stabilized proteins for ATP–VO₄ (DNAJB3, –log₁₀ p = 5.87) or AMP-PNP (FTH1, –log₁₀ p = 5.34). Moreover, 2-MeS-ADP did not significantly stabilize proteins that were consistently stabilized by the broad ligands, such as SLC27A2, which was strongly stabilized by both ATP-VO₄ and AMP-PNP (–log₁₀ p > 2.5). Together, these findings demonstrate that MM-TPP can robustly distinguish between broad-spectrum and target-specific ligands, with selective ligands inducing stronger and more physiologically meaningful stabilization at their intended targets compared to promiscuous ligands.

      Finally, we emphasize that our findings are not marginal, but meet quantitative and statistical rigor consistent with best practices in proteomics. We apply dual thresholds combining effect size (|log₂FC| ≥ 1, i.e., at least a two-fold change) with statistical significance (FDR-adjusted p ≤ 0.05)—criteria commonly used in proteomics methodology studies (e.g., PMID: 24942700, 38724498). Moreover, the stabilization and destabilization events we report are reproducible across biological replicates (n = 3), consistent across adjacent temperatures for most targets, and technically robust across acquisition modes (DDA vs. DIA). Taken together, these results reflect statistically valid and biologically meaningful effects, fully aligned with standards set by prior published proteomics studies.
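The dual-threshold rule described above can be sketched as follows. This is an illustration only: the response says "FDR-adjusted" without naming a procedure, so the Benjamini-Hochberg adjustment used here is an assumption, and the function names and example values are hypothetical rather than taken from the study's data or software.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_hits(log2fc, pvalues, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Flag proteins passing both the effect-size and the FDR threshold."""
    adjusted = benjamini_hochberg(pvalues)
    return [abs(fc) >= fc_cutoff and q <= fdr_cutoff
            for fc, q in zip(log2fc, adjusted)]


# Hypothetical values for five proteins.
log2fc = [2.6, 1.2, 0.4, -1.5, 0.1]
pvalues = [0.0001, 0.004, 0.003, 0.20, 0.8]
print(call_hits(log2fc, pvalues))
```

In this made-up example only the first two proteins pass: the third is statistically significant but changes by less than two-fold, and the fourth changes by more than two-fold but is not significant after adjustment. Requiring both conditions is what separates the rule from a cutoff on fold change or p-value alone.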

      Furthermore, the reported changes in abundances are solely based on iBAQ or LFQ analysis. This must be supported by a more quantitative approach such as SILAC or labeled peptides. In summary, I think this story requires a stronger and broader demonstration of the ability of peptidisc-TPP to identify novel physiologically/pharmacologically relevant interactions.

      With respect to labeling strategies, we deliberately avoided using TMT due to concerns about both cost and potential data quality issues. Some recent studies have documented the drawbacks of TMT in contexts directly relevant to our work. For example, a benchmarking study of LiP-MS workflows showed that although TMT increased proteome depth and reduced technical variance, it was less accurate in identifying true drug–protein interactions and produced weaker dose–response correlations compared with label-free DIA approaches [PMID: 40089063]. More broadly, technical reviews have highlighted that isobaric tagging is intrinsically prone to ratio compression and reporter-ion interference due to co-isolation and co-fragmentation of peptides, which flatten measured fold-changes and obscure biologically meaningful differences [PMID: 22580419, 22036744]. In terms of SILAC, the technique requires metabolic incorporation of heavy amino acids, which is feasible in cultured cells but not in physiologically relevant tissues such as the liver organ used here. SILAC mouse models exist, but they are expensive and time-consuming [PMID: 18662549, 21909926]. We are not a mouse lab, and introducing liver organ SILAC labeling in our workflow is beyond the scope of these revisions. We also note that several hallmark TPP studies have been successfully carried out using label-free quantification [PMID: 25278616, 26379230, 33398190, 23828940], establishing this as an accepted and widely applied approach in the field.

      To further support our conclusions, we added controls showing that detergent solubilization of mouse liver membranes followed by SP4 cleanup fails to detect ATP-VO₄–mediated stabilization of ATP-binding proteins, underscoring the necessity of Peptidisc reconstitution for capturing ligand-induced thermal stabilization. We also present new data demonstrating selective stabilization of the P2Y12 receptor by its agonist 2-MeS-ADP, providing orthogonal, receptor-specific validation within the MM-TPP framework. Finally, an orthogonal DIA acquisition on separate replicates confirmed robust ATP-vanadate stabilization of ATP-binding proteins, including BCS1L and SLC27A2. Together, these additions reinforce that the observed stabilizations are genuine, physiologically relevant ligand–protein interactions and highlight the unique advantage of the Peptidisc-based workflow in capturing such events.

      Cited Reference:

      24784220: Zhang J, Zhang K, Gao ZG, et al. Agonist-bound structure of the human P2Y₁₂ receptor. Nature.  2014;509(7498):119-122. doi:10.1038/nature13288. 

      23828940: Martinez Molina D, Jafari R, Ignatushchenko M, et al. Monitoring drug target engagement in cells and tissues using the cellular thermal shift assay. Science. 2013;341(6141):84-87. doi:10.1126/science.1233606.

      33398190: Kalxdorf M, Günthner I, Becher I, et al. Cell surface thermal proteome profiling tracks perturbations and drug targets on the plasma membrane. Nat Methods. 2021;18(1):84-91. doi:10.1038/s41592-020-01022-1.

      34188175: Fang S, Kirk PDW, Bantscheff M, Lilley KS, Crook OM. A Bayesian semi-parametric model for thermal proteome profiling. Commun Biol. 2021;4(1):810. doi:10.1038/s42003-021-02306-8.

      24942700: Cox J, Hein MY, Luber CA, Paron I, Nagaraj N, Mann M. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics. 2014;13(9):2513-2526. doi:10.1074/mcp.M113.031591.

      38724498: Peng H, Wang H, Kong W, Li J, Goh WWB. Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference. Nat Commun. 2024;15(1):3922. doi:10.1038/s41467-024-47899-w.

      40089063: Koudelka T, Bassot C, Piazza I. Benchmarking of quantitative proteomics workflows for limited proteolysis mass spectrometry. Mol Cell Proteomics. 2025;24(4):100945. doi:10.1016/j.mcpro.2025.100945.

      22580419: Christoforou AL, Lilley KS. Isobaric tagging approaches in quantitative proteomics: the ups and downs. Anal Bioanal Chem. 2012;404(4):1029-1037. doi:10.1007/s00216-012-6012-9. 

      22036744: Christoforou AL, Lilley KS. Isobaric tagging approaches in quantitative proteomics: the ups and downs. Anal Bioanal Chem. 2012;404(4):1029-1037. doi:10.1007/s00216-012-6012-9. 

      18662549: Krüger M, Moser M, Ussar S, et al. SILAC mouse for quantitative proteomics uncovers kindlin-3 as an essential factor for red blood cell function. Cell. 2008;134(2):353-364. doi:10.1016/j.cell.2008.05.033.

      21909926: Zanivan S, Krueger M, Mann M. In vivo quantitative proteomics: the SILAC mouse. Methods Mol Biol. 2012;757:435-450. doi:10.1007/978-1-61779-166-6_25. 

      25278616: Kalxdorf M, Becher I, Savitski MM, et al. Temperature-dependent cellular protein stability enables high-precision proteomics profiling. Nat Methods. 2015;12(12):1147-1150. doi:10.1038/nmeth.3651.

      26379230: Savitski MM, Reinhard FBM, Franken H, et al. Tracking cancer drugs in living cells by thermal profiling of the proteome. Science. 2015;346(6205):1255784. doi:10.1126/science.1255784. 

      33452728: Leuenberger P, Ganscha S, Kahraman A, et al. Cell-wide analysis of protein thermal unfolding reveals determinants of thermostability. Science. 2020;355(6327):eaai7825. doi:10.1126/science.aai7825. 

      23066101: Savitski MM, Zinn N, Faelth-Savitski M, et al. Quantitative thermal proteome profiling reveals ligand interactions and thermal stability changes in cells. Nat Methods. 2013;10(12):1094-1096. doi:10.1038/nmeth.2766.  

      30858367: Piazza I, Kochanowski K, Cappelletti V, et al. A machine learning-based chemoproteomic approach to identify drug targets and binding sites in complex proteomes. Nat Commun. 2019;10(1):1216. doi:10.1038/s41467-019-09199-0.

      Reviewer #2 (Public Review):

      Summary:

      The membrane mimetic thermal proteome profiling (MM-TPP) presented by Jandu et al. seems to be a useful way to minimize the interference of detergents in efficient mass spectrometry analysis of membrane proteins. Thermal proteome profiling is a mass spectrometric method that measures binding of a drug to different proteins in a cell lysate by monitoring thermal stabilization of the proteins because of the interaction with the ligands that are being studied. This method has been underexplored for membrane proteome because of the inefficient mass spectrometric detection of membrane proteins and because of the interference from detergents that are used often for membrane protein solubilization.

      Strengths:

      In this report the binding of ligands to membrane protein targets has been monitored in crude membrane lysates or tissue homogenates exalting the efficacy of the method to detect both intended and off-target binding events in a complex physiologically relevant sample setting.

      The manuscript is lucidly written and the data presented seems clear. The only insignificant grammatical error I found was that the 'P' in the word peptidisc is not capitalized in the beginning of the methods section "MM-TPP profiling on membrane proteomes". The clear writing made it easy to understand and evaluate what has been presented. Kudos to the authors.

      Weaknesses:

      While this is a solid report and a promising tool for analyzing membrane protein drug interactions, addressing some of the minor caveats listed below could make it much more impactful.

      The authors claim that MM-TPP is done by "completely circumventing structural perturbations invoked by detergents". This may not be entirely accurate, because before reconstitution of the membrane proteins in peptidisc, the membrane fractions are solubilized by 1% DDM. The solubilization and following centrifugation step lasts at least for 45 min. It is less likely that all the structural perturbations caused by DDM to various membrane proteins and their transient interactions become completely reversed or rescued by peptidisc reconstitution.

      We thank the reviewer for this insightful comment. In response, we have revised the sentence and expanded the discussion to clarify that the Peptidisc provides a complementary approach to detergent-based preparations for studying membrane proteins, preserving native lipid–protein interactions and stabilization effects that may be diminished in detergent.

      To further address the structural perturbations invoked by detergents, and as already detailed in our response to Reviewer 1, we have compared the thermal profile of the Peptidisc library to the mouse liver membranes solubilized with 1% DDM, after incubation with ATP–VO₄ at 51 °C (Figure 4A). The results with the detergent extract revealed random patterns of stabilization and destabilization, with only 6.4% of differentially stabilized proteins being ATP-binding, comparable to the 7.4% observed in the background. In contrast, in the Peptidisc library, 17% of differentially stabilized proteins were ATP-binding, compared to 9.3% in the background. Thus, while Peptidisc reconstitution does not fully avoid initial detergent exposure, these findings underscore the importance of implementing Peptidisc in the TPP workflow when dealing with membrane proteins.

      In the introduction, the authors make statements such as "..it is widely acknowledged that even mild detergents can disrupt protein structures and activities, leading to challenges in accurately identifying drug targets.." and "[peptidisc] libraries are instrumental in capturing and stabilizing IMPs in their functional states while preserving their interactomes and lipid allosteric modulators...". These need to be rephrased, as it has been shown by countless studies that even with membrane proteins suspended in micelles, robust ligand binding assays and binding kinetics have been performed, leading to physiologically relevant conclusions and identification of protein-protein and protein-ligand interactions.

      We thank the reviewer for this valuable feedback and fully agree with the point raised. In response, we have revised the Introduction and conclusion to moderate the language concerning the limitations of detergent use. We now explicitly acknowledge that numerous studies have successfully used detergent micelles for ligand-binding assays and kinetic analyses, yielding physiologically relevant insights into both protein–protein and protein–ligand interactions [e.g., PMID: 22004748, 26440106, 31776188].

      At the same time, we clarify that the Peptidisc method offers a complementary advantage, particularly in the context of thermal proteome profiling (TPP), which involves mass spectrometry workflows that are incompatible with detergents. In this setting, Peptidiscs facilitate the detection of ligand-binding events that may be more difficult to observe in detergent micelles.

      We have reframed our discussion accordingly to present Peptidiscs not as a replacement for detergent-based methods, but rather as a complementary tool that broadens the available methodological landscape for studying membrane protein interactions.

      If the method involves detergent solubilization, for example using 1% DDM, it is a bit disingenuous to argue that 'interactomes and lipid allosteric modulators' characterized by low-affinity interactions will remain intact or can be rescued upon detergent removal. Authors should discuss this or at least highlight the primary caveat of the peptidisc method of membrane protein reconstitution - which is that it begins with detergent solubilization of the proteome and does not completely circumvent structural perturbations invoked by detergents.

      We would like to clarify that, in our current workflow, ligand incubation occurs after reconstitution into Peptidiscs. As such, the method is designed to circumvent the negative effects of detergent during the critical steps involving low-affinity interactions.

      That said, we fully acknowledge that Peptidisc reconstitution begins with detergent solubilization (e.g., 1% DDM), and we have revised the conclusion to explicitly state this important caveat. As the reviewer correctly points out, this initial step may introduce some structural perturbations or result in the loss of weakly associated lipid modulators.

      However, reconstitution into Peptidiscs rapidly restores a detergent-free environment for membrane proteins, which has been shown in our previous studies [PMID: 38577106, 38232390, 31736482, 31364989] to mitigate these effects. Specifically, we have demonstrated that time-limited DDM exposure, followed by Peptidisc reconstitution, minimizes membrane protein delipidation, enhances thermal stability, retains functionality, and preserves multi-protein assemblies.

      It would also be important to test detergents that are even milder than 1% DDM and ones which are harsher than 1% DDM to show that this method of reconstitution can indeed rescue the perturbations to the structure and interactions of the membrane protein done by detergents during solubilization step. 

      We selected 1% DDM based on our previous work [PMID: 37295717, 39313981, 38232390], where it consistently enabled robust and reproducible solubilization for Peptidisc reconstitution. We agree that comparing milder detergents (e.g., LMNG) and harsher ones (e.g., SDC) would provide valuable insights into how detergent strength influences structural perturbations, and how effectively these can be mitigated by Peptidisc reconstitution. Preliminary data (not shown) from mouse liver membranes indicate broadly similar proteomic profiles following solubilization with DDM, LMNG, and SDC, although potential differences in functional activity or ligand binding remain to be investigated.

      Based on the methods provided, it appears that the final amount of detergent in peptidisc membrane protein library was 0.008%, which is ~150 µM. The CMC of DDM depending on the amount of NaCl could be between 120-170 µM.

      While we cannot entirely rule out the presence of residual DDM (0.008%) in the raw library, its free concentration may be lower than initially estimated. This is related to the formation of mixed micelles with the amphipathic peptide scaffold, which is supplied in excess during reconstitution. These mixed micelles are subsequently removed during the ultrafiltration step. Furthermore, in related work using His-tagged Peptidiscs [PMID: 32364744], we purified the library by nickel-affinity chromatography following a 5× dilution into a detergent-free buffer. Although this purification step reduced the number of soluble proteins, the same membrane proteins were retained, suggesting that any residual detergent does not significantly interfere with Peptidisc reconstitution. Supporting this, our MM-TPP assays on purified libraries (data not shown) consistently demonstrated stabilization of ATP-binding proteins (e.g., SLC27A2, DNAJB3), indicating that the observed ligand–protein interactions result from successful incorporation into Peptidiscs.
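The conversion behind the reviewer's "0.008% is ~150 µM" estimate can be checked with a few lines of arithmetic. The sketch assumes the percentage is weight per volume, which the text does not state, and uses the molar mass of n-dodecyl-β-D-maltoside (C24H46O11, about 510.62 g/mol). The function name is illustrative.

```python
DDM_MOLAR_MASS = 510.62  # g/mol, n-dodecyl-beta-D-maltoside (C24H46O11)


def percent_wv_to_micromolar(percent_wv, molar_mass):
    """Convert a % (w/v) concentration to micromolar.

    Assumes 1% (w/v) means 1 g per 100 mL, i.e. 10 g/L.
    """
    grams_per_litre = percent_wv * 10.0
    return grams_per_litre / molar_mass * 1e6


residual = percent_wv_to_micromolar(0.008, DDM_MOLAR_MASS)
print(f"0.008% DDM is about {residual:.0f} uM")
```

Under these assumptions the residual concentration comes to about 157 µM, in line with the reviewer's estimate and inside the 120-170 µM critical micelle concentration range quoted above. This is the total concentration; the response's argument concerns how much of it remains free rather than bound in mixed micelles with the peptide scaffold.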

      Perhaps, to completely circumvent the perturbations from detergents other methods of detergent-free solubilization such as using SMA polymers and SMALP reconstitution could be explored for a comparison. Moreover, a comparison of the peptidisc reconstitution with detergent-free extraction strategies, such as SMA copolymers, could lend more strength to the presented method.

      We agree that detergent-free methods such as SMA polymers hold promise for membrane protein solubilization. However, in preliminary single-replicate experiments using SMA2000 at 51 °C in the presence of ATP–VO₄ (data not shown), we observed broad, non-specific stabilization effects. Of the 2,287 quantified proteins, 9.3% were annotated as ATP-binding, yet 9.9% of the 101 proteins showing a log₂ fold change >1 or <–1 were ATP-binding, indicating no meaningful enrichment. Given this lack of specificity and the limited dataset, we chose not to pursue further SMA experiments and have not included them here. However, in a recent study (https://doi.org/10.1101/2025.08.25.672181), we directly compared Peptidisc, SMA, and nanodiscs for liver membrane proteome profiling. In that work, Peptidisc outperformed both SMA and nanodiscs in detecting membrane protein dysregulation between healthy and diseased liver. By extension, we expect Peptidisc to offer superior sensitivity and specificity for detecting ligand-induced stabilization events, such as those observed here with ATP–vanadate.
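
      The "no meaningful enrichment" statement can be illustrated with a one-sided hypergeometric test. This is a minimal sketch, not the analysis used in the study: the protein counts are reconstructed by rounding the percentages quoted above, and the helper name `hypergeom_sf` is hypothetical.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which belong to the category of interest."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Counts reconstructed from the quoted percentages (an assumption):
N = 2287                    # quantified proteins
K = round(0.093 * N)        # ATP-binding proteins in the background
n = 101                     # proteins with |log2 fold change| > 1
k = round(0.099 * n)        # ATP-binding proteins among those hits

expected_hits = n * K / N
p_value = hypergeom_sf(k, N, K, n)
print(f"observed={k}, expected={expected_hits:.1f}, p={p_value:.2f}")
```

      With these reconstructed counts, the observed number of ATP-binding hits sits close to the number expected by chance, which is consistent with the conclusion drawn in the response.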

      Cross-verification of the identified interactions, and subsequent stabilization or destabilizations, should be demonstrated by other in vitro methods of thermal stability and ligand binding analysis using purified protein to support the efficacy of the MM-TPP method. An example cross-verification using SDS-PAGE, of the well-studied MsbA, is shown in Figure 2. In a similar fashion, other discussed targets such as, BCS1L, P2RX4, DgkA, Mao-B, and some un-annotated IMPs shown in supplementary figure 3 that display substantial stabilization or destabilization should be cross-verified.

      We appreciate this suggestion and note that a similar point was raised in R1’s comment “In addition, except for MsbA, no orthogonal methods are used to support the conclusions, and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.” We have developed a detailed response to R1 on this matter, which equally applies here. 

      Cited Reference:

      35616533: Young JW, Wason IS, Zhao Z, et al. Development of a Method Combining Peptidiscs and Proteomics to Identify, Stabilize, and Purify a Detergent-Sensitive Membrane Protein Assembly. J Proteome Res. 2022;21(7):1748-1758. doi:10.1021/acs.jproteome.2c00129. PMID: 35616533.

      31364989: Carlson ML, Stacey RG, Young JW, et al. Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries. Elife. 2019;8:e46615. doi:10.7554/eLife.46615. 

      22004748: O'Malley MA, Helgeson ME, Wagner NJ, Robinson AS. Toward rational design of protein detergent complexes: determinants of mixed micelles that are critical for the in vitro stabilization of a G-protein coupled receptor. Biophys J. 2011;101(8):1938-1948. doi:10.1016/j.bpj.2011.09.018.

      26440106: Allison TM, Reading E, Liko I, Baldwin AJ, Laganowsky A, Robinson CV. Quantifying the stabilizing effects of protein-ligand interactions in the gas phase. Nat Commun. 2015;6:8551. doi:10.1038/ncomms9551.

      31776188: Beckner RL, Zoubak L, Hines KG, Gawrisch K, Yeliseev AA. Probing thermostability of detergent-solubilized CB2 receptor by parallel G protein-activation and ligand-binding assays. J Biol Chem. 2020;295(1):181-190. doi:10.1074/jbc.RA119.010696.

      38577106: Jandu RS, Yu H, Zhao Z, Le HT, Kim S, Huan T, Duong van Hoa F. Capture of endogenous lipids in peptidiscs and effect on protein stability and activity. iScience. 2024;27(4):109382. doi:10.1016/j.isci.2024.109382.

      38232390: Antony F, Brough Z, Zhao Z, Duong van Hoa F. Capture of the Mouse Organ Membrane Proteome Specificity in Peptidisc Libraries. J Proteome Res. 2024;23(2):857-867. doi:10.1021/acs.jproteome.3c00825.

      31736482: Saville JW, Troman LA, Duong Van Hoa F. PeptiQuick, a one-step incorporation of membrane proteins into biotinylated peptidiscs for streamlined protein binding assays. J Vis Exp. 2019;(153). doi:10.3791/60661. 

      37295717: Zhao Z, Khurana A, Antony F, et al. A Peptidisc-Based Survey of the Plasma Membrane Proteome of a Mammalian Cell. Mol Cell Proteomics. 2023;22(8):100588. doi:10.1016/j.mcpro.2023.100588. 

      39313981: Antony F, Brough Z, Orangi M, Al-Seragi M, Aoki H, Babu M, Duong van Hoa F. Sensitive Profiling of Mouse Liver Membrane Proteome Dysregulation Following a High-Fat and Alcohol Diet Treatment. Proteomics. 2024;24(23-24):e202300599. doi:10.1002/pmic.202300599. 

      32364744: Young JW, Wason IS, Zhao Z, Rattray DG, Foster LJ, Duong Van Hoa F. His-Tagged Peptidiscs Enable Affinity Purification of the Membrane Proteome for Downstream Mass Spectrometry Analysis. J Proteome Res. 2020;19(7):2553-2562. doi:10.1021/acs.jproteome.0c00022.

      32591519: The M, Käll L. Focus on the spectra that matter by clustering of quantification data in shotgun proteomics. Nat Commun. 2020;11(1):3234. doi:10.1038/s41467-020-17037-3. 

      33188197: Kurzawa N, Becher I, Sridharan S, et al. A computational method for detection of ligand-binding proteins from dose range thermal proteome profiles. Nat Commun. 2020;11(1):5783. doi:10.1038/s41467-020-19529-8. 

      26524241: Reinhard FBM, Eberhard D, Werner T, et al. Thermal proteome profiling monitors ligand interactions with cellular membrane proteins. Nat Methods. 2015;12(12):1129-1131. doi:10.1038/nmeth.3652. 

      23828940: Martinez Molina D, Jafari R, Ignatushchenko M, et al. Monitoring drug target engagement in cells and tissues using the cellular thermal shift assay. Science. 2013;341(6141):84-87. doi:10.1126/science.1233606. 

      32133759: Mateus A, Kurzawa N, Becher I, et al. Thermal proteome profiling for interrogating protein interactions. Mol Syst Biol. 2020;16(3):e9232. doi:10.15252/msb.20199232. 

      14755328: Dorsam RT, Kunapuli SP. Central role of the P2Y12 receptor in platelet activation. J Clin Invest. 2004;113(3):340-345. doi:10.1172/JCI20986. 

      Reviewer #1 (Recommendations for the authors):

      “The authors use iBAC or LFQ to compare across samples. This inconsistency is puzzling. As far as I know, LFQ should always be used when comparing across samples”

      As mentioned above, we use iBAQ only in Fig. 2B to illustrate within-sample relative abundance; all comparative analyses elsewhere use LFQ. We have updated the Fig. 2B legend to state this explicitly.

      We used iBAQ in Fig. 2B because it gives a measure of protein abundance within a sample by normalizing the summed peptide intensities by the number of theoretically observable peptides. This normalization facilitates comparisons between proteins within the same sample, offering a clearer understanding of their relative molar proportions [PMID: 33452728]. LFQ, by contrast, is optimized for comparing the same protein across different samples. It achieves this by performing delayed normalization to reduce run-to-run variability and by applying maximal peptide ratio extraction, which integrates pairwise peptide intensity ratios across all samples to build a consistent protein-level quantification matrix [PMID: 24942700]. These features make LFQ more robust to missing values and technical variation, thereby enabling accurate detection of relative abundance changes in the same protein under different experimental conditions. This distinction is well supported by the proteomics literature: Smits et al. [PMID: 23066101] used iBAQ specifically to determine the relative abundance of proteins within one sample, whereas LFQ was applied for comparative analyses between conditions.
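
      The iBAQ normalization described above (summed peptide intensities divided by the number of theoretically observable peptides) can be sketched in a few lines. The protein names and intensity values below are hypothetical and serve only to show why the normalization matters for within-sample comparisons; this is not the MaxQuant implementation.

```python
def ibaq(peptide_intensities: list[float], n_theoretical_peptides: int) -> float:
    """Summed peptide intensity divided by the number of theoretically
    observable peptides for the protein."""
    if n_theoretical_peptides <= 0:
        raise ValueError("n_theoretical_peptides must be positive")
    return sum(peptide_intensities) / n_theoretical_peptides


# Hypothetical sample: (observed peptide intensities, theoretical peptide count).
sample = {
    "protein_A": ([4.0e6, 2.0e6, 2.0e6], 40),   # large protein, many possible peptides
    "protein_B": ([8.0e6], 10),                 # small protein, few possible peptides
}

ibaq_values = {name: ibaq(ints, n) for name, (ints, n) in sample.items()}
for name, value in ibaq_values.items():
    print(name, value)
```

      Both hypothetical proteins have the same summed intensity, yet the smaller protein receives the higher iBAQ value, reflecting a higher estimated molar abundance within the sample.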

      “[Regarding Figure 2A] Why does the control also contain ATP-vanadate? Also, I am not aware of a commercially available chemical "ATP-VO4". I assume this is a mistake”

      The control condition in Figure 2A was mislabeled, and the figure has been corrected to remove this discrepancy. In our experiments, ATP and orthovanadate (VO<sub>4</sub>) were added together, and for simplicity this was annotated as “ATP-VO<sub>4</sub>.” 

      “[Regarding Figure 2B] What is the fold change in MsbA iBAQ values? It seems that the differences are quite small, and as such require a more quantitative approach than iBAQ (e.g SILAC or some other internal standard). In addition, what information does this panel add relative to 2C”

      The figure has been updated to clarify that the values shown are log₂-transformed iBAQ intensities. Figures 2B and 2C are complementary: Figure 2B shows that in the control sample, MsbA’s peptide abundance decreases with temperature (51, 56, and 61 °C) relative to the remaining bulk proteins. Figure 2C shows the specific thermal profiles of MsbA in control and ATP–vanadate conditions. To make this clearer, we have added a sentence to the Results section explaining the specific role of Figure 2B.

      Together, these panels indicate that the method can identify ligand-induced stabilization even for proteins whose abundance decreases faster than the bulk during the TPP assay. We have provided the rationale for not using SILAC or TMT labeling in our public response.

      “[Regarding Figure 2C] Although not mentioned in the legend, I assume this is iBAQ quantification, which as mentioned above isn't accurate enough for such small differences. In addition, I find this data confusing: why is MsbA more stable at the lower temperatures in the absence of ATP-vanadate? The smoothed-line representation is misleading, certainly given the low number of data points”

      The data presented represent LFQ values for MsbA, and we have updated the figure legend to clearly indicate this. Additionally, as suggested, we have removed the smoothing line to more accurately reflect the data. Regarding the reviewer’s concern about stability at lower temperatures, we note that MsbA exhibits comparable abundance at 38 °C and 46 °C under both conditions, with overlapping error bars. We therefore interpret these data as indicating no significant difference in stability at the lower temperatures, with ligand-dependent stabilization becoming apparent only at elevated temperatures. We do not exclude the possibility that MsbA stability at these temperatures is affected by the conformational dynamics of this ABC transporter upon ATP binding and hydrolysis.

      “[Regarding Figure 3A] is this raw LFQ data? Why did the authors suddenly change from iBAQ to LFQ? I find this inconsistency puzzling”

      To clarify, all analyses of protein stabilization or destabilization presented in the manuscript are based on LFQ values. The only instance where iBAQ was used is Figure 2B, where it served to illustrate the relative peptide abundance of MsbA within the same sample. We have revised the figure legends and text to make this distinction explicit and ensure consistency in presentation.

      “[Regarding Figure 3B] The non-specific ATP-dependent stabilization increases the likelihood of false positive hits. This limitation is not mentioned by the authors. I think it is important to show other small molecules, in addition to ATP. The authors suggest that their approach is highly relevant for drug screening. Therefore, a good choice is to test an effect of a known stabilizing drug (eg VX-809 and CFTR)”

      We thank the reviewer for this suggestion. As noted in the manuscript (Results and Discussion sections), ATP is a natural hydrotrope and is therefore expected to induce broad, non-specific stabilization effects, a phenomenon also observed in previous proteome-wide studies, which demonstrated ATP’s widespread influence on cytosolic protein solubility and thermal stability (PMID: 30858367). To demonstrate that MM-TPP can resolve specific ligand–protein interactions beyond these global ATP effects, we tested 2-methylthio-ADP (2-MeS-ADP), a selective agonist of P2RY12 (PMID: 14755328). In these experiments, we observed robust and reproducible stabilization of P2RY12 at both 51°C and 57°C, with no consistent stabilization of unrelated proteins across temperatures. This provides direct evidence that our workflow can distinguish specific from non-specific ligand-induced effects. We selected 2-MeS-ADP because of its structural stability and its higher receptor affinity than ADP, allowing us to extend our existing workflow while testing a receptor-specific interaction. We agree that extending this approach to clinically relevant small-molecule drugs, such as VX-809 with CFTR, would further underscore the pharmacological potential of MM-TPP, and we have now noted this as an important avenue for future studies.

      “X axis of Figure 3B: Log 2 fold difference of what? iBAQ? LFQ? Similar ambiguity regarding the Y axis of 3E. What peptide? And why the constant changes in estimating abundances?”

      We thank the reviewer for pointing out these inaccuracies in the figure annotations. As mentioned above, all analyses (except Figure 2B) are based on LFQ values. We have revised the figure legends and text to make this clear.

      In Figure 3E, “peptide intensity” refers to log2 LFQ peptide intensities derived from the BCS1L protein, as indicated in the figure caption. 

      “The authors suggest that P2RY6 and P2RY12 are stabilized by ADP, the hydrolysis product of ATP. Currently, the support for this suggestion is highly indirect. To support this claim, the authors need to directly show the effect of ADP. In reference to the alpha fold results shown in Figure 4D, the authors state that "Collectively, these data highlight the ability of MM-TPP to detect the side effects of parent compounds, an important consideration for drug development". To support this claim, it is necessary to show that Mao-B is indeed best stabilized with ADP or AMP, rather than ATP.”

      In this revision, we chose not to test ADP directly, as it is a broadly binding, relatively weak ligand that would likely stabilize many proteins without revealing clear target-specific effects. Since we had already evaluated ATP-VO₄, a similarly broad, non-specific ligand, additional testing with ADP would provide limited additional insight. Instead, we prioritized 2-methylthio-ADP, a selective agonist of P2RY12, to more effectively demonstrate the specificity of MM-TPP. With this ligand, we observed clear and reproducible stabilization of P2RY12, underscoring the ability of MM-TPP to resolve receptor–ligand interactions beyond ATP’s broad hydrotropic effects. Importantly, and as expected, we did not observe stabilization of the related purinergic receptor P2RY6, further supporting the specificity of the observed effect.

      We have also revised the AlphaFold-related statement in Figure 4D to adopt a more cautious tone: “Collectively, these data suggest that MM-TPP may detect potential side effects of parent compounds, an important consideration for drug development.” In this context, we use AlphaFold not as a validation tool, but rather as a structural aid to help rationalize why certain off-target proteins (e.g., ATP with Mao-B) exhibit stabilization.

      Reviewer #2 (Recommendations for the authors):

      “In the main text, it will be useful to include the unique peptides table of at least the targets discussed in the manuscript. For example, in presence of AMP-PNP at 51oC P2RY6 shows 4-6 peptides in all n=3 positive & negative ionization modes. But, for P2RY12 only 1-3 peptides were observed. Depending on the sequence length and the relative abundance in the cell of a protein of interest, the number of peptides observed could vary a lot per protein. Given the unique peptide abundance reported in the supplementary file, for various proteins in different conditions, it appears the threshold of observation of two unique peptides for a protein to be analyzed seems less stringent.”

      By applying a filter requiring at least two unique peptides in at least one replicate, we exclude, on average, 15–20% of the total identified proteins. We consider this a reasonable level of stringency that balances confidence in protein identification with the retention of relevant data. This threshold was selected because it aligns with established LC-MS/MS data analysis practices (PMID: 32591519, 33188197, 26524241), and we have included these references in the Methods section to justify our approach. We have included in this revision a Supplemental Table 2 showing the unique peptide counts for proteins highlighted in this study.  
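
      The filtering rule stated above (at least two unique peptides in at least one replicate) can be expressed as a short sketch. The protein identifiers and peptide counts are hypothetical, and the function name is illustrative rather than taken from the analysis pipeline.

```python
def passes_peptide_filter(unique_peptide_counts: list[int], min_peptides: int = 2) -> bool:
    """Keep a protein if any replicate reports at least `min_peptides` unique peptides."""
    return any(count >= min_peptides for count in unique_peptide_counts)


# Hypothetical unique-peptide counts per replicate (n = 3).
counts = {
    "PROT_A": [4, 5, 6],   # kept: every replicate passes
    "PROT_B": [1, 2, 1],   # kept: one replicate reaches the threshold
    "PROT_C": [1, 1, 0],   # excluded: no replicate reaches the threshold
}

kept = {protein for protein, reps in counts.items() if passes_peptide_filter(reps)}
excluded_fraction = 1 - len(kept) / len(counts)
print(sorted(kept), f"excluded: {excluded_fraction:.0%}")
```

      Raising `min_peptides`, or requiring the threshold in every replicate instead of any replicate, would make the filter more stringent at the cost of discarding low-coverage proteins.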

      “It appears that the time of heat treatment for peptidisc library subjected to MM-TPP profiling was chosen as 3 min based on the results presented in Supplementary Figure 1A, especially the loss of MsbA observed in 1% DDM after 3 min heat perturbation. However, when reconstituted in peptidisc there seems to be no loss in MsbA even after 12 mins at 45oC. So, perhaps a longer heat treatment would be a more efficient perturbation.”

      Previous studies indicate that heat exposure of 3–5 minutes is optimal for visualizing protein denaturation (PMID: 23828940, 32133759). We have added a statement to the Results section to justify our choice of heat exposure. Although MsbA remains stable at 45 °C for extended periods, higher temperatures allow for more effective perturbation to reveal destabilization. Supplementary Figure 1A specifically illustrates MsbA instability in detergent environments.

      “Some of the stabilized temperatures listed in Table 1 are a bit confusing. For example, ABCC3 and ABCG2. In the case of ABCC3 stabilization was observed at 51oC and 60oC, but 56oC is not mentioned. In the same way, 51oC is not mentioned for ABCG2. You would expect protein to be stabilized at 56oC if it is stabilized at both 51oC and 60oC. So, it is unclear if the stabilizations were not monitored for these proteins at the missing temperatures in the table or if no peptides could be recorded at these temperatures as in the case of P2RX4 at 60oC in Figure 4C.”

      Both scenarios are represented in our data. For some proteins, like ABCG2, sufficient peptide coverage was achieved, but no stabilization was observed at intermediate temperatures (e.g., 56 °C), likely because the perturbation was not strong enough to reveal an effect. In other cases, such as ABCC3 at 56 °C or P2RX4 at 60 °C, the proteins were not detected due to insufficient peptide identifications at those temperatures, which explains their omission from the table. 

      “In Figure 4C, it is perplexing to note that despite n = 3 there were no peptide fragments detected for P2RX4 at 60oC in presence of ATP-VO4, but they were detected in presence of AMP-PNP. It will be useful to learn authors explanation for this, especially because both of these ligands destabilize P2RX4. In Figure 4B, it would have been great to see the effect of ADP too, to corroborate the theory that ATP metabolites could impact the thermal stability.”

      In Figure 4C, the absence of P2RX4 peptide detection at 60 °C with ATP–VO₄ mirrors variability observed in the corresponding control (n = 6). Specifically, neither the control nor ATP–VO₄ produced unique peptides for P2RX4 at 60 °C in that replicate, whereas peptides were detected at 60 °C in other replicates for both the control and AMP-PNP, and at 64 °C for ATP–VO<sub>4</sub>, the controls, and AMP-PNP. Such missing values are a natural feature of MS-based proteomics and can arise from multiple technical factors, including inconsistent heating, incomplete digestion, stochastic MS injection, or interference from Peptidisc peptides. We therefore interpret the absence of peptides in this replicate as a technical artifact rather than evidence against protein destabilization. Importantly, the overall dataset consistently shows that both ATP–VO₄ and AMP-PNP destabilize P2RX4, supporting their characterization as broad, non-specific ligands with off-target effects.

      Because ATP and ADP belong to the same class of broadly binding, non-specific ligands, additional testing with ADP would not provide meaningful mechanistic insight. Instead, we chose to test 2-methylthio-ADP, a selective P2RY12 agonist. This experiment revealed robust, reproducible stabilization of P2RY12, without consistent effects on unrelated proteins at 51 °C and 57 °C, thereby demonstrating the ability of MM-TPP to detect specific receptor–ligand interactions.

      Finally, we note that P2RX4 is not a primary target of ATP–VO<sub>4</sub> or AMP-PNP. Consequently, the observed destabilization of P2RX4 is expected to be less pronounced than the strong, physiologically consistent stabilization of ABC transporters by ATP–VO<sub>4</sub>, as shown in Figure 3D, where the majority of ABC transporters are thermally stabilized across all tested temperatures.

      “As per Figure 4, P2Y receptors P2RY6 and P2RY12 both showed great thermal stability in presence of ATP-VO4 despite their preference for ADP. The authors argue this could be because of ATP metabolism, and binding of the resultant ADP to the P2RY6. If P2RX4 prefers ATP and not the metabolized product ADP that apparently is available, ideally you should not see a change in stability. A stark destabilization would indicate interaction of some sorts. P2X receptors are activated by ATP and are not naturally activated by AMP-PNP. So, destabilization of P2RX4 upon binding to ATP that can activate P2X receptors is conceivable. However, destabilization both in presence of ATP-VO4 and AMP-PNP is unclear. It is perhaps useful to test effect of ADP using this method, and maybe even compare some antagonists such as TNP-ATP.”

      In this study, we did not directly test ADP, as we had already demonstrated that MM-TPP detects stabilization by broad-binding ligands such as ATP–VO₄. Instead, we focused on a more selective ligand, 2-MeS-ADP, a specific agonist of P2RY12 [PMID: 14755328]. Here, we observed robust and reproducible stabilization of P2RY12 at 51 °C and 57 °C, while P2RY6 showed no significant changes, and no other proteins were consistently stabilized (Figure 4B, S4). This confirms that MM-TPP can distinguish specific ligand–receptor interactions from broader ATP-induced effects. To further explore the assay’s nuance and sensitivity, testing additional nucleotide ligands—including antagonists like TNP-ATP or ATPγS—would provide valuable insights, and we have identified this as an important future direction.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The paper presents a model for sequence generation in the zebra finch HVC, which adheres to cellular properties measured experimentally. However, the model is fine-tuned and exhibits limited robustness to noise inherent in the inhibitory interneurons within the HVC, as well as to fluctuations in connectivity between neurons. Although the proposed microcircuits are introduced as units for sub-syllabic segments (SSS), the backbone of the network remains a feedforward chain of HVC_RA neurons, similar to previous models.

      Strengths:

      The model incorporates all three of the major types of HVC neurons. The ion channels used and their kinetics are based on experimental measurements. The connection patterns of the neurons are also constrained by the experiments.

      Weaknesses:

      The model is described as consisting of micro-circuits corresponding to SSS. This presentation gives the impression that the model's structure is distinct from previous models, which connected HVC_RA neurons in feedforward chain networks (Jin et al 2007, Li & Greenside, 2006; Long et al 2010; Egger et al 2020). However, the authors implement single HVC_RA neurons into chain networks within each micro-circuit and then connect the end of the chain to the start of the chain in the subsequent micro-circuit. Thus, the HVC_RA neuron in their model forms a single-neuron chain. This structure is essentially a simplified version of earlier models.

      In the model of the paper, the chain network drives the HVC_I and HVC_X neurons. The role of the micro-circuits is more significant in organizing the connections: specifically, from HVC_RA neurons to HVC_I neurons, and from HVC_I neurons to both HVC_X and HVC_RA neurons.

      We thank Reviewer 1 for their thoughtful comments.

      While the reviewer is correct that the propagation of sequential activity in this model is primarily carried by HVC<sub>RA</sub> neurons in a feed-forward manner, we need to emphasize that this is true only if there is no intrinsic or synaptic perturbation to the HVC network. For example, we showed in Figures 10 and 12 how altering the intrinsic properties of HVC<sub>X</sub> neurons or of interneurons disrupts sequence propagation. In other words, while HVC<sub>RA</sub> neurons are the key forces carrying the chain forward, the interplay between excitation and inhibition in our network as well as the intrinsic parameters for all classes of HVC neurons are equally important forces in carrying the chain of activity forward. Thus, the stability of activity propagation necessary for song production depends on a finely balanced network of HVC neurons, with all classes contributing to the overall dynamics. Moreover, all existing models that describe premotor sequence generation in the HVC either assume a distributed model (Elmaleh et al., 2021), in which local HVC circuitry is not sufficient to advance the sequence, which instead depends on moment-to-moment feedback through Uva (Hamaguchi et al., 2016), or assume models that rely on intrinsic connections within HVC to propagate sequential activity. In the latter case, some models assume that HVC is composed of multiple discrete subnetworks that encode individual song elements (Glaze & Troyer, 2013; Long & Fee, 2008; Wang et al., 2008), but lack the local connectivity to link the subnetworks, while other models assume that HVC may have sufficient information in its intrinsic connections to form a single continuous network sequence (Long et al. 2010). The HVC model we present extends the concept of a feedforward network by incorporating additional neuronal classes that influence the propagation of activity (interneurons and HVC<sub>X</sub> neurons). 
We have shown that any disturbance of the intrinsic or synaptic conductances of these latter neurons will disrupt activity in the circuit even when HVC<sub>RA</sub> neuron properties are maintained. 

      In regard to the similarities between our model and earlier models, several aspects of our model distinguish it from prior work. In short, while several models of how sequence is generated within HVC have been proposed (Cannon et al., 2015; Drew & Abbott, 2003; Egger et al., 2020; Elmaleh et al., 2021; Galvis et al., 2018; Gibb et al., 2009a, 2009b; Hamaguchi et al., 2016; Jin, 2009; Long & Fee, 2008; Markowitz et al., 2015), all the models proposed either rely on intrinsic HVC circuitry to propagate sequential activity, rely on extrinsic feedback to advance the sequence, or rely on both. These models do not capture the complex details of spike morphology, do not include the right ionic currents, do not incorporate all classes of HVC neurons, or do not generate realistic firing patterns as seen in vivo. Our model is the first biophysically realistic model that incorporates all classes of HVC neurons and their intrinsic properties. We tuned the intrinsic and the synaptic properties based on the traces collected by Daou et al. (2013) and Mooney and Prather (2005) as shown in Figure 3. The three classes of model neurons incorporated into our network as well as the synaptic currents that connect them are based on Hodgkin-Huxley formalisms that contain ion channels and synaptic currents which had been pharmacologically identified. This is an advancement over prior models that primarily focused on the role of synaptic interactions or external inputs. The model is based on a feedforward chain of microcircuits that encode the different sub-syllabic segments and that interact with each other through structured feedback inhibition, defining an ordered sequence of cell firing. 
Moreover, while several models highlight the critical role of inhibitory interneurons in shaping the timing and propagation of bursts of activity in HVC<sub>RA</sub> neurons, our work offers an intricate and comprehensive model that helps explain this critical role played by inhibition in shaping song dynamics and ensuring sequence propagation.

      How useful is this concept of micro-circuits? HVC neurons fire continuously even during the silent gaps. There are no SSS during these silent gaps.

      Regarding the concern about the usefulness of the 'microcircuit' concept in our study, we appreciate the comment and we are glad to clarify its relevance in our network. While we acknowledge that HVC<sub>RA</sub> neurons interconnect microcircuits, our model's dynamics are still best described within the framework of microcircuitry particularly due to the firing behavior of HVC<sub>X</sub> neurons and interneurons. Here, we are referring to microcircuits in a more functional sense, rather than rigid, isolated spatial divisions (Cannon et al. 2015), and we now make this clear on page 21. A microcircuit in our model reflects the local rules that govern the interaction between all HVC neuron classes within the broader network, and that are essential for proper activity propagation. For example, HVC<sub>INT</sub> neurons belonging to any microcircuit burst densely and at times other than the moments when the corresponding encoded SSS is being “sung”. What makes a particular interneuron belong to this microcircuit or the other is merely the fact that it cannot inhibit HVC<sub>RA</sub> neurons that are housed in the microcircuit it belongs to. In particular, if HVC<sub>INT</sub> inhibits HVC<sub>RA</sub> in the same microcircuit, some of the HVC<sub>RA</sub> bursts in the microcircuit might be silenced by the dense and strong HVC<sub>INT</sub> inhibition breaking the chain of activity again. Similarly, HVC<sub>X</sub> neurons were selected to be housed within microcircuits due to the following reason: if an HVC<sub>X</sub> neuron belonging to microcircuit i sends excitatory input to an HVC<sub>INT</sub> neuron in microcircuit j, and that interneuron happens to select an HVC<sub>RA</sub> neuron from microcircuit i, then the propagation of sequential activity will halt, and we’ll be in a scenario similar to what was described earlier for HVC<sub>INT</sub> neurons inhibiting HVC<sub>RA</sub> neurons in the same microcircuit.

      We agree that there are no sub-syllabic segments described during the silent gaps and we thank the reviewer for pointing this out. Although silent gaps are integral to the overall process of song production, we have not elaborated on them in this model due to the lack of a clear, biophysically grounded representation for the gaps themselves at the level of HVC. Our primary focus has been on modeling the active, syllable-producing phases of the song, where the HVC network’s sequential dynamics are critical for song. However, one can think of the encoding of silent gaps via mechanisms similar to those that encode SSSs, where each gap is encoded by similar microcircuits comprised of the three classes of HVC neurons (let’s call them GAP rather than SSS) that are active only during the silent gaps. In this case, the propagation of sequential activity is carried throughout the GAPs from the last SSS of the previous syllable to the first SSS of the subsequent syllable. This is now described more clearly on page 22 of the manuscript.

      A significant issue of the current model is that the HVC_RA to HVC_RA connections require fine-tuning, with the network functioning only within a narrow range of g_AMPA (Figure 2B). Similarly, the connections from HVC_I neurons to HVC_RA neurons also require fine-tuning. This sensitivity arises because the somatic properties of HVC_RA neurons are insufficient to produce the stereotypical bursts of spikes observed in recordings from singing birds, as demonstrated in previous studies (Jin et al 2007; Long et al 2010). In these previous works, to address this limitation, a dendritic spike mechanism was introduced to generate an intrinsic bursting capability, which is absent in the somatic compartment of HVC_RA neurons. This dendritic mechanism significantly enhances the robustness of the chain network, eliminating the need to fine-tune any synaptic conductances, including those from HVC_I neurons (Long et al 2010). Why is it important that the model should NOT be sensitive to the connection strengths?

      We thank the reviewer for the comment. While mathematical models of highly complex nonlinear biological processes only approximate biological reality, the current network is the first biologically realistic-enough network model designed for HVC that explains sequence propagation. We do not include dendritic processes in our network, although doing so would make the dynamics more realistic, for several reasons. 1) The ion channels we integrated into the somatic compartment are known pharmacologically (Daou et al. 2013), but we do not know the intrinsic properties of the dendritic compartment of HVC neurons or the cocktail of ion channels expressed there. 2) We are able to generate realistic bursting in HVC<sub>RA</sub> neurons despite the single compartment, and the main emphasis in this network is on the interactions between excitation and inhibition, the effects of ion channels in modulating sequence propagation, etc. 3) The network model already incorporates thousands of ODEs that govern the dynamics of each of the HVC neurons, so we did not want to add more complexity to the network, especially since we do not know the biophysical properties of the dendritic compartments.

      Therefore, our present focus is on somatic dynamics and the interaction between HVC<sub>RA</sub> and HVC<sub>INT</sub> neurons, but we acknowledge the importance of these processes in enhancing network resiliency. Although we agree that adding dendritic processes improves robustness, we still think that somatic processes alone can offer insightful information on the sequential dynamics of the HVC network. While the network should be robust across a wide range of parameters, it is also essential that certain parameters are designed to filter out weaker signals, ensuring that only reliable, precise patterns of activity propagate. Hence, we specifically chose to make the HVC<sub>RA</sub>-to-HVC<sub>RA</sub> excitatory connections more sensitive (narrow range of values) such that only strong, precise and meaningful stimuli can propagate through the network representing the high stereotypy and precision seen in song production.

      First, the firing of HVC_I neurons is highly noisy and unreliable. HVC_I neurons fire spontaneous, random spikes under baseline conditions. During singing, their spike timing is imprecise and can vary significantly from trial to trial, with spikes appearing or disappearing across different trials. As a result, their inputs to HVC_RA neurons are inherently noisy. If the model relies on precisely tuned inputs from HVC_I neurons, the natural fluctuations in HVC_I firing would render the model non-functional. The authors should incorporate noisy HVC_I neurons into their model to evaluate whether this noise would render the model non-functional.

      We acknowledge that under baseline and singing conditions, interneurons fire in an extremely noisy and imprecise manner, although they exhibit time-locked episodes in their activity (Hahnloser et al 2002, Kozhevnikov and Fee 2007). In order to mimic the biological variability of these neurons, our model does, in fact, include a stochastic current to reflect the intrinsic noise and random variations in interneuron firing shown in vivo (and we highlight this in the Methods). To make sure the network is resilient to this randomness in interneuron firing, we introduced a stochastic input current of the form I<sub>noise</sub>(t) = σ·ξ(t), where ξ(t) is Gaussian white noise with zero mean and unit variance, and σ is the noise amplitude. This stochastic drive was introduced to every model neuron, and it mimics the fluctuations in synaptic input arising from random presynaptic activity and background noise. For values of σ within 1-5% of the mean synaptic conductance, the stochastic current has no effect on network propagation. For larger values of σ, the desired network activity was disrupted or halted. We now discuss this on page 22 of the manuscript.
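
      A minimal sketch of how a current of the form I<sub>noise</sub>(t) = σ·ξ(t) might be sampled on a fixed time grid is given below. The 1/√dt scaling is the usual way to discretize white noise for Euler-Maruyama-type integration and is an assumption here, since the response does not state the discretization used; the function name `noise_current` and the numerical values are likewise hypothetical.

```python
import math
import random

def noise_current(sigma, dt, n_steps, seed=0):
    """Sample a discretized white-noise current I_noise = sigma * xi(t).

    Each sample is sigma * N(0, 1) / sqrt(dt), so that the time integral
    of the current over one step has standard deviation sigma * sqrt(dt).
    This discretization is an assumption, not the authors' stated scheme.
    """
    rng = random.Random(seed)
    scale = sigma / math.sqrt(dt)
    return [scale * rng.gauss(0.0, 1.0) for _ in range(n_steps)]

if __name__ == "__main__":
    # Illustrative values only: sigma and dt are not from the manuscript.
    samples = noise_current(sigma=0.02, dt=0.01, n_steps=20000)
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    print(f"mean={mean:.4f}, std={std:.4f}")
```

      Sweeping `sigma` upward with such a generator and recording where sequence propagation first fails would give the kind of tolerance range quoted above.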

      Second, Kosche et al. (2015) demonstrated that reducing inhibition by suppressing HVC_I neuron activity makes HVC_RA firing less sparse but does not compromise the temporal precision of the bursts. In this experiment, the local application of gabazine should have severely disrupted HVC_I activity. However, it did not affect the timing precision of HVC_RA neuron firing, emphasizing the robustness of the HVC timing circuit. This robustness is inconsistent with the predictions of the current model, which depends on finely tuned inputs and should, therefore, be vulnerable to such disruptions.

      We thank the reviewer for the comment. The differences between the Kosche et al. (2015) findings and the predictions of our model arise from differences in the aspect of HVC function we are modeling. Our model is more sensitive to inhibition, which is a designed mechanism for achieving precise song patterning. This is a modeling simplification we adopted to capture specific characteristics of HVC function. Hence, the Kosche et al. (2015) findings do not invalidate the approach of our model, but highlight that HVC likely operates with several redundant mechanisms that together ensure temporal precision.

      Third, the reliance on fine-tuning of HVC_RA connections becomes problematic if the model is scaled up to include groups of HVC_RA neurons forming a chain network, rather than the single HVC_RA neurons used in the current work. With groups of HVC_RA neurons, the summation of presynaptic inputs to each HVC_RA neuron would need to be precisely maintained for the model to function. However, experimental evidence shows that the HVC circuit remains functional despite perturbations, such as a few degrees of cooling, micro-lesions, or turnover of HVC_RA neurons. Such robustness cannot be accounted for by a model that depends on finely tuned connections, as seen in the current implementation.

      Our model of individual HVC<sub>RA</sub> neurons is, as stated previously, a reductive model that focuses on understanding the mechanisms that govern sequential neural activity. We agree that scaling the model to include many HVC<sub>RA</sub> neurons poses challenges, specifically concerning the summation of presynaptic inputs. However, our model can still be adapted to a larger network without requiring the level of fine-tuning currently needed. In fact, the current fine-tuning of synaptic connections in the model is a reflection of fundamental network mechanisms rather than a limitation when scaling to a larger network. In addition, one important feature of this neural network is redundancy. Even if some neurons or synaptic connections are impaired, other neurons or pathways can compensate for these changes, allowing the activity propagation to remain intact.

      The authors examined how altering the channel properties of neurons affects the activity in their model. While this approach is valid, many of the observed effects may stem from the delicate balancing required in their model for proper function. In the current model, HVC_X neurons burst as a result of rebound activity driven by the I_H current. Rebound bursts mediated by the I_H current typically require a highly hyperpolarized membrane potential. However, this mechanism would fail if the reversal potential of inhibition is higher than the required level of hyperpolarization. Furthermore, Mooney (2000) demonstrated that depolarizing the membrane potential of HVC_X neurons did not prevent bursts of these neurons during forward playback of the bird's own song, suggesting that these bursts (at least under anesthesia, which may be a different state altogether) are not necessarily caused by rebound activity. This discrepancy should be addressed or considered in the model.

      In our HVC network model, one goal with HVC<sub>X</sub> neurons is to generate bursts in their underlying neuron population. Since HVC<sub>X</sub> neurons in our model receive only inhibitory inputs from interneurons, we rely on inhibition followed by rebound bursts orchestrated by the I<sub>H</sub> and the I<sub>CaT</sub> currents to achieve this goal. The interplay between the T-type Ca<sup>++</sup> current and the H current in our model is fundamental to generating these bursts, as the two currents are sufficient for producing the desired behavior in the network. Due to this interplay, we do not need significant inhibition to generate rebound bursts, because the T-type Ca<sup>++</sup> current’s conductance can be made stronger, leading to robust rebound bursting even when the degree of inhibition is not very strong. This is now highlighted on page 42 in the revised version.

      Some figures contain direct copies of figures from published papers. It is perhaps a better practice to replace them with schematics if possible.

      We deliberately kept the results of Mooney and Prather (2005) as is, in order to compare them with our model simulations and highlight the degree of resemblance. We believe that creating schematics of the Mooney and Prather (2005) results will not have the same impact; similarly, creating a schematic for the Hahnloser et al (2002) results won’t help much. However, if the reviewer still believes that we should do that, we’re happy to do it.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors use numerical simulations to try to understand better a major experimental discovery in songbird neuroscience from 2002 by Richard Hahnloser and collaborators. The 2002 paper found that a certain class of projection neurons in the premotor nucleus HVC of adult male zebra finch songbirds, the neurons that project to another premotor nucleus RA, fired sparsely (once per song motif) and precisely (to about 1 ms accuracy) during singing.

      The experimental discovery is important to understand since it initially suggested that the sparsely firing RA-projecting neurons acted as a simple clock that was localized to HVC and that controlled all details of the temporal hierarchy of singing: notes, syllables, gaps, and motifs. Later experiments suggested that the initial interpretation might be incomplete: that the temporal structure of adult male zebra finch songs instead emerged in a more complicated and distributed way, still not well understood, from the interaction of HVC with multiple other nuclei, including auditory and brainstem areas. So at least two major questions remain unanswered more than two decades after the 2002 experiment: What is the neurobiological mechanism that produces the sparse precise bursting: is it a local circuit in HVC or is it some combination of external input to HVC and local circuitry? And how is the sparse precise bursting in HVC related to a songbird's vocalizations? The authors only investigate part of the first question, whether the mechanism for sparse precise bursts is local to HVC. They do so indirectly, by using conductance-based Hodgkin-Huxley-like equations to simulate the spiking dynamics of a simplified network that includes three known major classes of HVC neurons and such that all neurons within a class are assumed to be identical. A strength of the calculations is that the authors include known biophysically deduced details of the different conductances of the three major classes of HVC neurons, and they take into account what is known, based on sparse paired recordings in slices, about how the three classes connect to one another. One weakness of the paper is that the authors make arbitrary and not well-motivated assumptions about the network geometry, and they do not use the flexibility of their simulations to study how their results depend on their network assumptions. 
      A second weakness is that they ignore many known experimental details such as projections into HVC from other nuclei, dendritic computations (the somas and dendrites are treated by the authors as point-like isopotential objects), the role of neuromodulators, and known heterogeneity of the interneurons. These weaknesses make it difficult for readers to know the relevance of the simulations for experiments and for advancing theoretical understanding.

      Strengths:

      The authors use conductance-based Hodgkin-Huxley-like equations to simulate spiking activity in a network of neurons intended to model more accurately songbird nucleus HVC of adult male zebra finches. Spiking models are much closer to experiments than models based on firing rates or on 2-state neurons.

      The authors include information deduced from modeling experimental current-clamp data such as the types and properties of conductances. They also take into account how neurons in one class connect to neurons in other classes via excitatory or inhibitory synapses, based on sparse paired recordings in slices by other researchers. The authors obtain some new results of modest interest such as how changes in the maximum conductances of four key channels (e.g., A-type K+ currents or Ca-dependent K+ currents) influence the structure and propagation of bursts, while simultaneously being able to mimic accurately current-clamp voltage measurements.

      Weaknesses:

      One weakness of this paper is the lack of a clearly stated, interesting, and relevant scientific question to try to answer. In the introduction, the authors do not discuss adequately which questions recent experimental and theoretical work have failed to explain adequately, concerning HVC neural dynamics and its role in producing vocalizations. The authors do not discuss adequately why they chose the approach of their paper and how their results address some of these questions.

      For example, the authors need to explain in more detail how their calculations relate to the works of Daou et al, J. Neurophys. 2013 (which already fitted spiking models to neuronal data and identified certain conductances), to Jin et al J. Comput. Neurosci. 2007 (which already discussed how to get bursts using some experimental details), and to the rather similar paper by E. Armstrong and H. Abarbanel, J. Neurophys 2016, which already postulated and studied sequences of microcircuits in HVC. This last paper is not even cited by the authors.

      We thank the reviewer for this valuable comment, and we agree that we did not clarify enough throughout the paper the utility of our model or how it advanced our understanding of the HVC dynamics and circuitry. To that end, we revised several places of the manuscript and made sure to cite and highlight the relevance and relatedness of the mentioned papers.

      In short, and as mentioned to Reviewer 1, while several models of how sequence is generated within HVC have been proposed (Cannon et al., 2015; Drew & Abbott, 2003; Egger et al., 2020; Elmaleh et al., 2021; Galvis et al., 2018; Gibb et al., 2009a, 2009b; Hamaguchi et al., 2016; Jin, 2009; Long & Fee, 2008; Markowitz et al., 2015; Jin et al., 2007), all the models proposed either rely on intrinsic HVC circuitry to propagate sequential activity, rely on extrinsic feedback to advance the sequence or rely on both. These models do not capture the complex details of spike morphology, do not include the right ionic currents, do not incorporate all classes of HVC neurons, or do not generate realistic firing patterns as seen in vivo. Our model is the first biophysically realistic model that incorporates all classes of HVC neurons and their intrinsic properties. 

      No existing hypothesis has been challenged by our model; rather, our model is a distillation of the various models that have been proposed for the HVC network. We go over this in detail in the Discussion. We believe that the network model we developed provides a step forward in describing the biophysics of HVC circuitry, and may shed new light on certain dynamics in the mammalian brain, particularly in the motor cortex and hippocampus, where precisely timed sequential activity is crucial. We suggest that temporally precise sequential activity may be a manifestation of neural networks composed of a chain of microcircuits, each containing pools of excitatory and inhibitory neurons, with local interplay among neurons of the same microcircuit and global interplay across the various microcircuits, and with structured inhibition as well as intrinsic properties synchronizing the neuronal pools and stabilizing timing within a firing sequence.

      The authors' main achievement is to show that simulations of a certain simplified and idealized network of spiking neurons, which includes some experimental details but ignores many others, match some experimental results like current-clamp-derived voltage time series for the three classes of HVC neurons (although this was already reported in earlier work by Daou and collaborators in 2013), and simultaneously the robust propagation of bursts with properties similar to those observed in experiments. The authors also present results about how certain neuronal details and burst propagation change when certain key maximum conductances are varied. However, these are weak conclusions for two reasons. First, the authors did not do enough calculations to allow the reader to understand how many parameters were needed to obtain these fits and whether simpler circuits, say with fewer parameters and simpler network topology, could do just as well. Second, many previous researchers have demonstrated robust burst propagation in a variety of feed-forward models. So what is new and important about the authors' results compared to the previous computational papers?

      A major novelty of our work is the incorporation of experimental data into detailed network models. While earlier works have established robust burst propagation, our model uses realistic ion channel kinetics and feedback inhibition not only to reproduce experimental neural activity patterns but also to suggest prospective mechanisms for song sequence production in the most biophysical way possible. This aspect distinguishes our work from other feed-forward models. We go over this in detail in the Discussion. However, the reviewer is right regarding the details of the calculations conducted for the fits, and we will make sure to describe these in more detail in the Methods and throughout the manuscript.

      Also missing is a discussion, or at least an acknowledgment, of the fact that not all of the fine experimental details of undershoots, latencies, spike structure, spike accommodation, etc may be relevant for understanding vocalization. While it is nice to know that some models can match these experimental details and produce realistic bursts, that does not mean that all of these details are relevant for the function of producing precise vocalizations. Scientific insights in biology often require exploring which of the many observed details can be ignored and especially identifying the few that are essential for answering some questions. As one example, if HVC-X neurons are completely removed from the authors' model, does one still get robust and reasonable burst propagation of HVC-RA neurons? While part of the nucleus HVC acts as a premotor circuit that drives the nucleus RA, part of HVC is also related to learning. It is not clear that HVC-X neurons, which carry out some unknown calculation and transmit information to area X in a learning pathway, are relevant for burst production and propagation of HVCRA neurons, and so relevant for vocalization. Simulations provide a convenient and direct way to explore questions of this kind.

      One key question to answer is whether the bursting of HVC-RA projection neurons is based on a mechanism local to HVC or is some combination of external driving (say from auditory nuclei) and local circuitry. The authors do not contribute to answering this question because they ignore external driving and assume that the mechanism is some kind of intrinsic feed-forward circuit, which they put in by hand in a rather arbitrary and poorly justified way, by assuming the existence of small microcircuits consisting of a few HVC-RA, HVC-X, and HVC-I neurons that somehow correspond to "sub-syllabic segments". To my knowledge, experiments do not suggest the existence of such microcircuits nor does theory suggest the need for such microcircuits. 

      Recent results showed a tight correlation between the intrinsic properties of neurons and features of song (Daou and Margoliash 2020, Medina and Margoliash 2024), where adult birds that exhibit similar songs tend to have similar intrinsic properties. While this is relevant, we acknowledge that not all details may be necessary for every aspect of vocalization, and future models could simply concentrate on core dynamics and exclude certain features while still providing insights into the primary mechanisms.

      Regarding the question of whether HVC<sub>X</sub> neurons are relevant for burst propagation, given that our model includes these neurons as part of the network for completeness: the reviewer is correct that the propagation of sequential activity in this model is primarily carried by HVC<sub>RA</sub> neurons in a feed-forward manner, but only if there is no perturbation to the HVC network. For example, we have shown how altering the intrinsic properties of HVC<sub>X</sub> neurons or of interneurons disrupts sequence propagation. In other words, while HVC<sub>RA</sub> neurons are the key forces carrying the chain forward, the interplay between excitation and inhibition in our network as well as the intrinsic parameters for all classes of HVC neurons are equally important forces in carrying the chain of activity forward. Thus, the stability of activity propagation necessary for song production depends on a finely balanced network of HVC neurons, with all classes contributing to the overall dynamics.

      We agree with the reviewer, however, that a potential drawback of our model is that its sole focus is on local excitatory connectivity within HVC (Kornfeld et al., 2017; Long et al., 2010), while HVC neurons receive afferent excitatory connections (Akutagawa & Konishi, 2010; Nottebohm et al., 1982) that play significant roles in their local dynamics. For example, the excitatory inputs that HVC neurons receive from Uvaeformis may be crucial in initiating (Andalman et al., 2011; Danish et al., 2017; Galvis et al., 2018) or sustaining (Hamaguchi et al., 2016) the sequential activity. While we acknowledge this limitation, our main contribution in this work is the biophysical insight into how the patterning activity in HVC is largely shaped by the intrinsic properties of the individual neurons as well as the synaptic properties, where excitation and inhibition play a major role in enabling neurons to generate their characteristic bursts during singing. This holds irrespective of whether an external drive is injected onto the microcircuits or not. We elaborated on this further in the Discussion of the revised version.

      Another weakness of this paper is an unsatisfactory discussion of how the model was obtained, validated, and simulated. The authors should state as clearly as possible, in one location such as an appendix, what is the total number of independent parameters for the entire network and how parameter values were deduced from data or assigned by hand. With enough parameters and variables, many details can be fit arbitrarily accurately so researchers have to be careful to avoid overfitting. If parameter values were obtained by fitting to data, the authors should state clearly what the fitting algorithm was (some iterative nonlinear method, whose results can depend on the initial choice of parameters), what the error function used for fitting (sum of least squares?) was, and what data were used for the fitting.

      The authors should also state clearly the dynamical state of the network, the vector of quantities that evolve over time. (What is the dimension of that vector, which is also the number of ordinary differential equations that have to be integrated?) The authors do not mention what initial state was used to start the numerical integrations, whether transient dynamics were observed and what were their properties, or how the results depended on the choice of the initial state. The authors do not discuss how they determined that their model was programmed correctly (it is difficult to avoid typing errors when writing several pages or more of a code in any language) or how they determined the accuracy of the numerical integration method beyond fitting to experimental data, say by varying the time step size over some range or by comparing two different integration algorithms.

      We thank the reviewer again. The fitting process in our model occurred only at the first stage, where the synaptic parameters were fit to the Mooney and Prather as well as the Kosche results. No data were shared with us; we merely examined the figures in those papers, checked the amplitudes of the elicited currents, the magnitudes of DC-evoked excitations, etc., and replicated those in our model. While this is suboptimal, it was better for us to start with it rather than simply taking equations for synaptic currents from the literature on other types of neurons (that are not even HVC neurons or from the songbird) and integrating them into our network model. The number of ODEs that govern the dynamics of every model neuron is listed on page 10 of the manuscript as well as in the Appendix. Moreover, we highlighted the details of this fitting process in the revised version.
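
      The time-step check raised by the reviewer (integrating with two step sizes and comparing the results) can be illustrated on a stand-in equation. The sketch below applies forward Euler to the single linear ODE dV/dt = -V/τ, whose exact solution is known; it is not the authors' network or integrator, and the function name `euler_decay` and all parameter values are hypothetical.

```python
import math

def euler_decay(v0, tau, dt, t_end):
    """Integrate dV/dt = -V / tau with forward Euler and return V(t_end)."""
    n_steps = round(t_end / dt)
    v = v0
    for _ in range(n_steps):
        v += dt * (-v / tau)
    return v

if __name__ == "__main__":
    v0, tau, t_end = 1.0, 1.0, 1.0           # arbitrary illustrative units
    exact = v0 * math.exp(-t_end / tau)
    err_coarse = abs(euler_decay(v0, tau, 0.01, t_end) - exact)
    err_fine = abs(euler_decay(v0, tau, 0.005, t_end) - exact)
    # For a first-order method, halving dt should roughly halve the error.
    print(f"coarse={err_coarse:.6f}, fine={err_fine:.6f}, "
          f"ratio={err_coarse / err_fine:.2f}")
```

      Applied to the full network, the same comparison would be made on burst times or voltage traces rather than on a closed-form solution, since none is available there.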

      Also disappointing is that the authors do not make any predictions to test, except rather weak ones such as that varying a maximum conductance sufficiently (which might be possible by using dynamic clamps) might cause burst propagation to stop or change its properties. Based on their results, the authors do not make suggestions for further experiments or calculations, but they should.

      We agree that making experimentally testable predictions is crucial for the advancement of the model. Our predictions include testing whether eliminating a class of neurons such as HVC<sub>X</sub> neurons disrupts activity propagation, which can be done through targeted neuron elimination. This can also be done by preventing rebound bursting in HVC<sub>X</sub> neurons through pharmacological block of the I<sub>H</sub> channels. Other predictions include down-regulating certain ion channels (pharmacologically, through ion channel blockers) and testing which current is fundamental for song production (there are plenty of tests based on our results, involving the SK current, the T-type Ca<sup>2+</sup> current, the A-type K<sup>+</sup> current, etc.). We incorporated these into the Discussion of the revised manuscript to better demonstrate the model's applicability and to guide future research directions.

      Main issues:

      (1) Parameters are overly fine-tuned and often do not match known biology to generate chains. This fine-tuning does not reveal fundamental insights.

      (1a) Specific conductances (e.g. AMPA) are finely tweaked to generate bursts, in part due to a lack of a dendritic mechanism for burst generation. A dendritic mechanism likely reflects the true biology of HVC neurons.

      We acknowledge that the model does not include active dendritic processes and we do not regard this as a limitation. In fact, our present approach, although simplified, is intended to focus on somatic mechanisms to identify minimal conditions required for stable sequential propagation. We know HVC<sub>RA</sub> neurons possess thin, spiny dendrites which can contribute to burst initiation and shaping. Future models that include such nonlinear dendritic mechanisms would likely reduce the need for fine tuning of specific conductances at the soma and consequently better match the known biology of HVC<sub>RA</sub> neurons. 

      In text: “While our simplified, somatically driven architecture enables better exploration of mechanisms for sequence propagation, future extensions of the model will incorporate dendritic compartments to more accurately reflect the intrinsic bursting mechanisms observed in HVC<sub>RA</sub> neurons.”

      (1b) In this paper, microcircuits are simulated and then concatenated to make the HVC chain, resulting in no representations during silent gaps. This is out of touch with the known HVC function. There is no anatomical nor functional evidence for microcircuits of the kind discussed in this paper or in the earlier and rather similar paper by Eve Armstrong and Henry Abarbanel (J. Neurophy 2016). One can write a large number of papers in which one makes arbitrary unconstrained guesses of network structure in HVC and, unless they reveal some novel principle or surprising detail, they are all going to be weak.

      Although the model is composed of sequentially activated microcircuits, the gaps between each microcircuit’s output do not represent complete silence in the network. During these periods, other neurons such as those in other microcircuits may still exhibit bursting activity. Thus, what may appear as a 'silent gap' from the perspective of a given output microcircuit is, in fact, part of the ongoing background dynamics of the larger HVC neuron network. We fully acknowledge the reviewer's point that there is no direct anatomical or physiological evidence supporting the presence of microcircuits with this structure in HVC. Our intention was not to propose the existence of such a physical model but to use it as a computational simplification to make precise sequential bursting activity feasible given the biologically realistic neuronal dynamics used. Hence, our use of 'microcircuits' refers to a modeling construct rather than a structural hypothesis. Even if the network topology is hypothetical, we still believe that the temporal structuring suggested allows us to generate specific predictions for future work about burst timing and neuronal connections.

      (1c) HVC interneuron discharge in the author's model is overly precise; addressing the observation that these neurons can exhibit noisy discharge. Real HVC interneurons are noisy. This issue is critical: All reviewers strongly recommend that the authors should, at the minimum in a revision, focus on incorporating HVC-I noise in their model.

      We agree that capturing the variability in interneuron bursting is critical for biological realism. In our model, HVC interneurons receive stochastic background current that introduces variability in their firing patterns as observed in vivo. This variability is seen in our simulations and produces more biologically realistic dynamics while maintaining sequence propagation. We clarify this implementation in the Methods section. 
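
      As a rough illustration of how a stochastic background current produces spike-time variability, the sketch below uses a leaky integrate-and-fire neuron rather than the conductance-based interneuron model of the manuscript. The function name `simulate_lif`, all parameter values, and the Gaussian white-noise form of the current are assumptions made for this example only.

```python
import random

def simulate_lif(noise_sd, seed, t_max_ms=200.0, dt_ms=0.1):
    """Leaky integrate-and-fire neuron with additive Gaussian noise.

    Hypothetical stand-in for a noisy interneuron; every value below is
    illustrative and not taken from the manuscript.
    noise_sd is in mV per sqrt(ms); returns spike times in ms.
    """
    rng = random.Random(seed)
    tau_m, v_rest, v_thresh, v_reset = 10.0, -65.0, -50.0, -65.0
    drive = 20.0  # constant suprathreshold drive, expressed in mV
    v = v_rest
    spikes = []
    for k in range(int(t_max_ms / dt_ms)):
        # Euler-Maruyama step: deterministic leak plus a noise increment
        noise = noise_sd * rng.gauss(0.0, 1.0) * dt_ms ** 0.5
        v += dt_ms * (-(v - v_rest) + drive) / tau_m + noise
        if v >= v_thresh:
            spikes.append(k * dt_ms)
            v = v_reset
    return spikes

regular = simulate_lif(noise_sd=0.0, seed=0)  # clock-like firing
noisy = simulate_lif(noise_sd=1.0, seed=0)    # jittered firing
```

      With `noise_sd=0.0` the interspike intervals are identical, whereas any positive value jitters the spike times from trial to trial while the mean drive still sustains firing.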

      (1d) Address the finding that Kosche et al show that even with reduced inhibition, HVCra neuronal timing is preserved; it is the burst pattern that is affected.

      The differences between the Kosche et al. (2015) findings and the predictions of our model arise from differences in the aspect of HVC function we are modeling. Our model is more sensitive to inhibition, which is a designed mechanism for achieving precise song patterning. This is a modeling simplification we adopted to capture specific characteristics of HVC function. 

      We acknowledged this point in the discussion: “While findings of Kosche et al. (2015) emphasize the robustness of the HVC timing circuit to inhibition, our model is more sensitive to inhibition, highlighting that HVC likely operates with several, redundant mechanisms that overall ensure temporal precision.”

      (1e) The real HVC is robust to microlesions, cooling, and HVCra neuron turnover. The model in this paper relies on precise HVCra connectivity and is not robust.

      Although our model is grounded in the biologically observed behavior of HVC neurons in vivo, we don’t claim that it fully captures the resilience seen in the HVC network. Instead, we see this as a simplified framework that helps us explore the basic principles of sequential activity. In the future, adding features like recurrent excitation, synaptic plasticity, or homeostatic mechanisms could make the model more robust.

      (1f) There is unclear motivation for Ih-driven HVCx bursting, given past findings from the Mooney group.

      Daou et al. (2013) noticed that the sag observed in HVC<sub>X</sub> and HVC<sub>INT</sub> neurons in response to hyperpolarizing current pulses (Dutar et al. 1998; Kubota and Saito 1991; Kubota and Taniguchi 1998) was completely abolished after application of the drug ZD 7288 in all of the neurons tested, indicating that the sag in these HVC neurons is due to the hyperpolarization-activated inward current (I<sub>h</sub>). In addition, the sag and the rebound seen in these two neuron groups grew larger with larger hyperpolarizing current pulses.

      (1g) The initial conditions of the network and its activity under those conditions, as well as the possible reliance on external inputs, are not defined.

      In our model, network activity is initiated through a brief, stochastic excitatory input to a small HVC<sub>RA</sub> neuron of one microcircuit. This drive represents a simplified version of external input from upstream brain regions known to project to HVC, including nuclei in the high vocal center's auditory pathways such as Nif and Uva. Modeling the activity of these upstream regions and their influence on HVC dynamics is ongoing work to be published in the future.

      (1h) It has been known from the time of Hodgkin and Huxley how to include temperature dependences for neuronal dynamics so another suggestion is for the authors to add such dependences for the three classes of neurons and see if their simulation causes burst frequencies to speed up or slow down as T is varied.

      We added this as a limitation to the discussion section: “Our model was run at a fixed physiological temperature, but it's well known going all the way back to Hodgkin and Huxley that both ion channel activity and synaptic dynamics can change with temperature. In future work, adding temperature scaling (like Q10 factors) could help us explore how burst timing and sequence speed change with temperature changes, and how neural activity in HVC would/would not preserve its precision under different physiological conditions.”
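
      For readers unfamiliar with Q10 scaling, a minimal sketch follows. It applies the standard factor Q10^((T − T_ref)/10) to a gating time constant. The function names, the reference temperature of 40 °C, and the Q10 value of 3 are assumptions chosen for illustration, not parameters of our model.

```python
def q10_factor(temp_c, ref_temp_c, q10):
    """Rate multiplier phi = q10 ** ((T - T_ref) / 10)."""
    return q10 ** ((temp_c - ref_temp_c) / 10.0)

def scale_time_constant(tau_ref_ms, temp_c, ref_temp_c=40.0, q10=3.0):
    """Gating time constant at temp_c, given its value at ref_temp_c.

    Rates are multiplied by phi, so time constants are divided by it.
    The default reference temperature and Q10 are hypothetical.
    """
    return tau_ref_ms / q10_factor(temp_c, ref_temp_c, q10)

# Cooling by 10 degrees C with Q10 = 3 makes the kinetics three times slower.
tau_cooled = scale_time_constant(10.0, temp_c=30.0)
```

      In a conductance-based model the same factor would typically multiply each gating rate, so cooling stretches burst durations and heating compresses them.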

      (2) The scope of the paper and its objectives must be clearly defined. Defining the scope and providing caveats for what is not considered will help the reader contextualize this study with other work.

      (2a) The paper does not consider the role of external inputs to HVC, which are very likely important for the capacity of the HVC chain to tile the entire song, including silent gaps.

      The role of afferent input to HVC particularly from nuclei such as Uva and Nif is critical in shaping the timing and initiation of HVC sequences throughout the song, including silent intervals. In fact, external inputs are likely involved in more than just triggering sequences, they may also influence the continuity of activity across motifs. However, in this study, we chose to focus on the intrinsic dynamics of HVC as a step toward understanding the internal mechanisms required for generating temporally precise sequences and for this reason, we used a simplified external input only to initiate activity in the chain.

      (2b) The paper does not consider important dendritic mechanisms that almost certainly facilitate the all-or-none bursting behavior of HVC projection neurons. The authors need to mention and discuss that current-clamped neuronal response - in which an electrode is inserted into the soma and then a constant current-step is applied - bypasses dendritic structure and dendritic processing and so is an incomplete way to characterize a neuron's properties. In particular, claiming to fit current-clamp data accurately and then claiming that one now has a biophysically accurate network model, as the authors do, is greatly misleading.

      While we addressed this in 1a, we do not suggest that our model is a fully accurate biophysical representation of the HVC network. Instead, we see it as a simplified framework that helps reveal how much of HVC’s sequential activity can be explained by somatic properties and synaptic interactions alone. However, additional biological mechanisms, like dendritic processing, are likely to play an important role and should be explored in future work.

      (2c) The introduction does not provide a clear motivation for the paper - what hypotheses are being tested? What is at stake in the model outcomes? It is not inherently informative to take a known biological representation and fine-tune a limited model to replicate that representation.

      We explicitly added the hypotheses to the revised introduction.

      (2d) There have been several published modeling efforts applied to the HVC chain (Seung, Fee, Long, Greenside, Jin, Margoliash, Abarbanel). These and others need to be introduced adequately, and it needs to be crystal clear what, if anything, the present study is adding to the canon.

      While several influential models have explored how HVC might generate sequences ranging from synfire chains to recurrent dynamics or externally driven sequences (e.g., Seung, Fee, Long, Greenside, Jin, Abarbanel, and others), these models could not capture the detailed dynamics observed in vivo. Our aim was to bridge a gap in the modeling literature by exploring how far biophysically grounded intrinsic properties and experimentally supported synaptic connections that are local to the HVC can alone produce temporally precise sequences. We have proven that these mechanisms are sufficient to generate these sequences, although some missing components (such as dendritic mechanisms or external inputs) might be needed to fully capture the complexity and robustness of HVC function.

      (2e) The authors mention learning prominently in the abstract, summary, and introduction but this paper has nothing to do with learning. Most or all mentions of learning should be deleted since they are misleading.

      We appreciate the reviewer’s observation. Our intent in referencing learning was not to suggest that our model directly simulates learning processes, but rather to place HVC function within the broader context of song learning and production, where temporal sequencing plays a fundamental role. Nevertheless, repeated references to learning may be misleading given that our current model does not incorporate plasticity, synaptic modification, or developmental changes. Hence, we have carefully revised the manuscript to rephrase mentions of learning unless directly relevant to context.

      (3) Using the model for hypothesis generation and prediction of experimental results.

      (3a) The utility of a model is to provide conceptual insight into how or why the real HVC functions as it does, or to predict outcomes in yet-to-be conducted experiments to help motivate future studies. This paper does not adequately achieve these goals.

      We revised the Discussion of the manuscript to better emphasize potential contributions and point out many experiments that could validate or challenge the model’s predictions. These include dynamic clamp or ion channel blockers targeting A-type K<sup>+</sup> in HVC<sub>RA</sub> neurons to assess their impact on burst precision, optogenetic disruption of inhibitory interneurons to observe changes in burst timing and sequence propagation, pharmacological modulation of I<sub>h</sub> or I<sub>CaT</sub> in HVC<sub>X</sub> and interneurons etc. 

      (3b) Additionally, it can be interesting to conduct an experiment on an existing model; for example, what happens to the HVCra chain in your model if you delete the HVCx neurons? What happens if you block NMDA receptors? Such an approach in a modeling paper can help motivate hypotheses and endow the paper with a sense of purpose.

      We agree that running targeted experiments to test our computational model such as removing an HVC neuron population or blocking a synaptic receptor can be a powerful way to generate new ideas and guide future experiments. While we didn’t include these specific tests in the current study, the model is well suited for this kind of exploration. For instance, removing interneurons could help us better understand their role in shaping the timing of HVC<sub>RA</sub> bursts. These are great directions for future experiments, and we now highlight this in the discussion as a way the model could be used to guide experiments.

      (4) Changes to the paper's organization may improve clarity.

      (4a) Nearly all equations should be moved to an Appendix so that the main part of the paper can focus on the science: assumptions made, details of simulations, conclusions obtained, and their significance. The authors present many equations without discussion which weakens the paper.

      Equations moved to appendix.

      (4b) There are many grammatical errors, e.g., verbs do not match the subject in terms of being single or plural. The authors need to run their manuscript through a grammar checker.

      Done.

      (4c) Many of the figures are poorly designed and should be substantially modified. E.g. in Figure 1B, too many colors are used, making it hard to grasp what is being plotted and the colors are not needed. Figures 1C and 1D are entire figures taken from other papers, and there is no way a reader will be able to see or appreciate all the details when this figure is published on a single page. Figure 2 uses colors for dots that are almost identical, and the colors could be avoided by using different symbols. Figure 5 fills an entire page but most of the figure conveys no information, there is no need to show the same details for all 120 neurons, just show the top 1/3 of this figure; the same for Figure 7, a lot of unnecessary information is being included. Figure 10, the bottom time series of spikes should be replaced with a time series of rates, cannot extract useful information.

      Adjusted as requested. 

      (4d) Table 1 is long and largely uninteresting, and should be moved to an appendix.

      Table 1 moved to appendix.

      (4e) Many sentences are not carefully written, which greatly weakens the paper. As one typical example, the first sentence in the Discussion section "In this study, we have designed a neural network model that describes [sic] zebra finch song production in the HVC." This is inaccurate, the model does not describe song production, it just explores some properties of one nucleus involved with song production. Just one or few sentences like this is ok but there are so many sentences of this kind that the reader loses faith in the authors.

      Thank you for raising this point. We revised the manuscript to improve the precision of the writing and replaced the first sentence of the discussion with this: "In this study, we developed a biophysically realistic neural network model to explore how intrinsic neuronal properties and local connectivity within the songbird nucleus HVC may support the generation of temporally precise activity sequences associated with zebra finch song."

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary

      The authors previously published a study of RGC boutons in the dLGN in developing wild-type mice and developing mutant mice with disrupted spontaneous activity. In the current manuscript, they have broken down their analysis of RGC boutons according to the number of Homer/Bassoon puncta associated with each vGluT2 cluster.

      The authors find that, in the first post-natal week, RGC boutons with multiple active zones (mAZs) are about a third as common as boutons with a single active zone (sAZ). The size of the vGluT2 cluster associated with each bouton was proportional to the number of active zones present in each bouton. Within the authors' ability to estimate these values (n=3 per group, 95% of results expected to be within ~2.5 standard deviations), these results are consistent across groups: 1) dominant eye vs. nondominant eye, 2) wild-type mice vs. mice with activity blocked, and 3) at ages P2, P4, and P8. The authors also found that mAZs and sAZs have roughly the same number (about 1.5) of sAZs clustered around them (within 1.5 um).

      However, the authors do not interpret this consistency between groups as evidence that active zone clustering is not a specific marker or driver of activity dependent synaptic segregation. Rather, the authors perform a large number of tests for statistical significance and cite the presence or absence of statistical significance as evidence that "Eye-specific active zone clustering underlies synaptic competition in the developing visual system (title)". I don't believe this conclusion is supported by the evidence.

      We have revised the title to be descriptive: "Eye-specific differences in active zone addition during synaptic competition in the developing visual system." While our correlative approach does not establish direct causality, our findings provide important structural evidence that complements existing functional studies of activity-dependent synaptic refinement. We have carefully revised the text throughout to avoid causal language, focusing instead on the developmental patterns we observe.

      Strengths

      The source dataset is high resolution data showing the colocalization of multiple synaptic proteins across development. Added to this data is labeling that distinguishes axons from the right eye from axons from the left eye. The first order analysis of this data showing changes in synapse density and in the occurrence of multi-active zone synapses is useful information about the development of an important model for activity dependent synaptic remodeling.

      Weaknesses

      In my previous review I argued that it was not possible to determine, from their analysis, whether the differences they were reporting between groups were important to the biology of the system. The authors have made some changes to their statistics (paired t-tests) and use some less derived measures of clustering. However, they still fail to present a meaningfully quantitative argument that the observed group differences are important. The authors base most of their claims on small differences between groups. There are two big problems with this practice. First, the differences between groups appear too small to be biologically important. Second, the differences between groups that are used as evidence for how the biology works are generally smaller than the precision of the authors' sampling. That is, the differences are as likely to be false positives as true positives.

      (1) Effect size. The title claims: "Eye-specific active zone clustering underlies synaptic competition in the developing visual system". Such a claim might be supported if the authors found that mAZs are only found in dominant-eye RGCs and that eye-specific segregation doesn't begin until some threshold of mAZ frequency is reached. Instead, the behavior of mAZs is roughly the same across all conditions. For example, the clear trend in Figure 4C and D is that measures of clustering between mAZ and sAZ are as similar as could reasonably be expected by the experimental design. However, some of the comparisons of very similar values produced p-values < 0.05. The authors use this fact to argue that the negligible differences between mAZ and sAZs explain the development of the dramatic differences in the distribution of ipsilateral and contralateral RGCs.

      We have changed the title to avoid implying a causal relationship between clustering and eye-specific segregation. Our key findings in Figures 4C and 4D demonstrate effect sizes >2.0 with high statistical power (Supplemental Table S2). While the absolute magnitude of differences is modest (5-7%), these high effect sizes combined with low inter-animal variability demonstrate consistent, reproducible biological phenomena. During development, small differences during critical periods can have profound downstream consequences for synaptic refinement outcomes.

      We acknowledge that significance in Figure 4 arises due to low variance between biological replicates rather than large mean differences. We have revised the text to describe these as "slight" differences and that "WT mice show a tendency toward forming more synapses near mAZ inputs," reflecting appropriate caution in our interpretation while maintaining the statistical robustness of our findings.
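
      The relationship between a modest mean difference and a large standardized effect can be made concrete with a paired effect size. The sketch below computes Cohen's d for paired samples (mean difference divided by the standard deviation of the differences) together with the paired t statistic. The function names and the three-animal values are hypothetical and are not measurements from our dataset.

```python
from math import sqrt
from statistics import mean, stdev

def paired_effect_size(a, b):
    """Cohen's d for paired samples: mean(diff) / sd(diff)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

def paired_t_statistic(a, b):
    """Paired t statistic: mean(diff) / (sd(diff) / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical values for three animals: a ~6% mean difference
# with very little animal-to-animal spread in that difference.
group_a = [1.06, 1.05, 1.07]
group_b = [1.00, 1.00, 1.00]
d = paired_effect_size(group_a, group_b)
t = paired_t_statistic(group_a, group_b)
```

      In this toy example the standardized effect is large only because the per-animal differences are nearly identical, which is the situation described above for Figure 4.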

      (2) Sample size. Performing a large number of significance tests and comparing p-values is not hypothesis testing and is not descriptive science. At best, with large sample sizes and controls for multiple tests, this approach could be considered exploratory. With n=3 for each group, many comparisons of many derived measures, among many groups, and no control for multiple testing, this approach constitutes a random result generator.

      The authors argue that n=3 is a large sample size for the type of high resolution / large volume data being used. It is true that many electron microscopy studies with n=1 are used to reveal the patterns of organization that are possible within an individual. However, such studies cannot control individual variation and are, therefore, not appropriate for identifying subtle differences between groups.

      In response to previous critiques along these lines, the authors argue they have dealt with this issue by limiting their analysis to within-individual paired comparisons. There are several problems with their thinking in this approach. The main problem is that they did not change the logic of their arguments, only which direction they pointed the t-tests. Instead of claiming that two groups are different because p < 0.05, they say that two groups are different because one produced p < 0.05 and the other produced p > 0.05. These arguments are not statistically valid or biologically meaningful.

      We have implemented rigorous statistical controls, applying false discovery rate (FDR) correction using the Benjamini-Hochberg method (α = 0.05) within each experimental condition (age × genotype combination). This correction strategy treats each condition as addressing a distinct experimental question: “What synaptic properties differ between left eye and right eye inputs in this specific developmental stage and genotype?” The approach appropriately controls for multiple testing while preserving power to detect biologically meaningful differences. We applied FDR correction separately to the ~20-34 measurements (varying by age and genotype) within each of the six experimental conditions, resulting in condition-specific adjusted p-values reported in updated Supplemental Table S2. This correction confirmed the robustness of our key findings. We do not base conclusions solely on comparing p-values across conditions. Our interpretations focus on effect sizes, confidence intervals, and consistent patterns within each condition, with statistical significance providing supporting evidence rather than the primary basis for biological conclusions.
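
      For reference, the Benjamini-Hochberg step-up procedure can be sketched as follows. This is a generic illustration rather than our analysis code: the function name and the example p-values are hypothetical, and in practice the function would be called once per experimental condition on that condition's set of measurements.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR adjustment.

    Returns (adjusted_p_values, rejected_flags) in the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value
    # never exceeds the adjusted value of the next larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected

# Hypothetical p-values from one condition (one age x genotype combination).
adjusted, rejected = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

      Correcting within each condition keeps the number of tests per family small, at the cost of not controlling the false discovery rate across all conditions jointly.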

      To the best of my understanding, the results are consistent with the following model:

      RGCs form mAZs at large boutons (known)

      About a quarter of week-one RGC boutons are mAZs (new observation)

      Vesicle clustering is proportional to active zone number (~new observation)

      RGC synapse density increases during the first postnatal week (known)

      Blocking activity reduces synapse density (known)

      Contralateral eye RGCs form more and larger synapses in the lateral dLGN (known)

      While mAZ formation is known in adult and juvenile dLGN, the formation of mAZ boutons during eye-specific competition represents new information with important functional implications. Synapses with multiple release sites should be stronger than single-active-zone synapses, suggesting a structural correlate for competitive advantage during refinement.

      We demonstrate distinct developmental patterns for sAZ versus mAZ contacts during the first postnatal week. Multi-active zone density favors the dominant eye, while single active-zone synapse density from the competing eye increases from P2-P4 to match dominant-eye levels. This reveals that newly formed synapses from the competing eye predominantly contain single release sites, marking P4-P8 as a critical window for understanding molecular mechanisms driving synaptic elimination.

      Our results show that altered retinal activity patterns (β2KO mice) reduce synapse density during eye-specific competition. We relied on β2 knockout mice, which retain retinal waves and spontaneous spike activity but with disrupted patterns and output levels compared to controls. We make no claims about complete activity blockade. Previous studies using different activity manipulations (epibatidine, TTX) have examined terminal morphology, but effects on synapse density during competition remain largely unknown. Achieving complete retinal activity blockade is technically challenging, making it of interest to revisit the role of activity using more precise manipulations to control spike output and relative timing.

      With n=3 and effect sizes smaller than 1 standard deviation, a statistically significant result is about as likely to be a false positive as a true positive.

      A true-positive statistically significant result is not evidence of a meaningful deviation from a biological model.

      Our conclusions are based on results with effect sizes substantially larger than 1. Key findings demonstrate effect sizes exceeding 2.0. These large effect sizes, combined with rigorous FDR correction and low inter-animal variability, provide evidence against false positive results. During critical developmental periods, consistent structural differences, even those modest in absolute magnitude, can reflect important regulatory mechanisms that influence refinement outcomes. All statistical results, effect sizes, and power analyses are reported in Supplementary Table S2, with confidence intervals in Supplementary Table S3. We have revised the text in several places where small differences are presented to reflect appropriate caution in our interpretation.

      Providing plots that show the number of active zones present in boutons across these various conditions is useful. However, I could find no compelling deviation from the above default predictions that would influence how I see the role of mAZs in activity dependent eye-specific segregation.

      Below are critiques of most of the claims of the manuscript.

      Claim (abstract): individual retinogeniculate boutons begin forming multiple nearby presynaptic active zones during the first postnatal week.

      Confirmed by data.

      Claim (abstract): the dominant-eye forms more numerous mAZ contacts,

      Misleading: The dominant-eye (by definition) forms more contacts than the nondominant eye. That includes mAZ.

      While the dominant eye forms more total contacts, the pattern depends critically on contact type and developmental stage. The dominant eye forms more mAZ contacts across all ages (Figures 2 and S1). However, for sAZ contacts, the two eyes form similar numbers at P4, with the non-dominant eye showing increased sAZ formation during this critical period. This differential pattern by synapse type represents an important aspect of how synaptic competition unfolds structurally.

      Claim (abstract): At the height of competition, the non-dominant-eye projection adds many single active zone (sAZ) synapses

      Weak: While the individual observation is strong, it is a surprising deviation based on a single n=3 experiment in a study that performed twelve such experiments (six ages, mutant/wildtype, sAZ/mAZ)

      The difference in eye-specific sAZ formation at P2 and P8 had effect sizes of ~5.3 and ~2.7 respectively (after FDR correction the difference was still significant at P2 and trending at P8). At P4, no effect was observed by paired t-test and the 5/95% confidence intervals ranged from -0.021 to 0.008 synapses/μm<sup>3</sup>. The consistency of this pattern across P2 and P8, combined with the large effect sizes, supports the reliability of this developmental finding. We report all effect sizes and power test analyses in Supplemental Table S2, and confidence intervals in Supplemental Table S3.

      Claim (abstract): Together, these findings reveal eye-specific differences in release site addition during synaptic competition in circuits essential for visual perception and behavior.

      False: This claim is unambiguously false. The above findings, even if true, do not argue for any functional significance to active zone clustering.

      Our phrasing “circuits essential for visual perception and behavior” referred to the general importance of binocular organization in the retinogeniculate system for visual processing and we did not intend to claim direct functional significance of our structural data. For clarity we have deleted the latter part of this sentence. In lines 35-37, the abstract now reads “Together, these findings reveal eye-specific differences in release site addition that correlate with axonal refinement outcomes during retinogeniculate refinement.”

      Claim (line 84): "At the peak of synaptic competition midway through the first postnatal week, the non-dominant-eye formed numerous sAZ inputs, equalizing the global synapse density between the two eyes"

      Weak: At one of twelve measures (age, bouton type, genotype) performed with 3 mice each, one density measure was about twice as high as expected.

      The difference in eye-specific sAZ formation at P2 and P8 had effect sizes of ~5.3 and ~2.7 respectively (after FDR correction the difference was still significant at P2 and trending at P8). At P4, no effect was observed by paired t-test and the 5/95% confidence intervals ranged from -0.021 to 0.008 synapses/μm<sup>3</sup>. The consistency of this pattern across P2 and P8, combined with the large effect sizes, supports the reliability of this developmental finding. We report all effect sizes and power test analyses in Supplemental Table S2, and confidence intervals in Supplemental Table S3.

      Claim (line 172): "In WT mice, both mAZ (Fig. 3A, left) and sAZ (Fig. 3B, left) inputs showed significant eye-specific volume differences at each age."

      Questionable: There appears to be a trend, but the size and consistency is unclear.

      Claim (line 175): "the median VGluT2 cluster volume in dominant-eye mAZ inputs was 3.72 fold larger than that of non-dominant-eye inputs (Fig. 3A, left)."

      Cherry picking. Twelve differences were measured with an n of 3, 3 each time. The biggest difference of the group was cited. No analysis is provided for the range of uncertainty about this measure (2.5 standard deviations) as an individual sample or as one of twelve comparisons.

      Claim (line 174): "In the middle of eye-specific competition at P4 in WT mice, the median VGluT2 cluster volume in dominant-eye mAZ inputs was 3.72 fold larger than that of non-dominant-eye inputs (Fig. 3A, left). In contrast, β2KO mice showed a smaller 1.1 fold difference at the same age (Fig. 3A, right panel). For sAZ synapses at P4, the magnitudes of eye-specific differences in VGluT2 volume were smaller: 1.35-fold in WT (Fig. 3B, left) and 0.41-fold in β2KO mice (Fig. 3B, right). Thus, both mAZ and sAZ input size favors the dominant eye, with larger eye-specific differences seen in WT mice (see Table S3)."

      No way to judge the reliability of the analysis and trivial conclusion: To analyze effect size the authors choose the median value of three measures (whatever the middle value is). They then make four comparisons at the time point where they observed the biggest difference in favor of their hypothesis. There is no way to determine how much we should trust these numbers besides spending time with the mislabeled scatter plots. The authors then claim that this analysis provides evidence that there is a difference in vGluT2 cluster volume between dominant and non-dominant RGCs and that that difference is activity dependent. The conclusion that dominant axons have bigger boutons and that mutants that lack the property that would drive segregation would show less of a difference is very consistent with the literature. Moreover, there is no context provided about what 1.35 or 1.1 fold difference means for the biology of the system.

      We focused on P4 for biological reasons rather than post-hoc selection. P4 represents the established peak of synaptic competition when eye-specific synapse densities are globally equivalent. This is a timepoint consistently highlighted throughout our manuscript and supported by previous literature. We have modified our presentation from fold changes to measured eye-specific differences in volume (mean ± standard error) and added confidence intervals in Supplemental Table S3. The effect sizes for eye-specific differences in VGluT2 volume at P4 are robust: ~2.3 and ~1.5 for mAZ and sAZ measurements in WT mice, and ~2.5 and ~1.8 in β2KO mice, with all analyses well-powered (Supplemental Table S2).

      We were unable to identify any mislabeled scatter plots and believe all figures are correctly labeled. While dominant-eye advantage in bouton size is consistent with previous literature, our study provides the first detailed analysis of how this develops specifically during the critical period of competition, with distinct patterns for single versus multi-active zone contacts. Our data show that dominant-eye inputs have larger vesicle pools that scale with active zone number. While this suggests enhanced transmission capacity, we make no direct physiological claims based on structural data alone.

      Claim (189): "This shows that vesicle docking at release sites favors the dominant-eye as we previously reported but is similar for like eye type inputs regardless of AZ number."

      Contradicts core claim of manuscript: Consistent with previous literature, there is an activity dependent relative increase in vGlut2 clustering of dominant eye RGCs. The new information is that that activity dependence is more or less the same in sAZ and mAZ. The only plausible alternative is that vGlut2 scaling only increases in mAZ which would be consistent with the claims of their paper. That is not what they found. To the extent that the analysis presented in this manuscript tests a hypothesis, this is it. The claim of the title has been refuted by figure 3.

      We report the volume of docked vesicle signal (VGluT2) nearby each active zone, finding this is greater for dominant-eye synapses. Within each eye-specific synapse population, vesicle signal per active zone is similar regardless of whether these are part of single- or multi-active zone contacts. This is consistent with a modular program of active zone assembly and maintenance: core molecular programs facilitate docking at each AZ similarly regardless of how many AZs are nearby. 

      This finding does not contradict our main conclusions but rather provides insight into how synaptic advantages are structured. The dominant eye's advantage may arise in part from forming more multi-AZ contacts (which have proportionally more docked vesicles) rather than from enhanced vesicle loading per individual active zone. This organization may reflect how developmental competition operates through contact number and active zone addition rather than fundamental changes to individual release site properties.

      We have changed the title to be descriptive rather than mechanistic.

      Claim (line 235): "For the non-dominant eye projection, however, clustered mAZ inputs outnumbered clustered sAZ inputs at P4 (Fig. 4C, bottom left panel), the age when this eye adds sAZ synapses (Fig. 2C)."

      Misleading: The overwhelming trend across 24 comparisons is that the sAZ clustering looks like mAZ clustering. That is the objective and unambiguous result. Among these 24 underpowered tests (n=3), there were a few p-values < 0.05. The authors base their interpretation of cell behavior on these crossings.

      In Figures 4C and 4D we report significant results with high effect sizes (effect sizes all greater than 2; see Supplemental Table S2). The mean differences are modest (5-7%) and significance arises due to low variance between biological replicates. We acknowledge that clustering patterns are generally similar between mAZ and sAZ inputs across most conditions. We have revised the text to describe these as “slight” differences and that “WT mice show a tendency toward forming more synapses near mAZ inputs”, reflecting appropriate caution in our interpretation while noting the statistical consistency of these patterns.

      Claim (line 328): "The failure to add synapses reduced synaptic clustering and more inputs formed in isolation in the mutants compared to controls."

      Trivially true: Density was lower in mutant.

      We have rewritten the sentence for clarity: “The failure to add synapses could explain the observation that synaptic clustering was reduced and more inputs formed in isolation in the mutants compared to controls.”

      Claim (line 332): "While our findings support a role for spontaneous retinal activity in presynaptic release site addition and clustering..."

      Not meaningfully supported by evidence: I could not find meaningful differences between WT and mutant beside the already known dramatic difference in synapse density.

      We have changed the sentence to avoid overinterpreting the results. The new sentence in lines 415-417 reads: “While our results highlight developmental changes in presynaptic release site addition and clustering, activity-dependent postsynaptic mechanisms also influence input refinement at later stages.”

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Zhang and Speer examine changes in the spatial organization of synaptic proteins during eye specific segregation, a developmental period when axons from the two eyes initially mingle and gradually segregate into eye-specific regions of the dorsal lateral geniculate. The authors use STORM microscopy and immunostain presynaptic (VGluT2, Bassoon) and postsynaptic (Homer) proteins to identify synaptic release sites. Activity-dependent changes of this spatial organization are identified by comparing the β2KO mice to WT mice. They describe two types of synapses based on Bassoon clustering: the multiple active zone (mAZ) synapse and single active zone (sAZ) synapse. In this revision, the authors have added EM data to support the idea that mAZ synapses represent boutons with multiple release sites. They have also reanalyzed their data set with different statistical approaches.

      Strengths:

      The data presented is of good quality and provides an unprecedented view at high resolution of the presynaptic components of the retinogeniculate synapse during active developmental remodeling. This approach offers an advance over previous mouse EM studies of this synapse because the CTB label allows identification of the eye from which the presynaptic terminal arises.

      Weaknesses:

      While the interpretation of this data set is much more grounded in this second revised submission, some of the authors' conclusions/statements still lack convincing supporting evidence. In particular, the data does not support the title: "Eye-specific active zone clustering underlies synaptic competition in the developing visual system". The data show that there are fewer synapses made for both contra- and ipsi- inputs in the β2KO mice-- this fact alone can account for the differences in clustering. There is no evidence linking clustering to synaptic competition. Moreover, the findings of differences in AZ# or distance between AZs that the authors report are quite small and it is not clear whether they are functionally meaningful.

      We thank the reviewer for their helpful suggestions that improved the manuscript in this revision. We have changed the title to remove the reference to “clustering” and to avoid implying any causal relationships. The new title is descriptive: “Eye-specific differences in active zone addition during synaptic competition in the developing visual system”.

      To further address the reviewer’s comments, we have removed the remaining references to activity-dependent effects on synaptic development (line 36, line 96, line 415). We have also modified the text in lines 411-413 to state that “The failure to add synapses could explain the observation that synaptic clustering was reduced and more inputs formed in isolation in the mutants compared to controls.”

      We have also updated our presentation of results for Figure 4 to ensure that we do not causally link clustering to synaptic competition. In Figures 4C and 4D we report significant results with high effect sizes (effect sizes all greater than 2; see Supplemental Table S2). The mean differences are modest (5-7%) and significance arises due to low variance between biological replicates. We acknowledge that clustering patterns are generally similar between mAZ and sAZ inputs across most conditions. We have revised the text to describe these as “slight” differences and that “WT mice show a tendency toward forming more synapses near mAZ inputs”, reflecting appropriate caution in our interpretation while noting the statistical consistency of these patterns.

      Reviewer #3 (Public review):

      This study is a follow-up to a recent study of synaptic development based on a powerful data set that combines anterograde labeling, immunofluorescence labeling of synaptic proteins, and STORM imaging (Cell Reports, 2023). Specifically, they use anti-Vglut2 label to determine the size of the presynaptic structure (which they describe as the vesicle pool size), anti-Bassoon to label active zones with the resolution to count them, and anti-Homer to identify postsynaptic densities. Their previous study compared the detailed synaptic structure across the development of synapses made with contra-projecting vs. ipsi-projecting RGCs and compared this developmental profile with a mouse model with reduced retinal waves. In this study, they produce a new detailed analysis on the same data set in which they classify synapses into "multi-active zone" vs. "single-active zone" synapses and assess the number and spacing of these synapses. The authors use measurements to make conclusions about the role of retinal waves in the generation of same-eye synaptic clusters. The authors interpret these results as providing insight into how neural activity drives synapse maturation, but the strength of their conclusions is not directly tested by their analysis.

      Strengths:

      This is a fantastic data set for describing the structural details of synapse development in a part of the brain undergoing activity-dependent synaptic rearrangements. The fact that they can differentiate the eye of origin is what makes this data set unique over previous structural work. The addition of example images from the EM dataset provides confidence in their categorization scheme.

      Weaknesses:

      Though the descriptions of single vs multi-active zone synapses are important and represent a significant advance, the authors continue to make unsupported conclusions regarding the biological processes driving these changes. Although this revision includes additional information about the populations tested and the tests conducted, the authors do not address the issue raised by previous reviews. Specifically, they provide no assessment of what effect size represents a biologically meaningful result. For example, a more appropriate title is "The distribution of eye-specific single vs multi-active zone is altered in mice with reduced spontaneous activity" rather than concluding that this difference in clustering is somehow related to synaptic competition. Of course, the authors are free to speculate, but many of the conclusions of the paper are not supported by their results.

      We appreciate the reviewer’s helpful critique. We have changed the title to be descriptive and avoid implying causal relationships. 

      We have applied false discovery rate (FDR) correction using the Benjamini-Hochberg method with α = 0.05 within each experimental condition (age × genotype combination). The FDR correction treats each condition as addressing a distinct experimental question: 'What synaptic properties differ between left eye and right eye inputs in this specific developmental stage and genotype?'

      This correction strategy is appropriate because: 1) we focus our statistical comparisons within each age/genotype; 2) each age-genotype combination represents a separate biological context where different synaptic properties between eye-of-origin may be relevant; and 3) this approach controls for multiple testing within each experimental question while maintaining statistical power to detect meaningful biological differences.

      We applied FDR correction separately to the ~20-34 measurements (varying with age and genotype) within each of the six experimental conditions (P2-WT, P2-ß2, P4-WT, P4-ß2, P8-WT, P8-ß2), resulting in condition-specific adjusted p-values. These are reported in the updated Supplemental Table S2. Figures have also been updated to reflect the FDR-adjusted values. Selected between-genotype comparisons are presented descriptively using 5/95% confidence intervals. This correction confirmed the robustness of our key findings.
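
      As an illustration of the per-condition correction described above, the following is a minimal sketch of the Benjamini-Hochberg adjustment applied independently within each age × genotype group. The p-values and the function names are hypothetical and are not taken from the manuscript or its supplemental tables; only the general procedure (BH adjustment, one family of tests per condition) follows the response.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjust_within_conditions(p_by_condition):
    """Apply the BH adjustment separately to each condition's family of tests."""
    return {cond: benjamini_hochberg(ps) for cond, ps in p_by_condition.items()}


# Hypothetical raw p-values for two conditions (illustration only).
example = {
    "P4-WT": [0.01, 0.04, 0.03, 0.20],
    "P4-b2KO": [0.002, 0.50],
}
adjusted = adjust_within_conditions(example)
significant = {c: [q <= 0.05 for q in qs] for c, qs in adjusted.items()}
```

      Because each condition is treated as its own family, the adjustment factor depends only on the number of tests within that condition, not on the total number of tests across all six conditions.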

      With regard to the biological significance of effect sizes, our key findings demonstrate effect sizes >2.0, indicating robust effects. During critical developmental periods, consistent structural differences, even those modest in absolute magnitude, can reflect important regulatory mechanisms that influence refinement outcomes. The differences in synaptic organization we observe occur during the first postnatal week when eye-specific competition is active, suggesting these patterns may be relevant to understanding how structural advantages emerge during synaptic refinement.

      Reviewer #1 (Recommendations for the authors):

      I have tried to understand the analysis and biology of this manuscript as best I can. I believe the analytical approach taken is not reliable and I have explained why in my public comments. I don't believe this manuscript is unique in taking this approach. I have recently published a paper on how common this approach is and why it doesn't work. I don't want to give the impression that the problem with the analysis was that it was not computationally sophisticated enough or that you did not jump through a specific statistical hoop. If I strip out the arguments that depend on misinterpretations of p-values and -instead- look at the scatterplots, I come up with a very different view of the data than what is described in the paper.

      The information in the plots could be translated into a rigorous statistical analysis of estimated differences between groups given the uncertainties of the experimental design. I don't really think that analysis would be useful. I think it would have been enough to publish the plots and report your estimates of the number of active zones in RGCs during development. I don't see evidence of an additional effect.

      We appreciate the reviewer’s helpful comments throughout the review process. Mean active zone numbers per mAZ contact are presented in Figure S2D/E. We look forward to further technical and computational advances that will help us increase our data acquisition throughput and sample sizes when designing future studies. 

      Reviewer #2 (Recommendations for the authors):

      The authors should modify the title and other text to be more consistent with the data. There is no evidence that active zone clustering has any direct relationship to synaptic competition.

      We appreciate the reviewer’s helpful suggestions to ensure appropriate language around causal effects. We have modified the title to accurately reflect the results: "Eye-specific differences in active zone addition during synaptic competition in the developing visual system." We have revised the text in the abstract, introduction, and results section for Figure 4 to be consistent with the data and not imply causality of synapse clustering on segregation phenotypes.

      Reviewer #3 (Recommendations for the authors):

      Change the title.

      We appreciate the reviewer’s feedback throughout the review process. We have modified the title to accurately reflect the results: "Eye-specific differences in active zone addition during synaptic competition in the developing visual system."

    1. Unclear Privacy Rules: Sometimes privacy rules aren’t made clear to the people using a system. For example: If you send “private” messages on a work system, your boss might be able to read them [i19]. When Elon Musk purchased Twitter, he also was purchasing access to all Twitter Direct Messages [i20]. Others Posting Without Permission: Someone may post something about another person without their permission. See in particular: The perils of ‘sharenting’: The parents who share too much [i21]. Metadata: Sometimes the metadata that comes with content might violate someone’s privacy. For example, in 2012, former tech CEO John McAfee was a suspect in a murder in Belize [i22], and he hid out in secret. But when Vice magazine wrote an article about him, the photos in the story contained metadata with the exact location in Guatemala [i23]. Deanonymizing Data: Sometimes companies or researchers release datasets that have been “anonymized,” meaning that things like names have been removed, so you can’t directly see who the data is about. But sometimes people can still deduce who the anonymized data is about. This happened when Netflix released anonymized movie ratings data sets, but at least some users’ data could be traced back to them [i24]. Inferred Data: Sometimes information that doesn’t directly exist can be inferred through data mining (as we saw last chapter), and the creation of that new information could be a privacy violation. This includes the creation of Shadow Profiles [i25], which are information about the user that the user didn’t provide or consent to. Non-User Information: Social Media sites migh

      This section makes me think that on the internet nowadays there is absolutely no way to keep your information to yourself. People's information is held by so many different companies, and users do not know how it is being used. Users have no control over their own privacy, even though the information is about themselves.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      We thank the reviewer for very enthusiastic and supportive comments on our manuscript. 

      Summary:

      This manuscript presents a compelling and innovative approach that combines Track2p neuronal tracking with advanced analytical methods to investigate early postnatal brain development. The work provides a powerful framework for exploring complex developmental processes such as the emergence of sensory representations, cognitive functions, and activity-dependent circuit formation. By enabling the tracking of the same neurons over extended developmental periods, this methodology sets the stage for mechanistic insights that were previously inaccessible.

      Strengths:

      (1) Innovative Methodology:

      The integration of Track2p with longitudinal calcium imaging offers a unique capability to follow individual neurons across critical developmental windows.

      (2) High Conceptual Impact:

      The manuscript outlines a clear path for using this approach to study foundational developmental questions, such as how early neuronal activity shapes later functional properties and network assembly.

      (3) Future Experimental Potential:

      The authors convincingly argue for the feasibility of extending this tracking into adulthood and combining it with targeted manipulations, which could significantly advance our understanding of causality in developmental processes.

      (4) Broad Applicability:

      The proposed framework can be adapted to a wide range of experimental designs and questions, making it a valuable resource for the field.

      Weaknesses:

      No major weaknesses were identified by this reviewer. The manuscript is conceptually strong and methodologically sound. Future studies will need to address potential technical limitations of long-term tracking, but this does not detract from the current work's significance and clarity of vision.

      Reviewer #2 (Public review):

      Summary:

      The manuscript by Majnik and colleagues introduces "Track2p", a new tool designed to track neurons across imaging sessions of two-photon calcium imaging in developing mice. The method addresses the challenge of tracking cells in the growing brain of developing mice. The authors showed that "Track2p" successfully tracks hundreds of neurons in the barrel cortex across multiple days during the second postnatal week. This enabled the identification of the emergence of behavioral state modulation and desynchronization of spontaneous network activity around postnatal day 11.

      Strengths:

      The manuscript is well written, and the analysis pipeline is clearly described. Moreover, the dataset used for validation is of high quality, considering the technical challenges associated with longitudinal two-photon recordings in mouse pups. The authors provide a convincing comparison of both manual annotation and "CellReg" to demonstrate the tracking performance of "Track2p". Applying this tracking algorithm, Majnik and colleagues characterized hallmark developmental changes in spontaneous network activity, highlighting the impact of longitudinal imaging approaches in developmental neuroscience. Additionally, the code is available on GitHub, along with helpful documentation, which will facilitate accessibility and usability by other researchers.

      Weaknesses:

      (1) The main critique of the "Track2p" package is that, in its current implementation, it is dependent on the outputs of "Suite2p". This limits adoption by researchers who use alternative pipelines or custom code. One potential solution would be to generalize the accepted inputs beyond the fixed format of "Suite2p", for instance, by accepting NumPy arrays (e.g., ROIs, deltaF/F traces, images, etc.) from files generated by other software. Otherwise, the tool may remain more of a useful add-on to "Suite2p" (see https://github.com/MouseLand/suite2p/issues/933) rather than a fully standalone tool.

      We thank the reviewer for this excellent suggestion. 

      We have now implemented this feature, where Track2p is now compatible with ‘raw’ NumPy arrays for the three types of inputs. For more information, please check the updated documentation: https://track2p.github.io/run_inputs_and_parameters.html#raw-npy-arrays. We have also tested this feature using a custom segmentation and trace extraction pipeline using Cellpose for segmentation.

      (2) Further benchmarking would strengthen the validation of "Track2p", particularly against "CaImAn" (Giovannucci et al., eLife, 2019), which is widely used in the field and implements a distinct registration approach.

      The reviewer suggested further benchmarking of Track2p. Ideally, we would want to benchmark Track2p against the current state-of-the-art method. However, the field currently lacks consensus on which algorithm performs best, with multiple methods available including CaImAn, SCOUT (Johnston et al. 2022), ROICaT (Nguyen et al. 2023), ROIMatchPub (recommended by Suite2p documentation and recently used by Hasegawa et al. 2024), and custom pipelines such as those described by Sun et al. 2025. The absence of systematic benchmarking studies (particularly for custom tracking pipelines) makes it impossible to identify the current state-of-the-art for comparison with Track2p. While comparing Track2p against all available methods would provide comprehensive evaluation, such an analysis falls beyond the scope of this paper.

      We selected CellReg for our primary comparison because it has been validated under similar experimental conditions—specifically, 2-photon calcium imaging in developing hippocampus between P17-P25 (Wang et al. 2024)—making it the most relevant benchmark for our developmental neocortex dataset.

      That said, to support further benchmarking in mouse neocortex (P8-P14), we will publicly release our ground truth tracking dataset.

      (3) The authors might also consider evaluating performance using non-consecutive recordings (e.g., alternate days or only three time points across the week) to demonstrate utility in other experimental designs.

      Thank you for your suggestion. We performed a similar analysis prior to submission, but we decided against including it in the final manuscript, to keep the evaluation brief and to not confuse the reader with too many different evaluation methods. We have included the results in Author response images 1 and 2 below.

      To evaluate performance in experimental designs with larger time spans between recordings (>1 day), we performed an additional evaluation of tracking from P8 to each of the subsequent days while omitting the intermediate days (e.g., P8 to P9, P8 to P10 … P8 to P14). The performance for the three mice from the manuscript is shown below:

      Author response image 1.

      As expected with increasing time difference between the two recordings the performance drops significantly (dropping to effectively zero for 2 out of 3 mice). This could also explain why CellReg struggles to track cells across all days, since it takes P8 as a reference and attempts to register all consecutive days to that time point before matching, instead of performing registration and matching in consecutive pairs of recordings (P8-P9, P9-P10 … P13-P14) as we do.
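
      The idea of matching in consecutive pairs of recordings and then following cells through the chain can be sketched as below. This is only an illustration of the chaining logic under simple assumptions; it is not Track2p's code, and the data structure (one dictionary of ROI matches per pair of consecutive days) and the function name are hypothetical.

```python
def chain_matches(pairwise):
    """Follow each ROI of the first day through consecutive-day matches.

    pairwise: list of dicts, one per consecutive pair of days, mapping an ROI
    index on day k to its matched ROI index on day k + 1 (hypothetical format).
    Returns only the tracks that survive across all days.
    """
    if not pairwise:
        return []
    tracks = []
    for start in pairwise[0]:
        track = [start]
        for mapping in pairwise:
            nxt = mapping.get(track[-1])
            if nxt is None:
                break  # the cell was lost at this pair of days
            track.append(nxt)
        else:
            tracks.append(track)  # matched in every consecutive pair
    return tracks


# Hypothetical matches for three days (two consecutive pairs).
matches = [{0: 2, 1: 0}, {2: 1}]
full_tracks = chain_matches(matches)
```

      In this sketch a cell is kept only if it is matched in every consecutive pair, so each individual match has to bridge only a one-day change in the FOV rather than the full span from the first to the last day.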

      Finally, for one of the three mice we also performed an additional test where we asked how adding an additional recording day might rescue the P8-P14 tracking performance. This corresponds to the comment from the reviewer, answering the question: if we can record on only three days, which additional day would give the best tracking performance?

      Author response image 2.

      As can be seen from the plot, adding the P10 or P11 recording shows the most significant improvement to the tracking performance, however the performance is still significantly lower than when including all days (see Fig. 4). This test suggests that including a day that is slightly skewed to earlier ages might improve the performance more than simply choosing the middle day between the two extremes. This would also be consistent with the qualitative observation that the FOV seems to show more drastic day-to-day changes at earlier ages in our recording conditions.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, Majnik et al. developed a computational algorithm to track individual developing interneurons in the rodent cortex at postnatal stages. Considerable development in cortical networks takes place during the first postnatal weeks; however, tools to study them longitudinally at a single-cell level are scarce. This paper provides a valuable approach to study both single-cell dynamics across days and state-driven network changes. The authors used Gad67Cre mice together with virally introduced TdTom to track interneurons based on their anatomical location in the FOV and AAVSynGCaMP8m to follow their activity across the second postnatal week, a period during which the cortex is known to undergo marked decorrelation in spontaneous activity. Using Track2p, the authors show the feasibility of tracking populations of neurons in the same mice, capturing with their analysis previously described developmental decorrelation and uncovering stable representations of neuronal activity, coincident with the onset of spontaneous active movement. The quality of the imaging data is compelling, and the computational analysis is thorough, providing a widely applicable tool for the analysis of emerging neuronal activity in the cortex. Below are some points for the authors to consider.

      We thank the reviewer for a constructive and positive evaluation of our MS. 

      Major points:

      (1) The authors used 20 neurons to generate a ground truth dataset. The rationale for this sample size is unclear. Figure 1 indicates the capability to track ~728 neurons. A larger ground truth data set will increase the robustness of the conclusions.

      We think this was a misunderstanding of our ground truth dataset analysis, which included 192 and not 20 neurons. Indeed, as explained in the Methods section, since manually tracking all cells would require prohibitive amounts of time, we decided to generate sparse manual annotations, only tracking a subset of all cells from the first recording day onwards. To do this, we took the first recording (s0), defined a grid of 64 equidistant points over the FOV and, for each point, identified the closest ROI in terms of Euclidean distance from the median pixel of the ROI (see Fig. S3A). We then manually tracked these 64 ROIs across subsequent days. Only neurons that were detected and tracked across all sessions were taken into account and referred to as our ground truth dataset (‘GT’ in Fig. 4). This was done for 3 mice, hence 3 × 64 neurons, and not 20, were used to generate our GT dataset.
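
      The sparse sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the 64 points are laid out as an 8 × 8 grid of cell centers, that each ROI is given as a list of (y, x) pixel coordinates, and that a grid point simply takes the nearest ROI even if another grid point has already selected it. All names are hypothetical.

```python
import math
from statistics import median


def grid_points(height, width, n_per_side=8):
    """Centers of an n_per_side x n_per_side grid of equally spaced points."""
    ys = [(i + 0.5) * height / n_per_side for i in range(n_per_side)]
    xs = [(j + 0.5) * width / n_per_side for j in range(n_per_side)]
    return [(y, x) for y in ys for x in xs]


def median_pixel(roi_pixels):
    """Median (y, x) coordinate of an ROI given as a list of (y, x) pixels."""
    ys, xs = zip(*roi_pixels)
    return (median(ys), median(xs))


def pick_rois(rois, height, width, n_per_side=8):
    """For each grid point, return the index of the ROI whose median pixel is closest."""
    centers = [median_pixel(pixels) for pixels in rois]
    chosen = []
    for point in grid_points(height, width, n_per_side):
        nearest = min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))
        chosen.append(nearest)
    return chosen


# Tiny hypothetical FOV with two ROIs and a 2 x 2 grid.
rois = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
selected = pick_rois(rois, height=12, width=12, n_per_side=2)
```

      Sampling from a regular grid spreads the manually annotated cells over the whole FOV, so the ground truth is not biased toward one region of the image.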

      (2) It is unclear how movement was scored in the analysis shown in Figure 5A. Was the time that the mouse spent moving scored after visual inspection of the videos? Were whisker and muscle twitches scored as movement, or was movement quantified as the amount of time during which the treadmill was displaced?

      Movement was scored using a ‘motion energy’ metric as in Stringer et al. 2019 (V1) or Inácio et al. 2025 (S1). This metric takes each two consecutive frames of the videography recordings and computes the difference between them by summing up the square of pixelwise differences between the two images. We made the appropriate changes in the manuscript to further clarify this in the main text and methods in order to avoid confusion.
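
      A minimal sketch of such a motion energy computation is given below, assuming each video frame is a 2D list of grayscale pixel values. It follows the description above (sum of squared pixelwise differences between consecutive frames) but is not the authors' implementation; any preprocessing, cropping, or normalization they may use is not shown, and the function name is hypothetical.

```python
def motion_energy(frames):
    """Sum of squared pixelwise differences between each pair of consecutive frames.

    frames: list of 2D lists (rows of pixel intensities), all the same shape.
    Returns one value per consecutive pair, so len(frames) - 1 values.
    """
    energy = []
    for prev, curr in zip(frames, frames[1:]):
        total = 0.0
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += (b - a) ** 2
        energy.append(total)
    return energy


# Hypothetical 2 x 2 video: a change between frames 0 and 1, none between 1 and 2.
video = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 2]],
    [[1, 0], [0, 2]],
]
trace = motion_energy(video)
```

      Since every pixel contributes to the sum, a change spread over a large part of the camera FOV yields a larger value than the same intensity change confined to a small region.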

      Since this metric quantifies global movements, it is inherently biased to whole-body movements causing more significant changes in pixel values around the whole FOV of the camera. Slight twitches of a single limb, or the whisker pad would thus contribute much less to this metric, since these are usually slight displacements in a small region of the camera FOV. Additionally, comparing neural activity across all time points (using correlation or R²) also favours movements that last longer (such as wake movements / prolonged periods of high arousal) since each time point is treated equally.

      As we suggested in the discussion, in further analysis it would be interesting to look at the link between twitches and neural activity, but this would likely require extensive manual scoring. We could then treat movements not as continuous across all time-points, but instead using event-based analysis for example peri-movement time histograms for different types of movements at different ages, which is however outside of the scope of this study.

      (3) The rationale for binning the data analysis in early P11 is unclear. As the authors acknowledged, it is likely that the decoder captured active states from P11 onwards. Because active whisking begins around P14, it is unlikely to drive this change in network dynamics at P11. Does pupil dilation in the pups change during locomotor and resting states? Does the arousal state of the pups abruptly change at P11?

      We agree that P11 does not match any change in mouse behavior that we have been able to capture. However, arousal state in mice does change around postnatal day 11. This period marks a transition from immature, fragmented states to more organized and regulated sleep-wake patterns, along with increasing influence from neuromodulatory and sensory systems. All of these changes have been recently reviewed in Wu et al. 2024 (see also Martini et al. 2021). In addition, in the developing somatosensory system, before postnatal day 11 (P11), wake-related movements (reafference) are actively gated and blocked by the external cuneate nucleus (ECN, Tiriac et al. 2016 and all excellent recent work from the Blumberg lab). This gating prevents sensory feedback from wake movements from reaching the cortex, ensuring that only sleep-related twitches drive neural responses. However, around P11, this gating mechanism abruptly lifts, enabling sensory signals from wake movements to influence cortical processing, which signals a dramatic developmental shift (Wu et al. 2024).

      Reviewer #1 (Recommendations for the authors):

      This manuscript represents a significant advancement in the field of developmental neuroscience, offering a powerful and elegant framework for longitudinal cellular tracking using the Track2p method combined with robust analytical approaches. The authors convincingly demonstrate that this integrated methodology provides an invaluable template for investigating complex developmental processes, including the emergence of sensory representations and higher cognitive functions.

      A major strength of this work is its emphasis on the power of longitudinal imaging to illuminate activity-dependent development. By tracking the same neurons over time, the authors open up new possibilities to uncover how early activity patterns shape later functional outcomes and the organization of neuronal assemblies, insights that would be inaccessible using conventional cross-sectional designs.

      Importantly, the manuscript highlights the potential for this approach to be extended even further, enabling continuous tracking into adulthood and thus offering an unprecedented window into long-term developmental trajectories. The authors also underscore the exciting opportunity to incorporate targeted perturbation experiments, allowing researchers to causally link early circuit dynamics to later outcomes.

      Given the increasing recognition that early postnatal alterations can underlie the etiology of various neurodevelopmental disorders, this work is especially timely. The methods and perspectives presented here are poised to catalyze a new generation of developmental studies that can reveal mechanistic underpinnings of both typical and atypical brain development.

      In summary, this is a technically impressive and conceptually forward-looking study that sets the stage for transformative advances in developmental neuroscience.

      Thank you for the thoughtful feedback—it's greatly appreciated!

      Reviewer #2 (Recommendations for the authors):

      Minor points:

      (1) Figure 1. Consider merging or moving to Supplemental, as its rationale is well described in the text.

      We would like to retain the current figure as we believe it provides an effective visual illustration of our rationale that will capture readers' attention and could serve as a valuable reference for others seeking to justify longitudinal tracking of the developing brain. We hope the reviewer will understand our decision.

      (2) Some axis labels and panels are difficult to read due to small font sizes (e.g. smaller panels in Figures 5-7).

      Modified, thanks 

      (3) Supplementary Figures. The order of appearance in the main text is occasionally inconsistent.

      This was modified, thanks

      (4) Line 132. Add a reference to the registration toolbox used (elastix). A brief description of the affine transformation would also be helpful, either here or in the Methods section (p. 27).

      We have added reference to Ntatsis et al. 2023 and described affine transformation in the main text (lines 133-135): 

      Firstly, we estimate the spatial transformation between s0 and s1 using affine image registration (i.e. allowing shifting, rotation, scaling and shearing, see Fig. 2B, the transformation is denoted as T).
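
      As a rough illustration of what an affine transformation T permits (shifting, rotation, scaling and shearing), the sketch below builds a 2D affine matrix and applies it to point coordinates. It is not the elastix registration itself, which estimates T from the images; the function names, parameter names and composition order are assumptions made for this example.

```python
import math

def make_affine(shift=(0.0, 0.0), angle_deg=0.0, scale=(1.0, 1.0), shear=0.0):
    """Build a 2x3 affine matrix (hypothetical helper).

    The linear part is R @ S @ H: shear first, then scale, then rotate.
    The shift is added last.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    sx, sy = scale
    # S @ H = [[sx, sx * shear], [0, sy]]; multiply on the left by the rotation R.
    m00 = cos_a * sx
    m01 = cos_a * sx * shear - sin_a * sy
    m10 = sin_a * sx
    m11 = sin_a * sx * shear + cos_a * sy
    return [[m00, m01, shift[0]], [m10, m11, shift[1]]]

def apply_affine(T, points):
    """Map (x, y) points through the affine matrix T."""
    return [
        (T[0][0] * x + T[0][1] * y + T[0][2],
         T[1][0] * x + T[1][1] * y + T[1][2])
        for x, y in points
    ]

# Example: a pure shift moves every point by the same offset.
shifted = apply_affine(make_affine(shift=(2.0, -1.0)), [(1.0, 1.0)])
# Example: a 90 degree rotation sends the x axis onto the y axis.
rotated = apply_affine(make_affine(angle_deg=90.0), [(1.0, 0.0)])
```

      With six free parameters in 2D, an affine map can absorb global field-of-view changes between sessions, but not local, non-rigid deformations.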

      (5) Lines 147-151. If this method is adapted from another work, please cite the source.

      Computing the intersection over union of two ROIs for tracking is a widely established and intuitive method used across numerous studies, representing standard practice rather than requiring specific citation. We have however included the reference to the paper describing the algorithm we use to solve the linear sum assignment problem used for matching neurons across a pair of consecutive days (Crouse 2016).
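
      The matching idea described above can be sketched as follows: compute the intersection over union (IoU) for every pair of ROIs from two consecutive days, then pick the one-to-one assignment with the highest total IoU. This is a minimal sketch, not the Track2p implementation. ROIs are represented as sets of pixel coordinates, the `min_iou` threshold is a hypothetical parameter, and the exhaustive search over permutations stands in for a proper linear sum assignment solver (for example the algorithm of Crouse 2016), which would be needed for realistic ROI counts.

```python
from itertools import permutations

def iou(roi_a, roi_b):
    """Intersection over union of two ROIs given as collections of pixels."""
    a, b = set(roi_a), set(roi_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_rois(rois_day0, rois_day1, min_iou=0.5):
    """Return (i, j, iou) pairs that maximize total IoU.

    Brute force over permutations, so only suitable for tiny examples.
    Pairs below the (hypothetical) min_iou threshold are discarded.
    """
    n0, n1 = len(rois_day0), len(rois_day1)
    if n0 > n1:
        raise ValueError("this sketch assumes len(rois_day0) <= len(rois_day1)")
    scores = [[iou(a, b) for b in rois_day1] for a in rois_day0]
    best = max(
        permutations(range(n1), n0),
        key=lambda p: sum(scores[i][j] for i, j in enumerate(p)),
    )
    return [(i, j, scores[i][j]) for i, j in enumerate(best) if scores[i][j] >= min_iou]

# Toy example: two ROIs on day 0, three candidate ROIs on day 1.
day0 = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
day1 = [{(5, 5), (5, 6), (5, 7)}, {(0, 0), (0, 1), (1, 1)}, {(9, 9)}]
matches = match_rois(day0, day1)
```

      In this toy case the first day-0 ROI pairs with the second day-1 ROI and the second day-0 ROI with the first, while the isolated third day-1 ROI stays unmatched.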

      (6) Line 218. "classical" or automatic?

      We meant “classical” in the sense of widely used. 

      (7) Lines 220-231. Did the authors find significant variability of successfully tracked neurons across mice? While the data for successfully tracked cells is reported (Figure 5B), the proportions are not. Could differences in neuron dropout across days and mice affect the analysis of neuronal activity statistics?

      We thank the reviewer for raising this important point. We computed the fraction of successfully tracked cells in our dataset and found substantial variability:

      Cells detected on day 0: [607, 1849, 2190, 1988, 1316, 2138] 

      Proportion successfully tracked: [0.47, 0.20, 0.36, 0.37, 0.41, 0.19]

      Notably, the number of cells detected on the first day varies considerably (607–2138 cells). There appears to be a trend whereby datasets with fewer initially detected cells show higher tracking success rates, potentially because only highly active cells are identified in these cases.
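
      The trend mentioned above can be checked descriptively with the six values reported in this response. The sketch below computes a Pearson correlation between the number of cells detected on day 0 and the proportion successfully tracked; the helper name is an assumption, and with only six mice the coefficient is a description of this dataset rather than a statistical test.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Values copied from the response above (one entry per mouse).
cells_day0 = [607, 1849, 2190, 1988, 1316, 2138]
prop_tracked = [0.47, 0.20, 0.36, 0.37, 0.41, 0.19]

r = pearson(cells_day0, prop_tracked)
print(f"Pearson r = {r:.2f}")
```

      A negative coefficient is consistent with the described trend that datasets with fewer initially detected cells show higher tracking success.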

      To draw more definitive conclusions about the proportion of active cells and tracking dropout rates, we would require activity-independent cell detection methods (such as Cellpose applied to isosbestic 830 nm fluorescence, or ideally a pan-neuronal marker in a separate channel, e.g., tdTomato). We have incorporated the tracking success proportions into the revised manuscript.

      (8) Line 260. Please briefly explain, here or in the Methods, the rationale for using data from only 3 mice (rather than all 6) for evaluating tracking performance.

      We used three mice for this analysis due to the labor-intensive nature of manually annotating 64 ROIs across several days. Given the time constraints of this manual process, we determined that three subjects would provide adequate data to reliably assess tracking performance.

      (9) Line 277. Consider clarifying or rephrasing the phrase "across progressively shorter time intervals"? Do you mean across consecutive days?

      This has been rephrased as follows: 

      Additionally, to assess tracking performance over time, we quantified the proportion of reconstructed ground truth tracks over progressively longer time intervals (first two days, first three days etc. ‘Prop. correct’ in Fig. 4C-F, see Methods). This allowed us to understand how tracking accuracy depends on the number of successive sessions, as well as at which time points the algorithm might fail to successfully track cells.
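
      One way to read the 'Prop. correct' measure is sketched below: for each number of successive sessions, count the fraction of ground truth tracks whose ROI identities up to that session are all reproduced by some predicted track. The exact definition is given in the Methods of the manuscript; the data layout (one list of ROI indices per track, one entry per day) and the function name are assumptions for this example.

```python
def prop_correct(ground_truth, predicted, n_days):
    """Fraction of ground truth tracks fully reconstructed over the first n_days.

    Each track is a list of ROI indices, one per recording day.
    """
    if not ground_truth:
        raise ValueError("need at least one ground truth track")
    predicted_prefixes = {
        tuple(track[:n_days]) for track in predicted if len(track) >= n_days
    }
    hits = sum(tuple(track[:n_days]) in predicted_prefixes for track in ground_truth)
    return hits / len(ground_truth)

# Toy example with three manually annotated tracks over three days.
ground_truth = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
predicted = [[1, 4, 7], [2, 5, 0], [3, 9, 9]]

# Evaluate over progressively longer intervals: first two days, first three days.
curve = {n: prop_correct(ground_truth, predicted, n) for n in (2, 3)}
```

      Because a track counts as correct only if every session up to that point is matched, this proportion can only stay constant or decrease as more sessions are included, which shows at which time point tracks are lost.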

      (10) Line 306. "we also provide additional resources and documentation". Please add a reference or link.

      Done, thanks

      Track2p  

      (11) Lines 342-344. Specify that the raster plots refer to one example mouse, not the entire sample.

      Done, thanks.

      (12) Lines 996-1002. Please confirm whether only successfully tracked neurons were used to compute the Pearson correlations between all pairs.

      Yes of course, this only applies to tracked neurons as it is impossible to compute this for non-tracked pairs.

      (13) Line 1003. Add a reference to scikit-learn.

      Reference was added to: 

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. 

      (14) Typos. Correct spacing between numeric values and units.

      We did not find many typos regarding spacing between the numerical value and the unit symbol (degrees and percent should not be spaced, right?).

      Reviewer #3 (Recommendations for the authors):

      The font size in many of the figures is too small. For example, it is difficult to follow individual ROIs in Figure S3.

      Figure font size has been increased, thanks. In Figure S3 there might have been a misunderstanding, since the three FOV images do not correspond to the FOV of the same mouse across three days but rather to the first recording for each of the three mice used in evaluation (the ROIs can thus not be followed across images since they correspond to a different mouse). To avoid confusion we have labelled each of the FOV images with the corresponding mouse identifier (same as in Fig. 4 and 5).

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review): 

      Summary: 

      In this manuscript, the authors explore the role of the conserved transcription factor POU4-2 in planarian maintenance and regeneration of mechanosensory neurons. The authors explore the role of this transcription factor and identify potential targets of this transcription factor. Importantly, many genes discovered in this work are deeply conserved, with roles in mechanosensation and hearing, indicating that planarians may be a useful model with which to study the roles of these key molecules. This work is important within the field of regenerative neurobiology, but also impactful for those studying the evolution of the machinery that is important for human hearing. 

      Strengths: 

      The paper is rigorous and thorough, with convincing support for the conclusions of the work. 

      Weaknesses: 

      Weaknesses are relatively minor and could be addressed with additional experiments or changes in writing.

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, the authors investigate the role of the transcription factor Smed-pou4-2 in the maintenance, regeneration, and function of mechanosensory neurons in the freshwater planarian Schmidtea mediterranea. First, they characterize the expression of pou4-2 in mechanosensory neurons during both homeostasis and regeneration, and examine how its expression is affected by the knockdown of soxB1, 2, a previously identified transcription factor essential for the maintenance and regeneration of these neurons. Second, the authors assess whether pou4-2 is functionally required for the maintenance and regeneration of mechanosensory neurons. 

      Strengths: 

      The study provides some new insights into the regulatory role of pou4-2 in the differentiation, maintenance, and regeneration of ciliated mechanosensory neurons in planarians. 

      Weaknesses: 

      The overall scope is relatively limited. The manuscript lacks clear organization, and many of the conclusions would benefit from additional experiments and more rigorous quantification to enhance their strength and impact. 

      Reviewing Editor Comments: 

      (1) Quantification of pou4-2(+) cells that express (or do not express) hmcn-1-L and/or pkd1L-2(-) is a common suggestion amongst reviewers. It is recognized that Ross et al. (2018) showed that pkd1L-2 and hmcn-1L expression is detected in separate cells by double FISH, and the analysis presented in Supplementary Figure S3 is helpful in showing that some cells expressing pou4-2 (magenta) are not labeled by the combined signal of pkd1L-2 and hmcn-1-L riboprobes (green). However, I am not sure that we can conclude that pkd1L-2 and hmcn-1-L are effectively detected when riboprobes are combined in the analysis. Therefore, quantification of labeled cells as proposed by Reviewers 1 and 2 would help.

      Combining riboprobes is a standard approach in the field, and we chose this method as a direct way to determine which cells lack expression of both genes. We agree that providing the raw quantification data would be helpful for readers, and we included this data in Supplementary File S7; the file contains the quantification information for this dFISH experiment represented in Supplementary Figure 3.

      (2) It may be helpful to comment on changes (or lack of changes) in atoh gene RNA levels in RNAseq analyses of pou4-2 animals. As mentioned by one of the reviewers, in situs that don't show signal are inconclusive in this regard. 

      We fully agree with both reviewers. Two of the planarian atonal homologs are difficult to detect and produce background signal, as we previously reported in Cowles et al. Development (2013). We performed the reciprocal RNAi/in situ experiments out of curiosity, given the reported role of atonal in the pou4 cascade in other organisms. However, these exploratory experiments lacked a strong rationale for inclusion, particularly given that pou4-2 and the atonal homologs do not share expression patterns, co-expression, or differential expression in our RNA-seq dataset. Therefore, we decided to omit the atonal in situs following pou4-2 RNAi. We retained the experiments showing that knockdown of the atonal genes does not show robust effects on the mechanosensory neuron pattern, as expected. We thank the reviewing editor and reviewers for pinpointing the concern. We agree that additional experiments, such as qPCR, would be needed to draw firm conclusions. We reasoned that while these additional experiments could be informative, they are unlikely to alter the key conclusions of this study substantially.

      (3) There seem to be typos at bottom of Figure 10 and top of page 11 when referencing to Figure 4B (should be to 5B instead): "While mechanosensory neuronal patterned expression of Eph1 was downregulated after pou4-2 and soxB1-2 inhibition, low expression in the brain branches of the ventral cephalic ganglia persisted (Figure 4B)." 

      Thank you! We have fixed those.

      (4) Typo (page 13; kernel?): "...to test to what extent the Pou4 gene regulatory kernel is conserved among these widely divergent animals." 

      Regulatory kernels are defined as the minimal sets of interacting genes that drive developmental processes and are the core circuits within a gene regulatory network, but we recognize that this might not be as well known, so we have changed the term to “network” for clarity.

      Reviewer #1 (Recommendations for the authors): 

      (1) The authors indicate that they are interested in finding out whether POU4-2 is important in the creation of mechanosensory neurons in adulthood as well as in embryogenesis (in other words, whether the mechanism is "reused during adult tissue maintenance and regeneration"). The manuscript clearly shows that planarian POU4-2 is important in adult neurogenesis in planarians, but there is no evidence presented to show that this is a recapitulation of embryogenesis. Is pou4-2 expressed in the planarian embryo? This might be possible to examine by ISH or through the evaluation of sequencing data that already exists in the literature.

      We agree that these statements should be precise. We have clarified when we make comparisons to the role of Pou4 in sensory system development in other organisms versus its role in the adult planarian. We examined its expression using the existing database of embryonic gene expression. Thanks for hinting at this idea. We performed BLAST in Planosphere (Davies et al., 2017) to cross-reference our clone matching dd_Smed_v6_30562_0_1, which is identical to SMED30002016. The embryonic gene expression for SMED30002016 indicates this gene is expressed at the expected stages given prior knowledge of the timing of organ development in Schmidtea mediterranea (a positive trend begins at Stage 5, with a marked increase by Stage 6 that remains comparable to the asexual expression levels shown). We thank the reviewer for pointing out this oversight. We have incorporated this result in the paper as a Supplementary Figure and discuss how we can only speculate that it has a similar role as we detect in the adult asexual worms.

      (2) Can it be determined whether the punctate pou4-2+ cells outside of the stripes are progenitors or other neural cell types? Are there pou4-2+ neurons that are not mechanosensory cell types? Could there be other roles for POU4-2 in the neurogenesis of other cell types? It might help to show percentages of overlap in Figure 4A and discuss whether the two populations add up to 100% of cells. 

      These are good questions that arise in part from other statements that need clarification in the text (pointed out by Reviewer 2). We think some of the dorsal pou4-2<sup>+</sup> might represent progenitor cells undergoing terminal differentiation (see Supplementary Figure 4). We attempted BrdU pulse chase experiments but were not successful in consistently detecting pou4-2 at sufficient levels with our protocol. In response to this helpful comment, we have included this question as a future direction in the revised Discussion. Finally, we have edited our description of the expression pattern. We already pointed out that there are other cells on the ventral side that are not affected when soxB1-2 is knocked down. We attempted to resolve the potential identity of those cells working with existing scRNA-seq data in collaboration with colleagues, but their low abundance made it difficult to distinguish other populations. While we acknowledge this interesting possibility, we have chosen to focus this report on the role of pou4-2 downstream of soxB1-2, as this represents the most well-supported aspect of the dataset and was positively highlighted by both the reviewer and editor.

      (3) The authors discuss many genes from their analysis that play conserved roles in mechanosensation and hearing. Were there any conserved genes that came up in the analysis of pou4-2(RNAi) planarians that have not yet been studied in human hearing and neurodevelopment? I am wondering the extent to which planarians could be used as a discovery system for mechanosensory neuron function and development, and discussion of this point might increase the impact of this paper or provide critical rationale for expanding work on planarian mechanosensation. 

      Indeed, we agree that planarians could be used to identify conserved genes with roles in mechanosensation and have included this point in the Discussion. In this study, we have focused on demonstrating the conservation of gene regulation. While this study was initially based on a graduate thesis project, we have since generated a more comprehensive dataset from isolated heads, which we are currently analyzing. This has been emphasized in the revised Discussion.

      Minor: 

      (1) For Figure 6E, the authors could consider showing data along a negative axis to indicate a decrease in length in response to vibration and to more clearly show that this decrease doesn't occur as strongly after pou4-2(RNAi). 

      We displayed this behavior as the percent change, as this is a standard way to represent this data. As the percent change is a positive value, we represent the data as these positive values.

      (2) The authors should consider quantifying the decrease of pou4-2 mRNA after atonal(RNAi) conditions, either by RT-qPCR or cell quantification. Visually, the signal in the stripes after atoh8-2(RNAi) seems lower, particularly in the tail. The punctate pattern outside the stripes may also be decreased after atoh8-1(RNAi). But quantification might strengthen the argument. 

      We agree with the reviewer and acknowledge that we should have been more cautious in interpreting these results. Those two genes are difficult to detect and did not show specific patterns in Cowles et al. (2013). The reviewer is correct that additional experiments are necessary before reaching conclusions, but, as discussed earlier, we do not think new experiments would provide insights for the major conclusions. These experiments were exploratory in nature and tangential to our main conclusions, especially in the absence of reciprocal evidence (e.g., shared expression patterns, co-expression, or differential expression in our RNA-seq data). Therefore, we decided to eliminate the atonal in situs following pou4-2 RNAi.

      Reviewer #2 (Recommendations for the authors): 

      A. Expression of pou4-2 in ciliated mechanosensory neurons: 

      (1) The conclusion that pou4-2 is expressed in ciliated mechanosensory neurons is primarily based on co-expression analysis using a published single-cell dataset. Although the authors later show that a subset of pou4-2 cells also express pkd1L-2 (Figure 4A), a known marker of ciliated mechanosensory neurons, this finding is not properly quantified. I recommend moving Figure 4A to earlier in the manuscript (e.g., to Figure 2) and expanding the analysis to include additional known markers of this cell type. Proper quantification of the extent of co-localization is necessary to support the claim robustly. 

      As pointed out by the reviewer, there is substantive evidence from our lab and other reports. King et al. also showed pou4-2 and pkd1L-2 ‘regulation’ in their scRNA-seq data, and this function is conserved in the acoel Hofstenia miamia (Hulett et al., PNAS 2024). Our analysis shows convincing co-localization by scRNA-seq and expression of soxB1-2 and neural markers in the respective populations. Furthermore, we included colocalization of pou4-2 with mechanosensory genes using fluorescence in situ hybridization (Figure 3B, Supplementary Figure 4, and Supplementary File S7). We are confident the data conclusively show pou4-2 regulates pkd1L-2 expression in a subset of mechanosensory neurons. Given the strength of existing observations and previously published data, we believe that additional staining experiments are not essential to support this conclusion.

      (2) There appears to be a conceptual inconsistency in the interpretation of pou4-2 expression dynamics. On one hand, the authors suggest that delayed pou4-2 expression indicates a role in late-stage differentiation (p.6). On the other hand, they propose that pou4-2 may be expressed in undifferentiated progenitors to initiate downstream transcriptional programs (p.8). These interpretations should be reconciled. Additionally, claims regarding pou4-2 expression in progenitor populations should be supported by co-localization with established stem cell or progenitor markers, rather than inferred from signal intensity alone. 

      This is an excellent point, and we agree with the reviewer that this section requires editing. As described in response to Reviewer 1, we attempted BrdU pulse chase experiments but were not successful in consistently detecting pou4-2 at sufficient levels with our protocol. Furthermore, we could not obtain strong signals in double labeling experiments in pou4-2 in situs combined with piwi-1 or PIWI-1 antibodies. We will include those experiments as a future direction and amend our conclusions accordingly.

      (3) The expression pattern shown in Figure 1B raises questions about the precise anatomical localization of pou4-2 cells. It is unclear whether these cells reside in the subepidermal plexus or the deeper submuscular plexus, which represent distinct neuronal layers (Ross et al., 2017). The observed signals near the ventral nerve cords could suggest submuscular localization. To clarify this, higher-resolution imaging and co-staining with region-specific neural markers are recommended. 

      In Ross et al. (2018), we showed that the pkd1L-2<sup>+</sup> cells are located submuscularly. The pkd1L-2 cells express pou4-2, thus the pou4-2<sup>+</sup> cells are located in the same location. Based on co-expression data and co-expression with PKD genes, we are confident it is submuscular.

      B. The functional requirements of pou4-2 in the maintenance of mechanosensory neurons: 

      (1) To evaluate the functional role of pou4-2 in maintaining mechanosensory neurons, the authors performed whole-animal RNA-seq on pou4-2(RNAi) and control animals, identifying a significant downregulation of genes associated with mechanosensory neuron expression. However, the presentation of these findings is fragmented across Figures 3, 4, and 5. I recommend consolidating the RNA-seq results (Figure 3) and the subsequent validation of downregulated genes (Figures 4 and 5) into a single, cohesive figure. This would improve the logical flow and clarity of the manuscript. 

      As suggested by the reviewer, we have combined Figures 3 and 4 (new Figure 3), which we believe improves the flow. We decided to keep Figure 5 (new Figure 4) as a standalone because it focuses on the characterization of new genes revealed by RNAseq and scRNA-seq data mining that were not previously reported in Ross et al. 2018 and 2024.

      (2) In pou4-2(RNAi) animals, pkd1L-2 expression appears to be entirely lost, while hmcn-1-L shows faint expression in scattered peripheral regions. The authors suggest that an extended RNAi treatment might be necessary to fully eliminate hmcn-1-L expression. However, an alternative explanation is that pou4-2 is not essential for maintaining all hmcn-1-L cells, particularly if pou4-2 expression does not fully overlap with that of hmcn-1-L. This possibility should be acknowledged and discussed. 

      We agree and have acknowledged this point in the revised text.

      (3) On page 9, the section title claims that "Smed-pou4-2 regulates genes involved in ciliated cell structure organization, cell adhesion, and nervous system development." While some differentially expressed genes are indeed annotated with these functions based on homology, the manuscript does not provide experimental evidence supporting their roles in these biological processes in planarians. The title should be revised to avoid overstatement, and the limitations of extrapolating a function solely from gene annotation should be acknowledged. 

      Excellent point. We have edited the text to indicate that the genes were annotated or implicated.

      (4) The cilia staining presented in Figure 6B to support the claim that pou4-2 is required for ciliated cell structure organization is unconvincing. Improved imaging and more targeted analysis (e.g., co-labeling with mechanosensory markers) are needed to support this conclusion. 

      We have addressed this concern by adjusting the language to be more precise and indicate that the stereotypical banded pattern is disrupted with decreased cilia labeling along the dorsal ciliated stripe. Indeed, our conclusion overstated the observations made with the staining and imaging resolution. Thank you.

      C. The functional requirements of pou4-2 in the regeneration of mechanosensory neurons: 

      To evaluate the role of pou4-2 in the regeneration of mechanosensory neurons, the authors performed amputations on pou4-2(RNAi) and control(RNAi) animals and assessed the expression of mechanosensory markers (pkd1L-2, hmcn-1-L) alongside a functional assay. However, the results shown in Figure 4B indicate the presence of numerous pkd1L-2 and hmcn-1-L cells in the blastema of pou4-2(RNAi) animals. This observation raises the possibility that pou4-2 may not be essential for the regeneration of these mechanosensory neurons. The authors should address this alternative interpretation. 

      Our interpretation is that there were very few cells expressing the markers compared to controls. The pattern was predominantly lost, which is consistent with other experiments shown in the paper. However, we have added the additional caveat suggested by the reviewer.

      Minor points: 

      (1) On p.8, the authors wrote "every 12 hours post-irradiation". However, this is not consistent with the figure, which only shows 0, 3, 4, 4.5, 5, and 5.5 dpi. 

      We corrected this. Thank you for catching the mistake!

      (2) On p.12, the authors wrote "Analysis of pou4-2 RNAi data revealed differentially expressed genes with known roles in mechanosensory functions, such as loxhd-1, cdh23, and myo7a. Mutations in these genes can cause a loss of mechanosensation/transduction". This is misleading because, to my knowledge, the role of these genes in planarians is unknown. If the authors meant other model systems, they should clearly state this in the text and include proper references. 

      The reviewer is correct that we are referencing findings from other organisms. We have clarified this point in the revised text. The appropriate references were included and cited in the first version.

      (3) On p.7, the authors wrote, "conversely, the expression of atonal genes was unaffected in pou4-2 RNAi-treated regenerates (Supplementary Figure S2B)". However, it is unclear whether the Atoh8-1 and Atoh8-2 signals are real, as the quality of the in situ results is too low to distinguish between real signals and background noise/non-specific staining. 

      This valid concern was addressed in our response to Reviewer 1. We have adjusted the figure and the text accordingly.

      (4) On p.6 the authors wrote "pinpointed time points wherein the pou4-2 transcripts were robustly downregulated". However, the current version of the manuscript does not provide data explaining why Pou4-2 transcripts are robustly downregulated on day 12. 

      Yes, we determined the appropriate time points using qPCR for all sample extractions. As an example, see the figure for qPCR validation at day 12 showing that pou4-2 and pkd1L-2 are down.

      Author response image 1.

      In this graph, samples labeled “G” represent four biological controls of gfp(RNAi) control animals, and samples labeled “P” represent four biological controls of pou4-2(RNAi) animals at day 12 in the RNAi protocol.

      (5) On p.13, the authors wrote "collecting RNA from how animals." Is this a typo? 

      Thanks for catching the typo. It should read “whole” animals. We have corrected this.

      (6) On p.14, the authors wrote "but the expression patterns of planarian atonal genes indicated that they represent completely different cell populations from pou4-2-regulated mechanosensory neurons". However, this is unclear from the images, as the in situ stainings of Atoh8-1 and Atoh8-2 have potentially failed.

      We agree. We have edited accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The manuscript "Lifestyles shape genome size and gene content in fungal pathogens" by Fijarczyk et al. presents a comprehensive analysis of a large dataset of fungal genomes to investigate what genomic features correlate with pathogenicity and insect associations. The authors focus on a single class of fungi, due to the diversity of lifestyles and availability of genomes. They analyze a set of 12 genomic features for correlations with either pathogenicity or insect association and find that, contrary to previous assertions, repeat content does not associate with pathogenicity. They discover that the number of protein-coding genes, including the total size of non-repetitive DNA, does correlate with pathogenicity. However, unique features are associated with insect associations. This work represents an important contribution to the attempts to understand what features of genomic architecture impact the evolution of pathogenicity in fungi.

      Strengths:

      The statistical methods appear to be properly employed and analyses thoroughly conducted. The manuscript is well written and the information, while dense, is generally presented in a clear manner.

      Weaknesses:

      My main concerns all involve the genomic data, how they were annotated, and the biases this could impart to the downstream analyses. The three main features I'm concerned with are sequencing technology, gene annotation, and repeat annotation.

      We thank the reviewer for all the comments. We are aware that the genome assemblies are of heterogeneous quality since they come from many sources. The goal of this study was to make the best use of the existing assemblies, with the assumption that noise introduced by the heterogeneity of sequencing methods should be overcome by the robustness of evolutionary trends and the breadth and number of analyzed assemblies. Therefore, at worst, we would expect a decrease in the power to detect existing trends. It is important to note that the only way to confidently remove all potential biases would be to sequence and analyze all species in the same way; this would require a complete study and is beyond the scope of the work presented here. Nevertheless, some biases could affect the results in a negative way, e.g., if they affect fungal lifestyles differently. We therefore made an attempt to explore the impact of sequencing technology and of the gene and repeat annotation approach among genomes of different fungal lifestyles. Details are described in Supplementary Results and below. Overall, even though the assembly size and annotations conducted with Augustus can sometimes vary compared to annotations from other resources, such as JGI Mycocosm, we do not observe a bias associated with fungal lifestyles. Comparison of annotations conducted with Augustus and the JGI Mycocosm dataset revealed variation in gene-related features that reflects biological differences rather than issues with annotation.

      The collection of genomes is diverse and includes assemblies generated from multiple sequencing technologies including both short- and long-read technologies. Not only has the impact of the sequencing method not been evaluated, but the technology is not even listed in Table S1. From the number of scaffolds it is clear that the quality of the assemblies varies dramatically. This is going to impact many of the values important for this study, including genome size, repeat content, and gene number.

      We have now added sequencing technology in Table S1 as it was reported in NCBI. We evaluated the impact of long-read (Nanopore, PacBio, Sanger) vs short-read assemblies in Supplementary Results. In short, the proportions of different lifestyles (pathogenic vs. non-pathogenic, IA vs non-IA) were the same for short- and long-read assemblies. Indeed, long-read assemblies were longer, had a higher fraction of repeats and fewer genes on average, but the differences between pathogenic vs. non-pathogenic (or IA vs non-IA) species were in the same direction for the two sequencing technologies and in line with our results. There were some discrepancies, e.g., mean intron length was longer for pathogens with long-read assemblies, but slightly shorter on average for short-read assemblies (and to a lesser extent GC and pseudo tRNA count), which could explain weaker or mixed results in our study for these features.

      Additionally, since some filtering was employed for small contigs, this could also bias the results.

      The reason behind setting the lower contig length threshold was the fact that assemblies submitted to NCBI have varying lower-length thresholds. This is because assemblers do not output contigs below a certain length, and this threshold can be manipulated by the user. Setting a common minimum contig length was meant to remove this variation, knowing that any length cut-off will have a larger effect on short-read-based assemblies than on long-read-based assemblies. Notably, genome assemblies of corresponding species in JGI Mycocosm have a minimum contig length of 865 bp, not much lower than in our dataset. Importantly, in response to a comment from a previous reviewer, repeat content was recalculated on raw assembly lengths instead of on filtered assembly lengths.
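      The two operations described in this response, applying a common minimum contig length and computing repeat content against the raw (unfiltered) assembly length, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the 1,000 bp threshold, and the contig lengths are all hypothetical.

```python
def filter_contigs(contig_lengths, min_length):
    """Keep only contigs that are at least min_length bp long."""
    return [n for n in contig_lengths if n >= min_length]


def repeat_fraction(repeat_bp, contig_lengths):
    """Fraction of the assembly (sum of the given contig lengths) annotated as repeats."""
    total = sum(contig_lengths)
    if total == 0:
        raise ValueError("assembly has no sequence")
    return repeat_bp / total


# Hypothetical contig lengths in bp; not data from the study.
raw = [120_000, 45_000, 900, 400, 250]

# Hypothetical common threshold; the actual cut-off is not stated in this response.
kept = filter_contigs(raw, min_length=1_000)

# Repeat content relative to the raw assembly length, not the filtered one.
raw_repeat_fraction = repeat_fraction(16_655, raw)
```

      Because short-read assemblies tend to contain many more small contigs, the same threshold removes a larger share of their sequence, which is why the denominator used for repeat content matters.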

      I have considerable worries that the gene annotation methods could impart biases that significantly affect the main conclusions. Only 5 reference training sets were used for the Sordariomycetes and these are unequally distributed across the phylogeny. Augustus obviously performed less than ideally, as the authors reported that it under-annotated the genomes by 10%. I suspect it will have performed worse with increasing phylogenetic distance from the reference genomes. None of the species used for training were insect-associated, except for those generated by the authors for this study. As this feature was used to split the data it could impact the results. Some major results rely explicitly on having good gene annotations, like exon length, adding to these concerns. Looking manually at Table S1 at Ophiostoma, it does seem to be a general trend that the genomes annotated with Magnaporthe grisea have shorter exons than those annotated with H294. I also wonder if many of the trends evident in Figure 5 are also the result of these biases. Clades H1 and G each contain a species used in the training and have an increase in genes for example.

      We have applied 6 different reference training sets (instead of one) precisely to address the problem of increasing phylogenetic distance of annotated species. To further investigate the impact of the species chosen for training, we plotted five gene features (number of genes, number of introns, intron length, exon length, fraction of genes with introns) as a function of branch length distance from the species (or genus) used as a training set for annotation. We don’t see systematic biases across different training sets. However, trends are very clear for clades annotated with fusarium. This set of species includes Hypocreales and Microascales, which is indeed unfortunate since Microascales is an IA group and at the same time the most distant from the Fusarium genus in this set. To clarify whether this trend is related to annotation bias or is a biological trend, we compared gene annotations with those of Mycocosm, between Hypocreales Fusarium species, Hypocreales non-Fusarium species, and Microascales, and we observe exactly the same trends in all gene features.

      Similarly, among species that were annotated with magnaporthe_grisea, Ophiostomatales (another IA group) are among the most distant from the training set species. Here, however, another order, Diaporthales, is similarly distant, yet the two orders display different feature ranges. In terms of exon length, the top two species in this training set include Ophiostoma, and they reach exon lengths similar to those of the Ophiostoma species annotated using H294 as a training set. In summary, it is possible that the choice of annotation species has some effect on feature values; however, in this dataset, these biases are likely mitigated by biological differences among lifestyles and clades.
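      The response above describes plotting gene features against branch-length distance from the training species. One way to put a number on such a trend, offered here only as a sketch and not as the analysis the authors performed, is a Spearman rank correlation computed separately for each training set. All function names and values below are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for one training set; not data from the study.
distance_to_training_species = [0.1, 0.2, 0.3, 0.4]
mean_exon_length = [610, 580, 560, 500]
rho = spearman(distance_to_training_species, mean_exon_length)
```

      A strong correlation within one training set would flag a trend worth checking against an independent annotation, as the authors did with Mycocosm; on its own it cannot distinguish annotation bias from a genuine biological difference between clades.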

      Unfortunately, the genomes available from NCBI will vary greatly in the quality of their repeat masking. While some will have been masked using custom libraries generated with software like RepeatModeler, others will probably have been masked with public databases like Repbase. As public databases are again biased towards certain species (Fusarium is well represented in Repbase for example), this could have significant impacts on estimating repeat content. Additionally, even custom libraries can be problematic as some software (like RepeatModeler) will include multicopy host genes leading to bona fide genes being masked if proper filtering is not employed. A more consistent repeat masking pipeline would add to the robustness of the conclusions.

      We have searched for the same species in JGI Mycocosm and were able to retrieve 58 genome assemblies with matching species, with 19 of them belonging to the same strain as in our dataset. Overall we found no differences in genome assembly length. Interestingly, repeat content was slightly higher for NCBI genome assemblies compared to JGI Mycocosm assemblies, perhaps due to masking of host multicopy genes, as the reviewer mentioned. By comparing pathogenic and non-pathogenic species for the same 19 strains, we observe that JGI Mycocosm annotates fewer repeats in pathogenic species than Augustus annotations (but trends are similar when taking into account 58 matching species). Given a small number of samples, it is hard to draw any strong conclusions; however, the differences that we see are in favor of our general results showing no (or negative) correlation of repeat content with pathogenicity. 

      To a lesser degree, I wonder what impact the use of representative genomes for a species has on the analyses. Some species vary greatly in genome size, repeat content, and architecture among strains. I understand that it is difficult to address in this type of analysis, but it could be discussed.

      In our case the use of protein sequences could underestimate divergence between closely related strains from the same species. We also excluded strains of the same species to avoid overrepresentation of closely related strains with similar lifestyle traits. We agree that some changes in the genome architecture can occur very rapidly, even at the species level, though analyzing the emergence of, e.g., pathogenicity at the population level would require a slightly different approach which accounts for population-level processes.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors report on the genomic correlates of the transition to the pathogenic lifestyle in Sordariomycetes. The pathogenic lifestyle was found to be better explained by the number of genes, and in particular effectors and tRNAs, but this was modulated by the type of interacting host (insect or not insect) and the ability to be vectored by insects.

      Strengths:

      The main strength of this study lies in the size of the dataset, and the potentially high number of lifestyle transitions in Sordariomycetes.

      Weaknesses:

      The main strength of the study is not the clarity of the conclusions.

      (1) This is due firstly to the presentation of the hypotheses. The introduction is poorly structured and contradictory in some places. It is also incomplete since, for example, fungus-insect associations are not mentioned in the introduction even though they are explicitly considered in the analyses.

      We thank the reviewer for pointing this out. We strove to address all comments and suggestions of the reviewer to clarify the message and remove the contradictions. We also added information about why we included the insect-association trait in our analysis.

      (2) The lack of clarity also stems from certain biases that are challenging to control in microbial comparative genomics. Indeed, defining lifestyles is complicated because many fungi exhibit different lifestyles throughout their life cycles (for instance, symbiotic phases interspersed with saprotrophic phases). In numerous fungi, the lifestyle referenced in the literature is merely the sampling substrate (such as wood or dung), which doesn't mean that this substrate is a crucial aspect of the life cycle. This issue is discussed by the authors, but they do not eliminate the underlying uncertainties.

      We agree with the reviewer that lack of certainty in the lifestyle or range of possible lifestyles of studied species is a weakness in this analysis. We are limited by the information available in the literature. We hope that our study will increase interest in collecting such data in the future.

      Reviewer #3 (Public review):

      Summary:

      This important study combines comparative genomics with other validation methods to identify the factors that mediate genome size evolution in Sordariomycetes fungi and their relationship with lifestyle. The study provides insights into genome architecture traits in this Ascomycete group, finding that, rather than transposons, the size of their genomes is often influenced by gene gain and loss. With an excellent dataset and robust statistical support, this work contributes valuable insights into genome size evolution in Sordariomycetes, a topic of interest to both the biological and bioinformatics communities.

      Strengths:

      This study is complete and well-structured.

      Bioinformatics analysis is always backed by good sampling and statistical methods. Also, the graphic part is intuitive and complementary to the text.

      Weaknesses:

      The work is great in general, I just had issues with the Figure 1B interpretation.

      I struggled a bit to find the correspondence between this sentence: "Most genomic features were correlated with genome size and with each other, with the strongest positive correlation observed between the size of the assembly excluding repeats and the number of genes (Figure 1B)." and the Figure 1B. Perhaps highlighting the key p values in the figure could help.

      We thank the reviewer for pointing out this sentence. Perhaps the misunderstanding comes from the fact that in this sentence one variable is missing. The correct version should be “Most genomic features were correlated with genome size and with each other, with the strongest positive correlation observed between the genome size, the genome size excluding repeats and the number of genes (Figure 1B)”. Also, the variable names now correspond better to those shown on the figure.

      Reviewer #1 (Recommendations for the authors):

      The authors have clearly done a lot of good work, and I think this study is worthwhile. I understand that my concerns about the underlying data could necessitate rerunning the entire analysis with better gene models, but there may be another option. JGI has a fairly standard pipeline for gene and repeat annotation. Their gene predictions are based on RNA data from the sequenced strain and should be quite good in general. One could either compare the annotations from this manuscript to those in mycocosm for genomes that are identical and see if there are systematic biases, or rerun some analyses on a subset of genomes from mycocosm. Indeed, it's possible that the large dataset used here compensates for the above concerns, but without some attempt to evaluate these issues, it's difficult to have confidence in the results.

      We greatly appreciate the positive reception of our manuscript. Following the reviewer’s comments we have investigated gene annotations in comparison with those of JGI Mycocosm, even though only 58 species were matching and only 19 of them were from the same strain. This dataset is not representative of the Sordariomycetes diversity (most species come from one clade), and therefore will not reflect the results we obtained in this study. Of note, the reason for not choosing JGI Mycocosm in the first place was the poor representation of the insect-associated species, which we found key in this study. In general, we found that assembly lengths were nearly identical, the number of genes was higher, and the repeat content was lower for the JGI Mycocosm dataset. When comparing different lifestyles (in particular pathogens vs. non-pathogens), we found the same differences for our and JGI Mycocosm annotations, with one exception being the repeat content. In the small subset (19 same-strain assemblies), our dataset showed the same level of repeats between the two lifestyles, whereas JGI Mycocosm showed lower repeat content for pathogens (but notably for all 58 species, the trend was the same for our and JGI Mycocosm annotations). None of these observations are in conflict with our results where we find no or negative association of repeat content with pathogens.

      The figures are very information-dense. While I accept that this is somewhat of a necessity for presenting this type of study, if the authors could summarize the important information in easier-to-interpret plots, that could help improve readability.

      We put a lot of effort into showing these complicated results in as approachable a manner as possible. Given that other reviewers find them intuitive we decided to keep most of them as they are. To add more clarification, we added one supplementary figure showing distributions of genomic traits across lifestyles. Moreover, in Figure 5, a phylogenetic tree was added with the position of selected clades, as well as a scatterplot showing distributions of mean values for genome size and number of genes for those clades. If the reviewer has any specific suggestions on what to improve and in which figure, we’re happy to consider them.

      Reviewer #2 (Recommendations for the authors):

      I have no major comments on the analyses, which have already been extensively revised. My major criticism is the presentation of the background, which is very insufficient to understand the importance or relevance of the results presented fully.

      Lines are not numbered, unfortunately, which will not help the reading of my review.

      (1) The introduction could better present the background and hypotheses:

      (a) After reading the introduction, I still didn't have a clear understanding of the specific 'genome features' the study focuses on. The introduction fails to clearly outline the current knowledge about the genetic basis of the pathogenic lifestyle: What is known, what remains unknown, what constitutes a correlation, and what has been demonstrated? This lack of clarity makes reading difficult.

      We thank the reviewer for pointing this out. We have now included in the introduction a list of genomic traits we focus on. We also tried to be more precise about demonstrated pathogenic traits and other correlated traits in the introduction. 

      (b) Page 3. « Various features of the genome have been implicated in the evolution of the pathogenic lifestyle. » The cited studies did not genuinely link genome features to lifestyle, so the authors can't use « implicated in » - correlation does not imply causation.

      This sentence also somehow contradicts the one at the end of the paragraph: « we still have limited knowledge of which genomic features are specific to pathogenic lifestyle ».

      We thank the reviewer for this comment. We added a phrase “correlated with or implicated in” and changed the last sentence of the paragraph into “Yet we still have limited knowledge of how important and frequent different genomic processes are in the evolution of pathogenicity across phylogenetically distinct groups of fungi and whether we can use genomic signatures left by some of these processes as predictors of pathogenic state.”.

      (c) Page 3: « Fungal pathogen genomes, and in particular fungal plant pathogen genomes have been often linked to large sizes with expansions of TEs, and a unique presence of a compartmentalized genome with fast and slow evolving regions or chromosomes » Do the authors really need to say « often »? Do they really know how often?

      We removed “often”.

      (d) « Such accessory genomic compartments were shown to facilitate the fast evolution of effectors (Dong, Raffaele, and Kamoun 2015) ». The cited paper doesn't « show » that genomic compartments facilitate the fast evolution of effectors. It's just an observation that there might be a correlation. It's an opinion piece, not a research manuscript.

      We changed the sentence to “Such accessory genomic compartments could facilitate the fast evolution of effectors”.

      (e) « even though such architecture can facilitate pathogen evolution, it is currently recognized more as a side effect of a species evolutionary history rather than a pathogenicity related trait ». This sentence somehow contradicts the following one: « Such accessory genomic compartments were shown to facilitate the fast evolution of effectors ».

      Here we wanted to point out that even though accessory genome compartments and TE expansions can facilitate pathogen evolution, the origin of such architecture is not linked to pathogenicity. We reformulated the sentence to “Even though such architecture can facilitate pathogen evolution, it is currently recognized that its origin is more likely a side effect of a species evolutionary history rather than being caused by pathogenicity”.

      (f) « As the number of genes is strongly correlated with fungal genome size (Stajich 2017), such expansions could be a major contributor to fungal genome size. » This sentence suggests that pathogens might have bigger genomes because they have more effectors. This is contradictory to the sentence right after « At the end of the spectrum are the endoparasites Microsporidia, which have among the smallest known fungal genomes ».

      The authors state that pathogens have bigger genomes and then they take an example of a pathogen that has a minimal genome. I know it's probably because they lost genes following the transition to endoparasitism and not related to their capacity to cause disease. I just want to point out that their writing could be more precise. I invite authors to think of young scholars who are new to the field of fungal evolutionary genomics.

      We thank the reviewer for prompting us to clarify the text. We rewrote this short extract as follows “Notably, not all pathogenic species experience genome or gene expansions, or show compartmentalized genome architecture. While gene family expansions are important for some pathogens, the contrary can be observed in others, such as Microsporidia. Due to transition to obligatory intracellular lifestyle these fungi show signatures of strong genome contractions and reduced gene repertoire (Katinka et al. 2001) without compromising their ability to induce disease in the host. This raises questions about universal genomic mechanisms of transition to pathogenic state.”

      (g) I find it strange that the authors do not cite - and do not present the major results of two other studies that use the same type of approach and ask the same type of question in Sordariomycetes, although not focusing on pathogenicity:

      Hensen et al.: https://pubmed.ncbi.nlm.nih.gov/37820761/

      Shen et al.: https://pubmed.ncbi.nlm.nih.gov/33148650/

      We thank the reviewer for pointing out this omission. We now added more information in the introduction to highlight the importance of the phylogenetic context in studying genome evolution as demonstrated by these studies. The following part was added to introduction:  “Other phylogenomic studies investigating a wide range of Ascomycete species, while not explicitly focusing on the neutral evolution hypothesis, have found strong phylogenetic signals in genome evolution, reflected in distinct genome characteristics (e.g., genome size, gene number, intron number, repeat content) across lineages or families (Shen et al. 2020; Hensen et al. 2023). Variation in genome size has been shown to correlate with the activity of the repeat-induced point mutation (RIP) mechanism (Hensen et al. 2023; Badet and Croll 2025), by which repeated DNA is targeted and mutated. RIP can potentially lead to a slower rate of emergence of new genes via duplication (Galagan et al. 2003), and hinder TE proliferation limiting genome size expansion (Badet and Croll 2025). Variation in genome dynamics across lineages has also been suggested to result from environmental context and lifestyle strategies (Shen et al. 2020), with Saccharomycotina yeast fungi showing reductive genome evolution and Pezizomycotina filamentous fungi exhibiting frequent gene family expansions. Given the strong impact of phylogenetic membership,  demographic history (Ne) and host-specific adaptations of pathogens on their genomes, we reasoned that further examination of genomic sequences in groups of species with various lifestyles can generate predictions regarding the architecture of pathogenic genomes.”

      (h) Genome defense mechanisms against repeated elements, such as RIP, are not mentioned while they could have a major impact on genome size (Hensen et al cited above; Badet and Croll https://www.biorxiv.org/content/10.1101/2025.01.10.632494v1.full).

      This citation is added in the text above.

      (i) Should the reader assume that the genome features to be examined are those mentioned in the first paragraph or those in the penultimate one?

      In the last paragraph of the introduction we included the complete list of investigated genomic traits.

      (j) The insect-associated lifestyle is mentioned only in the research questions on page 4, but not earlier in the introduction. Why should we care about insect-associated fungi?

      We apologize for this omission. We added a sentence explaining how neutral evolution hypotheses can explain patterns of genome evolution in endoparasites and species with specialized vectors (traits present in insect-associated species) and added a sentence in the last paragraph that this is the reason why we have selected this trait for analysis.  

      (2) Why use concatenation to infer phylogeny?

      (a) Kapli et al. https://pubmed.ncbi.nlm.nih.gov/32424311/ « Analyses of both simulated and empirical data suggest that full likelihood methods are superior to the approximate coalescent methods and to concatenation »

      (b) It also seems that a homogeneous model was used, and not a partitioned model, while the latter are more powerful. Why?

      We thank the reviewer for the comment. When we were reconstructing the phylogenetic tree, we were not aware of this publication, and we followed common practices from the literature for phylogenetic tree reconstruction, even though they are currently not regarded as optimal. In fact, in the first round of submission, we included a concatenation method, a multispecies coalescent method based on 1000 BUSCO sequences, and a concatenation method with different partitions for 250 BUSCO sequences. All three methods produced similar topologies. Since the results were concordant, we chose to omit these analyses from the manuscript to streamline the presentation and focus on the most important results.

      (3) Other comments:

      Is there a table listing lifestyles?

      Yes, lifestyles (pathogenicity and insect-association) are listed in Supplementary Table S1. 

      (4) Summary:

      (a) « seemingly similar pathogens »: meaning unclear; on what basis are they similar? why « seemingly »?

      We removed “seemingly” from the sentence.

      (b) Page 4: what's the difference between genome feature and genome trait?

      There is no difference. We apologize for the confusion. We changed “feature” to “trait” whenever it refers to the specific 13 genomic traits analyzed in this study.

      (c) Page 22: Braker, not Breaker

      Corrected.

      What do the authors mean when they write that genes were predicted with Augustus and Braker? Do they mean that the two sets of gene models were combined? Gene counts are based on Augustus (P24): why not Braker?

      We only meant here that gene annotation was performed using the Braker pipeline, which uses a particular version of Augustus. We corrected the sentence.

      (d) Figure 2B and 2C:

      'Undetermined sign' or 'Positive/Negative' would be better than « YES » or it's just impossible to understand the figure without reading the legend.

      We changed “YES” to “UNDETERMINED SIGN” as suggested by the reviewer.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviews):

      Summary: 

      Here, Millet et al. consider whether the nematode C. elegans 'discounts' the value of reward due to effort in a manner similar to that shown in other species, including rodents and humans. They designed a T-maze effort choice paradigm inspired by previous literature, but manipulated how effortful the food is to consume. C. elegans worms were sensitive to this novel manipulation, exhibiting effort-discounting-like behaviour that could be shaped by varying the density of food at each alternative in order to calculate an indifference point. This discounting-like behaviour was related to worms' rates of patch leaving, which differed between the low and high effort patches in isolation. The authors also found a potential relationship to dopamine signalling, and also that this discounting behaviour was not specific to lab-based strains of C. elegans.

      Strengths: 

      The question is well-motivated, and the approach taken here is novel. The authors are careful in their approach to altering and testing the properties of the effortful, elongated bacteria. Similarly, they go to some effort to understand what exactly is driving behavioural choices in this context, both through the application of simple standard models of effort discounting and a kinetic analysis of patch leaving. The comparisons to various dopamine mutants further extend the translational potential of their findings. I also appreciate the comparison to natural isolate strains, as the question of whether this behaviour may be driven by some sort of strain-specific adaptation to the environment is not regularly addressed in mammalian counterparts. The manuscript is well-written, and the figures are clear and comprehensible. 

      Weaknesses: 

      Discounting is typically defined as the alteration of a subjective value by effort (or time, risk, etc.), which is then used to guide future decision-making. By adapting the standard t-maze task for C. elegans as a patch-leaving paradigm, the authors observe behaviour strongly consistent with discounting models, but that is likely driven by a different process, in particular by an online estimate of the type of food in the current patch, which then influences patch-leaving dynamics (Figure 3). This is fundamentally different from decision-making strategies relating to effort that have been described in the rodent and human literatures. 

      We agree that in our study worms are likely making an on-line estimate of food quality in the current patch, but we wish to point out that rodents and humans also use on-line estimates in some significant effort-discounting paradigms. With respect to rodents, we call attention to effort discounting studies involving the widely used progressive ratio task (references in Discussion). In this task, animals can either lever-press for a preferred food or consume a less preferred food that is freely available nearby. However, the number of lever presses required to obtain preferred food increases as a function of the cumulative number of lever presses until the effort-cost of obtaining preferred food becomes too high and the animal switches to a freely available food. In essence, the lever and the freely available food are patches and the animal decides whether or not to leave the “lever” patch. It seems inescapable that the progressive ratio task involves an on-line assessment of the cost/benefit relationship associated with lever pressing. With respect to humans, one highly cited study (reference in Discussion) presented participants with a series of virtual apple trees. They could see how many apples are in the current tree and how much effort (squeezing a handgrip) is required to gather them. Their task was to decide whether or not to gather apples from that tree based on the perceived cost and benefit. Thus, on-line estimation is a common strategy used by animals and humans as shown in the effort discounting literature. We now make this point in the Discussion section titled A model of effort-discounting like behavior.

      Similarly, the calculation of indifference points at the group instead of at the individual level also suggests a different underlying process and limits the translational potential of their findings. The authors do not discuss the implications of these differences or why they chose not to attempt a more analogous trial-based experiment.  

      It is not clear to us why changing the read-out, from the individual level to the population level, necessarily suggests that a different biological mechanism is at work. In our view, there is one mechanism and it can be seen from different perspectives (e.g., individual vs population). Furthermore, the analogous trial-based experiment, as we understand it, would be to record behavior one worm at a time in the T-maze. This design is not practical because it entails recording a large number of single worms in the T-maze for 60 min each.

      In the case of both the dopamine and natural isolate experiments, the data are very noisy despite large (relative to other C. elegans experiments) sample sizes. In the dopamine experiment, disruption of dop-1, dop-2, and cat-2 had no statistically significant effect. There do not appear to be any corrections for multiple comparisons, and the single significant comparison, for dop-3, had a small effect size.

      An ANOVA followed by a Dunnett test was used to test differences between groups in Fig. 4 and 5. The Dunnett test is a multiple comparison test comparing experimental groups to a single control group. It is used to minimize type I error while maintaining statistical power and does not require further correction for multiple comparisons. We have clarified the use of the Dunnett test in the statistical table. The effect size for dop-3 is 0.5 (Cohen’s d), which is typically interpreted as a medium, not small, effect size (e.g., Cohen, Psychological Bulletin, 1992, Vol. 112, No. 1, 155-159).
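      For readers unfamiliar with the effect size cited in this response, Cohen's d is the difference between two group means divided by their pooled standard deviation; by Cohen's conventions, values near 0.2, 0.5, and 0.8 are read as small, medium, and large. The sketch below assumes the pooled-standard-deviation form and is not taken from the authors' analysis code; the function name and the group values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation of the two groups."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical preference scores for a mutant and a control group.
mutant = [2, 4, 6]
control = [1, 3, 5]
d = cohens_d(mutant, control)
```

      Reporting d alongside the Dunnett-adjusted p-values separates the question of whether a difference is detectable from the question of how large it is.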

      More detailed behavioural analyses on both these and the wild isolate strains, for example by applying their kinetic analysis, would likely give greater insight as to what is driving these inconsistent effects. 

      More detailed behavioral analysis could reveal why we observe a difference in effort discounting in some strains and not others. However, it is not obvious what type of behavioral analysis would be needed to differentiate between pleiotropic effects of the mutations/natural isolates and more specific effects on effort discounting. A simple kinetic analysis in particular may not be enough to reveal relevant differences between mutants/natural isolates. For this reason, we think that such experiments may be better suited for future follow up studies.

      Reviewer #2 (Public Reviews)

      Summary: 

      Millet et al. show that C. elegans systematically prefers easy-to-eat bacteria but will switch its choice when harder-to-eat bacteria are offered at higher densities, producing indifference points that fit standard economic discounting models. Detailed kinetic analysis reveals that this bias arises from unchanged patch-entry rates but significantly elevated exit rates on effortful food, and dop-3 mutants lose the preference altogether, implicating dopamine in effort sensitivity. These findings extend effort-discounting behavior to a simple nematode, pushing the phylogenetic boundary of economic cost-benefit decision-making.

      Strengths: 

      (1) Extends the well-characterized concept of effort discounting into C. elegans, setting a new phylogenetic boundary and opening invertebrate genetics to economic-behavior studies.

      (2) Elegant use of cephalexin-elongated bacteria to manipulate "effort" without altering nutritional or olfactory cues, yielding clear preference reversals and reproducible indifference points. 

      (3) Application of standard discounting models to predict novel indifference points is both rigorous and quantitatively satisfying, reinforcing the interpretation of worm behavior in economic terms. 

      (4) The three-state patch-model cleanly separates entry and exit dynamics, showing that increased leaving rates, rather than altered re-entry, drive choice biases.

      (5) Investigates the role of dopamine in this behavior to try to establish shared mechanisms with vertebrates. 

      (6) Demonstration of discounting in wild strain (solid evidence). 

      Weaknesses: 

      (1) The kinetic model omits rich trajectory details, such as turning angles or hazard functions, that could distinguish a bona fide roaming transition from other exit behaviors.

      The overarching goal of the present paper was to develop a simple model for effort discounting in a small, genetically tractable organism. Accordingly, we focused on quantitative assays that are easy to implement and analyze. The patch-leaving assay and its associated kinetic analysis are one such assay. To keep things simple in this assay, we counted the number of transitions between the three states shown in Fig. 3A. We chose not to analyze the data in terms of turning angles or hazard functions because the metrics we developed seemed sufficient. Finally, we note that there are new modeling data showing that the presumptive transitions into the roaming state can be explained in terms of a one-state stochastic model in which there is no discrete roaming state (Elife. 2025 Jul 30;14:RP104972. doi: 10.7554/eLife.104972. PMID: 40736321).
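      As a rough illustration of the transition counting described in the response to weakness (1), the sketch below tallies state changes in a frame-by-frame state sequence and converts exits into a per-frame exit rate. The state labels ("off", "easy", "hard"), the one-label-per-frame encoding, and the function names are assumptions made for illustration; they are not the definitions used in Fig. 3A.

```python
from collections import Counter


def count_transitions(states):
    """Count each (from_state, to_state) change between consecutive frames."""
    return Counter(
        (prev, nxt) for prev, nxt in zip(states, states[1:]) if prev != nxt
    )


def exit_rate(states, patch, dt=1.0):
    """Exits from `patch` divided by the time spent in it (per unit of dt)."""
    # The last frame is excluded because no transition can be observed after it.
    time_in_patch = states[:-1].count(patch) * dt
    if time_in_patch == 0:
        return float("nan")
    exits = sum(
        count
        for (prev, _nxt), count in count_transitions(states).items()
        if prev == patch
    )
    return exits / time_in_patch


# Hypothetical track of a single worm, one label per frame.
track = ["off", "off", "easy", "easy", "off", "hard", "off"]
print(count_transitions(track))
print(exit_rate(track, "easy"), exit_rate(track, "hard"))
```

      In this toy track the worm leaves the "hard" patch after one frame and the "easy" patch after two, so the hard-patch exit rate comes out higher. That is the kind of asymmetry a transition-count analysis is meant to expose.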

      (2) Only dop-3 shows an effect, and the statistical validity of this result is questionable. It is not clear if the authors corrected for multiple comparisons, and the effect size is quite small and noisy, given the large number of worms tested. Other mutants do not show effects. Given these two concerns, the role of dopamine in C. elegans effort discounting was unconvincing. 

      An ANOVA followed by a Dunnett test was used to test statistical significance in figures 4 and 5 (see above for a discussion of these tests). We believe this approach is rigorous, and the use of these tests is statistically valid. We note that the effect size for this comparison was medium.

      (3) With only five wild isolates tested (and variable data quality), it's hard to conclude that effort discounting isn't a lab-strain artifact or how broadly it varies in natural populations. 

      The fact that four of the five natural isolates tested display levels of effort discounting similar to N2 (only one natural isolate does not display effort discounting) argues against effort discounting being a laboratory adaptation. We have nevertheless weakened the claim regarding natural isolates. We now say effort discounting-like behavior may not be an adaptation to the laboratory environment.

      (4) Detailed analysis of behavior beyond preference indices would strengthen the dopamine link and the claim of effort discounting in wild strains. 

      Going beyond preference in the behavioral analysis might or might not reveal new phenotypes that strengthen the link with dopamine. At present, however, we think such experiments are beyond the scope of the paper.

      (5) A few mechanistic statements (e.g., tying satiety exclusively to nutrient signals) would benefit from explicit citations or brief clarifications for non-worm specialists. 

      We are unable to identify a mechanistic statement tying satiety to nutrient signals in our manuscript.

      Reviewer #3 (Public Reviews)

      Summary: 

      The authors establish a behavioral task to explore effort discounting in C. elegans. By using bacterial food that takes longer to consume, the authors show that, for equivalent effort, as measured by pumping rate, they obtain less food, as measured by fat deposition. The authors formalize the task by applying a formal neuroeconomic decision-making model that includes value, effort, and discounting. They use this to estimate the discounting that C. elegans applies based on ingestion effort by using a population-level 2-choice T-maze. They then analyze the behavioral dynamics of individual animals transitioning between on-food and off-food states. Harder to ingest bacteria led to increased food patch leaving. Finally, they examined a set of mutants defective in different aspects of dopamine signaling, as dopamine plays a key role in discounting in vertebrates and regulates certain aspects of C. elegans foraging.

      Strengths: 

      The behavioral experiments and neuroeconomic analysis framework are compelling, interesting, and make a significant contribution to the field. While these foraging behaviors have been extensively studied, few include clearly articulated theoretical models to be tested. 

      Demonstrating that C. elegans effort discounting fits model predictions and has stable indifference points is important for establishing these tasks as a model for decision making. 

      Weaknesses: 

      The dopamine experiments are harder to interpret. The authors point out the perplexing lack of an effect of dat-1 and cat-2. dop-3 leads to general indifference. I am not sure this is the expected result if the argument is a parallel functional role to discounting in vertebrates. dop-3 causes a range of locomotor phenotypes and may affect feeding (reduced fat storage), and thus, there may be a general defect in the ability to perform the task rather than anything specific to discounting.

      That said, some of the other DA mutants also have locomotor defects and do not differ from N2. But there is no clear result here - my concern is that global mutants in such a critical pathway exhibit such pleiotropy that it's difficult to conclude there is a clear and specific role for DA in effort discounting. This would require more targeted or cell-specific approaches. 

      We agree with the reviewer that the results of the dopamine experiments are puzzling and getting a better understanding of the role of dopamine in effort-discounting will require more sensitive assays and different experimental approaches (e.g. cell-specific rescues). However, as mentioned by the reviewer, all the mutations tested have some pleiotropic effects, yet only dop-3 displays a defect in effort discounting. This, in our opinion, points to a specific role of dop-3 in effort-discounting in C. elegans. This point is now made in the Discussion in the section titled Role of dopamine signaling in effort discounting-like behavior.

      Meanwhile, there are other pathways known to affect responses to food and patch leaving decisions: serotonin, pigment-dispersing factor, tyramine, etc. The paper would have benefited from a clarification about why these were not considered as promising candidates to test (in addition to or instead of dopamine). 

      We focused on DA because of its well-established effect on effort discounting in rodents. Testing other pathways is a goal for future research.

      Reviewer #1 (Recommendations for the authors):

      The current results are more a reframing of data gathered from a patch-leaving paradigm, but described in the form of economic choice modelling in which discounting is one possible explanation. One more parsimonious explanation is that worms estimate in real-time some rate of reward and leave the patch at some threshold, consistent with canonical foraging models, previous experiments in C. elegans, and the authors' own data (Figure 3). Therefore, I am wary about some of the claims made in this manuscript, such as 'decision-making strategies based on effort-cost trade-offs are evolutionarily conserved'.

      These points are now addressed in the Discussion in a revised section titled A model of effort discounting-like behavior. (i) We now call attention to the fact that our T-maze assay is a patch-leaving foraging paradigm. (ii) We now propose a revised model in which “worms make an on-line assessment of food value in the current patch which in turn alters patch-leaving dynamics, increasing the exit rates from cephalexin-treated patches as shown in Figure 3.” (iii) We now provide evidence from the rodent and human literature that the strategy of on-line assessment of reward value may be evolutionarily conserved in the case of a class of effort discounting tasks whose solution requires on-line assessments.

      If the reason the authors chose to do a patch-leaving style task rather than a traditional t-maze is because C. elegans is unable to retain the sort of information necessary to make such simultaneous decisions - e.g., if pre-training on the two options isn't possible - then this in itself suggests that mechanisms underlying these decisions in worms and mammals are unlikely to be the same. I mention this because I would like to suggest to the authors an alternative interpretation: that patch foraging is actually 'the' canonical computation that translates across species. This would, in fact, be nicely consistent with some other recent modelling work in humans, e.g., https://www.biorxiv.org/content/10.1101/2025.05.06.652482v1

      Please see the previous response.

      Reviewer #2 (Recommendations for the authors):

      Can you provide a picture of the regular and CEPH bacteria? 

      Done (see Figure 1––figure supplement 1).

      Reviewer #3 (Recommendations for the authors):

      I would recommend testing representative mutants in other pathways in the choice task. If possible, more targeted experiments with dop-3, including either cell-specific KOs or rescues, would very much strengthen this aspect of the paper. 

      While valuable, these experiments are out of scope for the present study.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Bansal et al. present a study on the fundamental blood and nectar feeding behaviors of the critical disease vector, Anopheles stephensi. The study encompasses not just the fundamental changes in blood feeding behaviors of the crucially understudied vector, but then uses a transcriptomic approach to identify candidate neuromodulation pathways which influence blood feeding behavior in this mosquito species. The authors then provide evidence through RNAi knockdown of candidate pathways that the neuromodulators sNPF and Rya modulate feeding either via their physiological activity in the brain alone or through joint physiological activity along the brain-gut axis (but critically not the gut alone). Overall, I found this study to be built on tractable, well-designed behavioral experiments.

      Their study begins with a well-structured experiment to assess how the feeding behaviors of A. stephensi change over the course of its life history and in response to its age, mating, and oviposition status. The authors are careful and validate their experimental paradigm in the more well-studied Ae. aegypti, and are able to recapitulate the results of prior studies, which show that mating is a prerequisite for blood feeding behaviors in Ae. aegypti. Here they find A. stephensi, like other Anopheline mosquitoes, has a more nuanced regulation of its blood and nectar feeding behaviors.

      The authors then go on to show in a Y-maze olfactometer that, to some degree, changes in blood feeding status depend on behavioral modulation to host cues, and this is not likely to be a simple change to the biting behaviors alone. I was especially struck by the swap in valence of the host cues for the blood-fed and mated individuals, which had not yet oviposited. This indicates that there is a change in behavior that is not simply desensitization to host cues while navigating in flight, but something much more exciting is happening.

      The authors then use a transcriptomic approach to identify candidate genes in the blood-feeding stages of the mosquito's life cycle to identify a list of 9 candidates that have a role in regulating the host-seeking status of A. stephensi. Then, through investigations of gene knockdown of candidates, they identify the dual action of RYa and sNPF as candidate neuromodulators of host-seeking in this species. Overall, I found the experiments to be well-designed. I found the molecular approach to be sound. While I do not think the molecular approach is necessarily an all-encompassing mechanism identification (owing mostly to the fact that genetic resources are not yet available in A. stephensi as they are in other dipteran models), I think it sets up a rich line of research questions for the neurobiology of mosquito behavioral plasticity and comparative evolution of neuromodulator action.

      We appreciate the reviewer’s detailed summary of our work. We thank them for their positive comments and agree with them on the shortcomings of our approach.

      Strengths:

      I am especially impressed by the authors' attention to small details in the course of this article. As I read and evaluated this article, I continued to think about how many crucial details could potentially have been missed if this had not been the approach. The attention to detail paid off in spades and allowed the authors to carefully tease apart molecular candidates of blood-seeking stages. The authors' top-down approach to identifying RYamide and sNPF starting from first principles behavioral experiments is especially comprehensive. The results from both the behavioral and molecular target studies will have broad implications for the vectorial capacity of this species and comparative evolution of neural circuit modulation.

      We really appreciate that the reviewer has recognised the attention to detail we have tried to bring to this work, thank you!

      Weaknesses:

      There are a few elements of data visualizations and methodological reporting that I found confusing on a first few read-throughs. Figure 1F, for example, was initially confusing as it made it seem as though there were multiple 2-choice assays for each of the conditions. I would recommend removing the "X" marker from the x-axis to indicate the mosquitoes did not feed from either nectar, blood, or neither in order to make it clear that there was one assay in which mosquitoes had access to both food sources, and the data quantify if they took both meals, one meal, or no meals.

      We thank the reviewer for flagging the schematic in figure 1F. As suggested, we have removed the “X” markers from the x-axis and revised the axis label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose in the assay. For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data, as it does not capture the variability in the data.

      I would also like to know more about how the authors achieved tissue-specific knockdown for RNAi experiments. I think this is an intriguing methodology, but I could not figure out from the methods why injections either had whole-body or abdomen-specific knockdown.

      The tissue-specific knockdown (abdomen only or abdomen+head) emerged from initial standardisations where we were unable to achieve knockdown in the head unless we used higher concentrations of dsRNA and did the injections in older females. We realised that this gave us the opportunity to isolate the neuronal contribution of these neuropeptides in the phenotype produced. Further optimisations revealed that injecting dsRNA into 0-10h old females produced abdomen-specific knockdowns without affecting head expression, whereas injections into 4 days old females resulted in knockdowns in both tissues. Moreover, head knockdowns in older females required higher dsRNA concentrations, with knockdown efficiency correlating with the amount injected. In contrast, abdominal knockdowns in younger females could be achieved even with lower dsRNA amounts.

      We have mentioned the knockdown conditions (time of injection and the amount of dsRNA injected) for tissue-specific knockdowns in the methods but realise now that this does not explain the approach well enough. We have now edited it to state our methodology more clearly (see lines 932-948).

      I also found some interpretations of the transcriptomic data to be overly broad for what transcriptomes can actually tell us about the organism's state. For example, the authors mention, "Interestingly, we found that after a blood meal, glucose is neither spent nor stored, and that the female brain goes into a state of metabolic 'sugar rest', while actively processing proteins (Figure S2B, S3)".

      This would require a physiological measurement to actually know. It certainly suggests that there are changes in carbohydrate metabolism, but there are too many alternative interpretations to make this broad claim from transcriptomic data alone.

      We thank the reviewer for pointing this out and agree with them. We have now edited our statement to read:

      “Instead, our data suggests altered carbohydrate metabolism  after a blood meal, with the female brain potentially entering a state of metabolic 'sugar rest' while actively processing proteins (Figure S2B, S3). However, physiological measurements of carbohydrate and protein metabolism will be required to confirm whether glucose is indeed neither spent nor stored during this period.” See lines 271-277.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Bansal et al. examine and characterize feeding behaviour in Anopheles stephensi mosquitoes. While sharing some similarities to the well-studied Aedes aegypti mosquito, the authors demonstrate that mated females, but not unmated (virgin) females, exhibit suppression in their blood-feeding behaviour. Using brain transcriptomic analysis comparing sugar-fed, blood-fed, and starved mosquitoes, several candidate genes potentially responsible for influencing blood-feeding behaviour were identified, including two neuropeptides (short NPF and RYamide) that are known to modulate feeding behaviour in other mosquito species. Using molecular tools, including in situ hybridization, the authors map the distribution of cells producing these neuropeptides in the nervous system and in the gut. Further, by implementing systemic RNA interference (RNAi), the study suggests that both neuropeptides appear to promote blood-feeding (but do not impact sugar feeding), although the impact was observed only after both neuropeptide genes underwent knockdown.

      Strengths and/or weaknesses:

      Overall, the manuscript was well-written; however, the authors should review carefully, as some sections would benefit from restructuring to improve clarity. Some statements need to be rectified as they are factually inaccurate.

      Below are specific concerns and clarifications needed in the opinion of this reviewer:

      (1) What does "central brains" refer to in abstract and in other sections of the manuscript (including methods and results)? This term is ambiguous, and the authors should more clearly define what specific components of the central nervous system was/were used in their study.

      Central brain, or mid brain, is a commonly used term to refer to brain structures/neuropils without the optic lobes (For example: https://www.nature.com/articles/s41586-024-07686-5). In this study we have focused our analysis on the central brain circuits involved in modulating blood-feeding behaviour and have therefore excluded the optic lobes. As optic lobes account for nearly half of all the neurons in the mosquito brain (https://pmc.ncbi.nlm.nih.gov/articles/PMC8121336/), including them would have disproportionately skewed our transcriptomic data toward visual processing pathways.

      We have indicated this in figure 3A and in the methods (see lines 800-801, 812). We have now also clarified it in the results section for neuro-transcriptomics to avoid confusion (see lines 236-237).

      (2) The abstract states that two neuropeptides, sNPF and RYamide are working together, but no evidence is summarized for the latter in this section.

      We thank the reviewer for pointing this out. We have now added a statement “This occurs in the context of the action of RYa in the brain” to the end of the abstract, for a complete summary of our proposed model.

      (3) Figure 1

      Panel A: This should include mating events in the reproductive cycle to demonstrate differences in the feeding behavior of Ae. aegypti.

      Our data suggest that mating can occur at any time between eclosion and oviposition in An. stephensi and between eclosion and blood feeding in Ae. aegypti. Adding these into (already busy) 1A, would cloud the purpose of the schematic, which is to indicate the time points used in the behavioural assays and transcriptomics.

      Panel F: In treatments where insects were not provided either blood or sugar, how is it that some females and males had fed? Also, it is unclear why the y-axis label is % fed when the caption indicates this is a choice assay. Also, it is interesting that sugar-starved females did not increase sugar intake. Is there any explanation for this (was it expected)?

      We apologise for the confusion. The experiment is indeed a choice assay in which sugar-starved or sugar-sated females, co-housed with males, were provided simultaneous access to both blood and sugar, and were assessed for the choice made (indicated on the x-axis): both blood and sugar, blood only, sugar only, or neither. The x-axis indicates the choice made by the mosquitoes, not the choice provided in the assay, and the y-axis indicates the percentage of males or females that made each particular choice. We have now removed the “X” markers from the x-axis and revised the axis label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose to take.

      In this assay, we scored females only for the presence or absence of each meal type (blood or sugar) and are therefore unable to comment on whether sugar-starved females consumed more sugar than sugar-sated females. However, when sugar-starved, a higher proportion of females consumed both blood and sugar, while fewer fed on blood alone.

      For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data as it does not capture the variability in the data.

      (4) Figure 3

      In the neurotranscriptome analysis of the (central) brain involving the two types of comparisons, can the authors clarify what "excluded in males" refers to? Does this imply that only genes not expressed in males were considered in the analysis? If so, what about co-expressed genes that have a specific function in female feeding behaviour?

      This is indeed correct. We reasoned that since blood feeding is exclusive to females, we should focus our analysis on genes that were specifically upregulated in them. As the reviewer points out, it is very likely that genes commonly upregulated in males and females may also promote blood feeding and we will miss out on any such candidates based on our selection criteria.

      (5) Figure 4

      The authors state that there is more efficient knockdown in the head of unfed females; however, this is not accurate since they only get knockdown in unfed animals, and no evidence of any knockdown in fed animals (panel D). This point should be revised in the results text as well.

      Perhaps we do not understand the reviewer’s point or there has been a misunderstanding. In figure 4D, we show that while there is more robust gene knockdown in unfed females, blood-fed females also showed modest but measurable knockdowns ranging from 5-40% for RYamide and 2-21% for sNPF.

      Relatedly, blood-feeding is decreased when both neuropeptide transcripts are targeted compared to uninjected (panel C) but not compared to dsGFP injected (panel E). Why is this the case if authors showed earlier in this figure (panel B) that dsGFP does not impact blood feeding?

      We realise this concern stems from our representation of the data. Since we had earlier determined that dsGFP-injected females fed similarly to uninjected females (fig 4B), we used these controls interchangeably in subsequent experiments. To avoid confusion, we have now only used the label ‘control’ in figure 4 (and supplementary figure S9) and specified which control was used for each experiment in the legend.

      In addition to this, we wanted to clarify that fig 4C and 4E are independent experiments. 4C is the behaviour corresponding to when the neuropeptides were knocked down in both heads and abdomens. 4E is the behaviour corresponding to when the neuropeptides were knocked down in only the abdomens. We have now added a schematic in the plots to make this clearer.

      In addition, do the uninjected and dsGFP-injected relative mRNA expression data reflect combined RYa and sNPF levels? Why is there no variation in these data,…

      In these qPCRs, we calculated relative mRNA expression using the delta-delta Ct method (see line 975). For each neuropeptide its respective control was used. For simplicity, we combined the RYa and sNPF control data into a single representation. The value of this control is invariant because this method sets the control baseline to a value of 1.
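      The invariance of the control value follows directly from the arithmetic of the delta-delta Ct calculation, sketched below. The sketch assumes roughly 100% amplification efficiency (a doubling per cycle) and normalisation to a reference gene; the function name, argument names, and Ct values are hypothetical and are not taken from the study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target transcript by the delta-delta Ct method.

    Assumes ~100% amplification efficiency, i.e. one Ct unit = a 2-fold change.
    """
    # Normalise the target to the reference gene within each condition.
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the treated sample to the control condition.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the control compared with itself always gives 1,
# and a target that appears one cycle later corresponds to 50% expression.
print(relative_expression(25.0, 18.0, 25.0, 18.0))  # control vs control
print(relative_expression(26.0, 18.0, 25.0, 18.0))  # knockdown vs control
```

      Because the control is compared with itself, its delta-delta Ct is zero and the fold change is exactly 1, so the control bar carries no variation by construction.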

      …and how do transcript levels of RYa and sNPF compare in the brain versus the abdomen (the presentation of data doesn't make this relationship clear).

      The reviewer is correct in pointing out that we have not clarified this relationship in our current presentation. While we have not performed absolute mRNA quantifications, we extracted relative mRNA levels from qPCR data of 96h old unmanipulated control females. We observed that both sNPF and RYa transcripts are expressed at much lower levels in the abdomens, as compared to those in the heads, as shown in the graphs inserted below.

      Author response image 1.

      (6) As an overall comment, the figure captions are far too long and include redundant text presented in the methods and results sections.

      We thank the reviewer for flagging this and have now edited the legends to remove redundancy.

      (7) Criteria used for identifying neuropeptides promoting blood-feeding: statement that reads "all neuropeptides, since these are known to regulate feeding behaviours". This is not accurate since not all neuropeptides govern feeding behaviors, while certainly a subset do play a role.

      We agree with the reviewer that not all neuropeptides regulate feeding behaviours. Our statement refers to the screening approach we used: in our shortlist of candidates, we chose to validate all neuropeptides.

      (8) In the section beginning with "Two neuropeptides - sNPF and RYa - showed about 25% and 40% reduced mRNA levels...", the authors state that there was no change in blood-feeding and later state the opposite. The wording should be clarified as it is unclear.

      Thank you for pointing this out. We were referring to an unchanged proportion of the blood fed females. We have now edited the text to the following:

      “Two neuropeptides - sNPF and RYa - showed about 25% and 40% reduced mRNA levels in the heads but the proportion of females that took blood meals remained unchanged”. See lines 338-340.

      (9) Just before the conclusions section, the statement that "neuropeptide receptors are often ligand promiscuous" is unjustified. Indeed, many studies have shown in heterologous systems that high concentrations of structurally related peptides, which are not physiologically relevant, might cross-react and activate a receptor belonging to a different peptide family; however, the natural ligand is often many times more potent (in most cases, orders of magnitude) than structurally related peptides. This is certainly the case for various RYamide and sNPF receptors characterized in various insect species.

      We agree with the reviewer and apologise for the mistake. We have now removed the statement.

      (10) Methods

      In the dsRNA-mediated gene knockdown section, the authors could more clearly describe how much dsRNA was injected per target. At the moment, the reader must carry out calculations based on the concentrations provided and the injected volume range provided later in this section.

      We have now edited the section to reflect the amount of dsRNA injected per target. Please see lines 921-931.

      It is also unclear how tissue-specific knockdown was achieved by performing injection on different days/times. The authors need to explain/support, and justify how temporal differences in injection lead to changes in tissue-specific expression. Does the blood-brain barrier limit knockdown in the brain instead, while leaving expression in the peripheral organs susceptible?

      To achieve tissue-specific knockdowns of sNPF and RYa, we optimised both the time of injection as well as the dsRNA concentration to be injected. Injecting dsRNA into 0-10h females produced abdomen-specific knockdowns without affecting head expression, whereas injections into 96h old females resulted in knockdowns in both tissues. Head knockdowns in older females required higher dsRNA concentrations, with knockdown efficiency correlating with the amount injected. In contrast, abdominal knockdowns in younger females could be achieved even with lower dsRNA amounts, reflecting the lower baseline expression of sNPF in abdomens compared to heads and the age-dependent increase in head expression (as confirmed by qPCR). It is possible that the blood-brain barrier also limits the dsRNA entering the brain, thereby requiring higher amounts to be injected for head knockdowns.

      We have now edited this section to state our methodology more clearly (see lines 932-948).

      For example, in Figure 4, the data support that knockdown in the head/brain is only effective in unfed animals compared to uninjected animals, while there is no evidence of knockdown in the brain relative to dsGFP-injected animals. Comparatively, evidence appears to show stronger evidence of abdominal knockdown mostly for the RYa transcript (>90%) while still significantly for the sNPF transcript (>60%).

      As we explained earlier, this concern likely stems from our representation of the data. Since we had earlier determined that dsGFP-injected females fed similarly to uninjected females (fig 4B), we used these controls interchangeably in subsequent experiments. To avoid confusion, we have now only used the label ‘control’ in figure 4 (and supplementary figure S9) and specified which control was used for each experiment in the legend.

      In addition to this, we wanted to clarify that fig 4C and 4E are independent experiments. 4C is the behaviour corresponding to when the neuropeptides were knocked down in both heads and abdomens. 4E is the behaviour corresponding to when the neuropeptides were knocked down in only the abdomen. We have now added a schematic in the plots to make this clearer.

      Reviewer #3 (Public review):

      Summary:

      This manuscript investigates the regulation of host-seeking behavior in Anopheles stephensi females across different life stages and mating states. Through transcriptomic profiling, the authors identify differential gene expression between "blood-hungry" and "blood-sated" states. Two neuropeptides, sNPF and RYamide, are highlighted as potential mediators of host-seeking behavior. RNAi knockdown of these peptides alters host-seeking activity, and their expression is anatomically mapped in the mosquito brain (sNPF and RYamide) and midgut (sNPF only).

      Strengths:

      (1) The study addresses an important question in mosquito biology, with relevance to vector control and disease transmission.

      (2) Transcriptomic profiling is used to uncover gene expression changes linked to behavioral states.

      (3) The identification of sNPF and RYamide as candidate regulators provides a clear focus for downstream mechanistic work.

      (4) RNAi experiments demonstrate that these neuropeptides are necessary for normal host-seeking behavior.

      (5) Anatomical localization of neuropeptide expression adds depth to the functional findings.

      Weaknesses:

      (1) The title implies that the neuropeptides promote host-seeking, but sufficiency is not demonstrated (for example, with peptide injection or overexpression experiments).

      Demonstrating sufficiency would require injecting sNPF peptide or its agonist. To date, no small-molecule agonists (or antagonists) that selectively mimic sNPF or RYa neuropeptides have been identified in insects. An NPY analogue, TM30335, has been reported to activate the Aedes aegypti NPY-like receptor 7 (NPYLR7; Duvall et al., 2019), which is also activated by sNPF peptides at higher doses (Liesch et al., 2013). Unfortunately, the compound is no longer available because its manufacturer, 7TM Pharma, has ceased operations. Synthesising the peptides is a possibility that we will explore in the future.

      (2) The proposed model regarding central versus peripheral (gut) peptide action is inconsistently presented and lacks strong experimental support.

      The best way to address this would be to conduct tissue-specific manipulations, the tools for which are not available in this species. Our approach of comparing head+abdomen and abdomen-only knockdowns was the closest we could get to achieving tissue specificity and allowed us to confirm that knockdown in the head was necessary for the phenotype. However, as the reviewer points out, this did not allow us to rule out any involvement of the abdomen. This point has been addressed in lines 364-371.

      (3) Some conclusions appear premature based on the current data and would benefit from additional functional validation.

      The most definitive way of demonstrating necessity of sNPF and RYa in blood feeding would be to generate mutant lines. While we are pursuing this line of experiments, they lie beyond the scope of a revision. In its absence, we relied on the knockdown of the genes using dsRNA. We would like to posit that despite only partial knockdown, mosquitoes do display defects in blood-feeding behaviour, without affecting sugar-feeding. We think this reflects the importance of sNPF in promoting blood feeding.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, I found this manuscript to be well-prepared; visually, the figures are great and were clearly carefully thought out and curated, and the research is impactful. It was a wonderful read from start to finish. I have the following recommendations:

      Thank you very much, we are very pleased to hear that you enjoyed reading our manuscript!

      (1) For future manuscripts, it would make things significantly easier on the reviewer side to submit a format that uses line numbers.

      We sincerely apologise for the oversight. We have now incorporated line numbers in the revised manuscript.

      (2) There are a few statements in the text that I think may need clarification or might be outside the bounds of what was actually studied here. For example, in the introduction "However, mating is dispensable in Anophelines even under conditions of nutritional satiety". I am uncertain what is meant by this statement - please clarify.

      We apologise for the lack of clarity in the statement and have now deleted it since we felt it was not necessary.

      (3) Typo/Grammatical minutiae:

      a) A small idiosyncrasy of using hyphens in compound words should also be fixed throughout. Typically, you don't hyphenate if the words are being used as a noun, as in the case: e.g. "Age affects blood feeding.". However, you would hyphenate if the two words are used as a compound adjective "Age affects blood-feeding behavior". This may not be an all-inclusive list, but here are some examples where hyphens need to either be removed or added. Some examples:

      "Nutritional state also influences other internal state outputs on blood-feeding": blood-feeding -> blood feeding

      "... the modulation of blood-feeding": blood-feeding -> blood feeding

      "For example, whether virgin females take blood-meals...": blood-meals -> blood meals

      ".... how internal and external cues shape meal-choice"-> meal choice

      "blood-meal" is often used throughout the text, but is correctly "blood meal" in the figures.

      There are many more examples throughout.

      We apologise for these errors and appreciate the reviewer’s keen eye. We have now fixed them throughout the manuscript.

      b) Figure 1 Caption has a typo: "co-housed males were accessed for sugar-feeding" should be "co-housed males were assessed for sugar feeding"

      We apologise for the typo and thank the reviewer for spotting it. We have now corrected this.

      c) It would be helpful in some other figure captions to more clearly label which statement is relevant to which part of the text. For example, in Figure 4's caption.

      "C,D. Blood-feeding and sugar-feeding behaviour of females when both RYa and sNPF are knocked down in the head (C). Relative mRNA expressions of RYa and sNPF in the heads of dsRYa+dssNPF - injected blood-fed and unfed females, as compared to that in uninjected females, analysed via qPCR (D)."

      I found that re-referencing C and D at the end of their statements makes it look as though C precedes the "Relative mRNA expression", and on a first read-through I thought the figure captions were backwards. I'd recommend reformatting here and throughout consistently to only have the figure letter precede its relevant caption information, e.g.:

      "C. Blood-feeding and sugar-feeding behaviour of females when both RYa and sNPF are knocked down in the head. D. Relative mRNA expressions of RYa and sNPF in the heads of dsRYa+dssNPF - injected blood-fed and unfed females, as compared to that in uninjected females, analysed via qPCR."

      We have now edited the legends as suggested.

      Reviewer #2 (Recommendations for the authors):

      Separately from the clarifications and limitations listed above, the authors could strengthen their study and the conclusions drawn if they could rescue the behavioural phenotype observed following knockdown of sNPF and RYamide. This could be achieved by injection of either sNPF or RYa peptide independently or combined following knockdown to validate the role of these peptides in promoting blood-feeding in An. stephensi. Additionally, the apparent (but unclear) regionalized (or tissue-specific) knockdown of sNPF and RYamide transcripts could be visualized and verified by implementing HCR in situ hyb in knockdown animals (or immunohistochemistry using antibodies specific for these two neuropeptides).

      In a follow up of this work, we are generating mutants and peptides for these candidates and are planning to conduct exactly the experiments the reviewer suggests.

      Reviewer #3 (Recommendations for the authors):

      The loss-of-function data suggest necessity but not sufficiency. Synthetic peptide injection in non-host seeking (blood-fed mated or juvenile) mosquitoes would provide direct evidence for peptide-induced behavioral activation. The lack of these experiments weakens the central claim of the paper that these neuropeptides directly promote blood feeding.

      As noted above, we plan to synthesise the peptide to test rescue in a mutant background and sufficiency.

      Some of the claims about knockdown efficiency and interpretation are conflicting; the authors dismiss Hairy and Prp as candidates due to 30-35% knockdown, yet base major conclusions on sNPF and RYamide knockdowns with comparable efficiencies (25-40%). This inconsistency should be addressed, or the justification for different thresholds should be clearly stated.

      We have not defined any specific knockdown efficacy thresholds in the manuscript, as these can vary considerably between genes, and in some cases, even modest reductions can be sufficient to produce detectable phenotypes. For example, knockdown efficiencies of even as low as about 25% - 40% gave us observable phenotypes for sNPF and RYa RNAi (Figure S9B-G).

      No such phenotypes were observed for Hairy (30%) or Prp (35%) knockdowns. Either these genes are not involved in blood feeding, or the knockdown was not sufficient for these specific genes to induce phenotypes. We cannot distinguish between these scenarios.

      The observation that knockdown animals take smaller blood meals is interesting and could reflect a downstream effect of altered host-seeking or an independent physiological change. The relationship between meal size and host-seeking behavior should be clarified.

      We agree with the reviewer that the reduced meal size observed in sNPF and RYa knockdown animals could result from their inability to seek a host or due to an independent effect on blood meal intake. Unfortunately, we did not measure host-seeking in these animals. We plan to distinguish between these possibilities using mutants in future work.

      Several figures are difficult to interpret due to cluttered labeling and poorly distinguishable color schemes. Simplifying these and improving contrast (especially for co-housed vs. virgin conditions) would enhance readability.

      We regret that the reviewer found the figures difficult to follow. We have now revised our annotations throughout the manuscript for enhanced readability. For example, “D1<sup>B</sup>” is now “D1<sup>PBM</sup>” (post-bloodmeal) and “D1<sup>O</sup>” is now “D1<sup>PO</sup>” (post-oviposition). Wherever mated females were used, we have now appended “(m)” to the annotations and consistently depicted these females with striped abdomens in all the schematics. We believe these changes will improve clarity and readability.

      The manuscript does not clearly justify the use of whole-brain RNA sequencing to identify peptides involved in metabolic or peripheral processes. Given that anticipatory feeding signals are often peripheral, the logic for brain transcriptomics should be explained.

      The reviewer is correct in pointing out that feeding signals could also emerge from peripheral tissues. Signals from these tissues, in response to both changing nutritional and reproductive states, are then integrated by the central brain to modulate feeding choices. For example, in Drosophila, increased protein intake is mediated by central brain circuitry, including circuits in the SEZ and central complex (Munch et al., 2022; Liu et al., 2017; Goldschmidt et al., 2023). In the context of mating, male-derived sex peptide further increases protein feeding by acting on dedicated central brain circuitry (Walker et al., 2015). We therefore focused on the central brain for our studies.

      The proposed model suggests brain-derived peptides initiate feeding, while gut peptides provide feedback. However, gut-specific knockdowns had no effect, undermining this hypothesis. Conversely, the authors also suggest abdominal involvement based on RNAi results. These contradictions need to be resolved into a consistent model.

      We thank the reviewer for raising this point and recognise their concern. Our reasons for invoking an involvement of the gut were two-fold:

      (1) We find increased sNPF transcript expression in the entero-endocrine cells of the midgut in blood-hungry females, which returns to baseline after a blood meal (Fig. 4L, M).

      (2) While the abdomen-only knockdowns did not affect blood feeding, every effective head knockdown that affected blood feeding also abolished abdominal transcript levels (Fig. S9C, F). (Achieving a head-only reduction proved impossible because (i) systemic dsRNA delivery inevitably reaches the abdomen and (ii) abdominal expression of both peptides is low, leaving little dynamic range for selective manipulation.) Consequently, we can only conclude the following: 1) that brain expression is required for the behaviour, 2) that we cannot exclude a contributory role for gut-derived sNPF. We have discussed this in lines 364-371.

      The identification of candidate receptors is promising, but the manuscript would be significantly strengthened by testing whether receptor knockdowns phenocopy peptide knockdowns. Without this, it is difficult to conclude that the identified receptors mediate the behavioral effects.

      We agree that functional validation of the receptors would strengthen the evidence for sNPF- and RYa-mediated control of blood feeding in An. stephensi. We selected these receptors based on sequence homology. A possibility remains that sNPF neuropeptides activate more than one receptor, each modulating a distinct circuit, as shown in the case of Drosophila Tachykinin (https://pmc.ncbi.nlm.nih.gov/articles/PMC10184743/). Confirming their roles will therefore require systematic characterisation and knockdown of each receptor. We are planning these experiments in the future.

      The authors compared the percentage changes in sugar-fed and blood-fed animals under sugar-sated or sugar-starved conditions. Figure 1F should reflect what was discussed in the results.

      Perhaps this concern stems from our representation of the data in figure 1F? We have now edited the x-axis and revised its label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose to take.

      For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data because it does not capture the variability in the data.

      Minor issues:

      (1) The authors used mosquitoes with belly stripes to indicate mated females. To be consistent, the post-oviposition females should also have belly stripes.

      We thank the reviewer for pointing this out. We have now edited all the figures as suggested.

      (2) In the first paragraph on the right column of the second page, the authors state, "Since females took blood-meals regardless of their prior sugar-feeding status and only sugar-feeding was selectively suppressed by prior sugar access." Just because the well-fed animals ate less than the starved animals does not mean their feeding behavior was suppressed.

      Perhaps there has been a misunderstanding in the experimental setup of figure 1F, probably stemming from our data representation. The experiment is a choice assay in which sugar-starved or sugar-sated females, co-housed with males, were provided simultaneous access to both blood and sugar, and were assessed for the choice made (indicated on the x-axis): both blood and sugar, blood only, sugar only, or neither. We scored females only for the presence or absence of each meal type (blood or sugar) and did not quantify the amount consumed.
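
      The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration of how presence/absence scores for blood and sugar map onto the four choice categories and onto per-category proportions; the function names and the example records are hypothetical and are not data from the study.

```python
from collections import Counter

# The four mutually exclusive outcomes of the two-choice assay.
CATEGORIES = ("both", "blood only", "sugar only", "neither")


def classify_choice(took_blood: bool, took_sugar: bool) -> str:
    """Map presence/absence of each meal type to one choice category."""
    if took_blood and took_sugar:
        return "both"
    if took_blood:
        return "blood only"
    if took_sugar:
        return "sugar only"
    return "neither"


def choice_proportions(females):
    """Return the fraction of females in each category.

    `females` is a sequence of (took_blood, took_sugar) boolean pairs,
    one pair per scored female. Meal size is deliberately not recorded.
    """
    if not females:
        raise ValueError("no females scored")
    counts = Counter(classify_choice(blood, sugar) for blood, sugar in females)
    return {cat: counts.get(cat, 0) / len(females) for cat in CATEGORIES}


# Hypothetical records for four females (NOT data from the manuscript).
example = [(True, True), (True, False), (True, False), (False, False)]
print(choice_proportions(example))
```

      Because each female falls into exactly one category, the four proportions sum to 1, which is what the stacked representation of Fig. 1F displays.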

      (3) The figure legend for Figure 1A and the naming convention for different experimental groups are difficult to follow. A simplified or consistently abbreviated scheme would help readers navigate the figures and text.

      We regret that the reviewer found the figure difficult to follow. We have now revised our annotations throughout the manuscript for enhanced readability. For example, “D1<sup>B</sup>” is now “D1<sup>PBM</sup>” (post-bloodmeal) and “D1<sup>O</sup>” is now “D1<sup>PO</sup>” (post-oviposition).

      (4) In the last paragraph of the Y-maze olfactory assay for host-seeking behaviour in An. stephensi in Methods, the authors state, "When testing blood-fed females, aged-matched sugar-fed females (blood-hungry) were included as positive controls where ever possible, with satisfactory results." The authors should explicitly describe what the criteria are for "satisfactory results".

      We apologise for the lack of clarity. We have now edited the statement to read:

      “When testing blood-fed females, age-matched sugar-fed females (blood-hungry) were included wherever possible as positive controls. These females consistently showed attraction to host cues, as expected.” See lines 786-790.

      (5) In the first paragraph of the dsRNA-mediated gene knockdown section in Methods, dsRNA against GFP is used as a negative control for the injection itself, but not for the potential off-target effect.

      We agree with the reviewer that dsGFP injections act as controls only for injection-related behavioural changes, and not for off-target effects of RNAi. We have now corrected the statement. See lines 919-920.

      To control for off-target effects, we could have designed multiple dsRNAs targeting different parts of a given gene. We regret not including these controls for potential off-target effects of dsRNAs injected.

      (6) References numbers 48, 89, and 90 are not complete citations.

      We thank the reviewer for spotting these. We have now corrected these citations.

    1. Author Response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      The scale bar for fly and ovary images should be included in Figures 9, 10, and 12.

      We agree with this comment and apologize for the oversight. We have now modified Figures 9, 10, and 12 to include the scale bars for the ovary images. The fly images were acquired using a stereo microscope where scale bar calculation was not possible. However, all images were acquired at the same magnification for consistency.

      Reviewer #2 (Public review):

      A weakness of this paper is the phylogenetic analysis to investigate if there is correspondence in the phylogenetic distribution of ITP-type and Gyc76C-type genes/proteins. Unfortunately, the evidence presented is rather limited in scope. Essentially, the authors report that they only found ITP-type and Gyc76C-type genes/proteins in protostomes, but not in deuterostomes. What is needed is a more fine-grained analysis at the species level within the protostomes. However, I recognise that such a detailed analysis may extend beyond the scope of this paper, which is already rich in data.

      We thank the reviewer for their comment and the suggestion to perform a fine-grained species-level comparison of ITP and Gyc76C genes across protostomes. We are unsure of the utility of this analysis for the present study given that we have now shown that ITPa can activate Gyc76C using both an ex vivo and a heterologous assay, the latter being the gold standard in GPCR and guanylate cyclase discovery (see Huang et al. 2025 https://doi.org/10.1073/pnas.2420966122; Beets et al. 2023 https://doi.org/10.1016/j.celrep.2023.113058; Chang et al. 2009 https://doi.org/10.1073/pnas.0812593106).

      Additionally, absence of a gene in a genome/proteome is hard to prove especially when many/most of the protostomian datasets are not as high-quality as those of model systems (e.g. Drosophila melanogaster and Caenorhabditis elegans). Secondly, based on previous findings in Bombyx mori (Nagai et al. 2014 https://doi.org/10.1074/jbc.m114.590646 and Nagai et al. 2016 https://doi.org/10.1371/journal.pone.0156501) and Drosophila (Xu et al. 2023 https://doi.org/10.1038/s41586-023-06833-8 and our study) it is evident that different products of the ITP gene (ITPa and ITPL) could signal via different receptor types depending on the species. Hence, we would need to explore the presence of several genes (ITP, tachykinin, pyrokinin, tachykinin receptor, pyrokinin receptor, CG30340 orphan receptor and Gyc76C) to fully understand which components of these diverse signaling systems are present in a given species to decipher the potential for cross-talk.

      While this species-level comparison will certainly be useful in the context of ITP-Gyc76C evolution, it will not alter the conclusions of the present study – ITPa acts via Gyc76C in Drosophila. We therefore agree with the reviewer that these analyses are beyond the scope of this paper.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):  

      Summary:  

      In Drosophila melanogaster, ITP functions in feeding, drinking, metabolism, excretion, and circadian rhythm. In the current study, the authors characterized and compared the expression of all three ITP isoforms (ITPa and ITPL1&2) in the CNS and peripheral tissues of Drosophila. An important finding is that they functionally characterized and identified Gyc76C as an ITPa receptor in Drosophila using both in vitro and in vivo approaches. In vitro, the authors nicely confirmed that the inhibitory function of recombinant Drosophila ITPa on MT secretion is Gyc76C-dependent (knockdown of Gyc76C specifically in two types of cells abolished the anti-diuretic action of Drosophila ITPa on renal tubules). They also used a combination of multiple approaches to investigate the roles of ITPa and Gyc76C in modulating osmotic and metabolic homeostasis in vivo. They revealed that ITPa signaling to renal tubules and fat body modulates osmotic and metabolic homeostasis via Gyc76C.

      Furthermore, they tried to identify the upstream and downstream of ITP neurons in the nervous system by using connectomics and single-cell transcriptomic analysis. I found this interesting manuscript to be well-written and described. The findings in this study are valuable to help understand how ITP signals work on systemic homeostasis regulation. Both anatomical and single-cell transcriptome analysis here should be useful to many in the field. 

      We thank this reviewer for the positive and thorough assessment of our manuscript.  

      Strengths:  

      The question that this study tries to address (what are the receptors of ITPa in Drosophila?) is important. The authors ruled out the Bombyx ITPa receptor orthologs as potential candidates. They identified a novel ITP receptor by using phylogenetic and anatomical analyses, and both in vitro and in vivo approaches.

      The authors exhibited detailed anatomical data of both ITP isoforms and Gyc76C (in the main and supplementary figures), which helped audiences understand the expression of the neurons studied in the manuscript.  

      They also performed connectomic and single-cell transcriptomic analyses to study the synaptic and peptidergic connectivity of ITP-expressing neurons. This provided more information for better understanding and further study of systemic homeostasis modulation.

      Weaknesses:  

      In the discussion section, the authors raised the limitations of the current study, which I mostly agree with, such as the lack of verification of direct binding between ITPa and Gyc76C, even though they provided different data to support that ITPa-Gyc76C signaling pathway regulates systemic homeostasis in adult flies. 

      We now provide evidence of Gyc76C activation by ITPa in a heterologous system (new Figure 7 and Figure 7 Supplement 1).

      Reviewer #2 (Public Review):  

      Summary:  

      The physiology and behaviour of animals are regulated by a huge variety of neuropeptide signalling systems. In this paper, the authors focus on the neuropeptide ion transport peptide (ITP), which was first identified and named on account of its effects on the locust hindgut (Audsley et al. 1992). Using Drosophila as an experimental model, the authors have mapped the expression of three different isoforms of ITP (Figures 1, S1, and S2), all of which are encoded by the same gene.  

      The authors then investigated candidate receptors for isoforms of ITP. Firstly, Drosophila orthologs of G-protein coupled receptors (GPCRs) that have been reported to act as receptors for ITPa or ITPL in the insect Bombyx mori were investigated. Importantly, the authors report that ITPa does not act as a ligand for the GPCRs TkR99D and PK2-R1 (Figure S3). Therefore, the authors investigated other putative receptors for ITPs. Informed by a previously reported finding that ITP-type peptides cause an increase in cGMP levels in cells/tissues (Dircksen, 2009, Nagai et al., 2014), the authors investigated guanylyl cyclases as candidate receptors for ITPs. In particular, the authors suggest that Gyc76C may act as an ITP receptor in Drosophila.  

      Evidence that Gyc76C may be involved in mediating effects of ITP in Bombyx was first reported by Nagai et al. (2014) and here the authors present further evidence, based on a proposed concordance in the phylogenetic distribution of ITP-type neuropeptides and Gyc76C (Figure 2). Having performed detailed mapping of the expression of Gyc76C in Drosophila (Figures 3, S4, S5, S6), the authors then investigated if Gyc76C knockdown affects the bioactivity of ITPa in Drosophila. The inhibitory effect of ITPa on leucokinin- and diuretic hormone-31-stimulated fluid secretion from Malpighian tubules was found to be abolished when expression of Gyc76C was knocked down in stellate cells and principal cells, respectively (Figure 4). However, as discussed below, this does not provide proof that Gyc76C directly mediates the effect of ITPa by acting as its receptor. The effect of Gyc76C knockdown on the action of ITPa could be an indirect consequence of an alteration in cGMP signalling.

      Having investigated the proposed mechanism of ITPa in Drosophila, the authors then investigated its physiological roles at a systemic level. In Figure 5 the authors present evidence that ITPa is released during desiccation and accordingly, overexpression of ITPa increases survival when animals are subjected to desiccation. Furthermore, knockdown of Gyc76C in stellate or principal cells of Malpighian tubules decreases survival when animals are subject to desiccation. However, whilst this is correlative, it does not prove that Gyc76C mediates the effects of ITPa. The authors investigated the effects of knockdown of Gyc76C in stellate or principal cells of Malpighian tubules on (i) survival when animals are subject to salt stress and (ii) time taken to recover from chill coma. It is not clear, however, why animals overexpressing ITPa were not also tested for effects on (i) survival when animals are subject to salt stress and (ii) time taken to recover from chill coma. In Figures 6 and S8, the authors show the effects of Gyc76C knockdown in the female fat body on metabolism, feeding-associated behaviours and locomotor activity, which are interesting. Furthermore, the relevance of the phenotypes observed to potential in vivo actions of ITPa is explored in Figure 7. The authors conclude that "increased ITPa signaling results in phenotypes that largely mirror those seen following Gyc76C knockdown in the fat body, providing further support that ITPa mediates its effects via Gyc76C." Use of the term "largely mirror" seems inappropriate here because there are opposing effects, e.g. decreased starvation resistance in Figure 6A versus increased starvation resistance in Figure 7A. Furthermore, as discussed above, the results of these experiments do not prove that the effects of ITPa are mediated by Gyc76C because the effects reported here could be correlative, rather than causative.

      We thank this reviewer for an extremely thorough and fair assessment of our manuscript. 

      We have now performed salt stress tolerance and chill coma recovery assays using flies over-expressing ITPa (new Figure 10 Supplement 1).

      We agree that the use of the term “largely mirrors” to describe the effects of ITPa overexpression and Gyc76C knockdown is not appropriate and have changed this sentence. We also agree that the experiments did not provide direct evidence that the effects of ITPa are mediated by Gyc76C. To address this, we now provide evidence of Gyc76C activation by ITPa in a heterologous system (new Figure 7 and Figure 7 Supplement 1).

      Lastly, in Figures 8, S9, and S10 the authors analyse publicly available connectomic data and single-cell transcriptomic data to identify putative inputs and outputs of ITPa-expressing neurons. These data are a valuable addition to our knowledge of ITPa-expressing neurons, but they do not address the core hypothesis of this paper, namely that Gyc76C acts as an ITPa receptor.

      The goal of our study was to comprehensively characterize an anti-diuretic system in Drosophila. Hence, in addition to identifying the receptor via which ITPa exerts its effects, we also wanted to understand how ITPa-producing neurons are regulated. Connectomic and single-cell transcriptomic analyses are highly appropriate for this purpose. We have now updated the connectomic analyses using an improved connectome dataset that was released during the revision of this manuscript. Our new analysis shows that l-NSC<sup>ITP</sup> are connected to other endocrine cells that produce other homeostatic hormones (new Figure 13F). We also identify a pathway through which other ITP-producing neurons (LNd<sup>ITP</sup>) receive hygrosensory inputs to regulate water-seeking behavior (new Figure 13E). Moreover, we now include results showing that ITPa-producing neurons (l-NSC<sup>ITP</sup>) are active (new Figure 8A and B) and release ITPa under desiccation. Together with other analyses, these data provide a comprehensive outlook on the when, what and how of ITPa regulation of systemic homeostasis.

      Strengths:  

      (1) The main strengths of this paper are (i) the detailed analysis of the expression and actions of ITP and the phenotypic consequences of overexpression of ITPa in Drosophila, and (ii) the detailed analysis of the expression of Gyc76C and the phenotypic consequences of knockdown of Gyc76C expression in Drosophila.

      (2) Furthermore, the paper is generally well-written and the figures are of good quality. 

      We thank this reviewer for highlighting the strengths of this manuscript.

      Weaknesses:  

      (1) The main weakness of this paper is that the data obtained do not prove that Gyc76C acts as a receptor for ITPa. Therefore, the following statement in the abstract is premature: "Using a phylogenetic-driven approach and the ex vivo secretion assay, we identified and functionally characterized Gyc76C, a membrane guanylate cyclase, as an elusive Drosophila ITPa receptor." Further experimental studies are needed to determine if Gyc76C acts as a receptor for ITPa. In the section of the paper headed "Limitations of the study", the authors recognise this weakness. They state "While our phylogenetic analysis, anatomical mapping, and ex vivo and in vivo functional studies all indicate that Gyc76C functions as an ITPa receptor in Drosophila, we were unable to verify that ITPa directly binds to Gyc76C. This was largely due to the lack of a robust and sensitive reporter system to monitor mGC activation." It is not clear what the authors mean by "the lack of a robust and sensitive reporter system to monitor mGC activation". The discovery of mGCs as receptors for ANP in mammals was dependent on the use of assays that measure GC activity in cells (e.g. by measuring cGMP levels in cells). Furthermore, more recently cGMP reporters have been developed. The use of such assays is needed here to investigate directly whether Gyc76C acts as a receptor for ITPa. In summary, insufficient evidence has been obtained to conclude that Gyc76C acts as a receptor for ITPa. Therefore, I think there are two ways forward, either:  

      (a) The authors obtain additional biochemical evidence that ITPa is a ligand for Gyc76C.  

      or  

      (b) The authors substantially revise the conclusions of the paper (in the title, abstract, and throughout the paper) to state that Gyc76C MAY act as a receptor for ITPa, but that additional experiments are needed to prove this. 

      We thank the reviewer for this comment and agree with the two options they propose. We had previously tried a different cGMP reporter (Promega GloSensor cGMP assay) to monitor activation of Gyc76C by ITPa in a heterologous system. Unfortunately, we were not successful in monitoring Gyc76C activation by ITPa. We have now used another cGMP sensor, Green cGull, to show that ITPa can indeed activate Gyc76C heterologously expressed in HEK cells (new Figure 7 and Figure 7 Supplement 1). However, we still cannot rule out the possibility that ITPa can act on additional receptors in vivo. This is based on our ex vivo Malpighian tubule assays (new Figure 6E and F). ITPa inhibits DH31- and LK-stimulated secretion and we show that this effect is abolished when Gyc76C is knocked down specifically in principal and stellate cells, respectively. Interestingly, application of ITPa alone can stimulate secretion when Gyc76C is knocked down in principal cells (new Figure 6E). This could be explained by: 1) the presence of another receptor for ITPa which results in diuretic actions and/or 2) low Gyc76C signaling activity (RNAi-based knockdown lowers signaling but does not abolish it completely), which could alter other intracellular messenger pathways that promote secretion. We have added text to indicate the possibility of other ITPa receptors. Nonetheless, our conclusions are supported by the heterologous assay results which indicate that ITPa can activate Gyc76C. Therefore, we do not alter the title.

      (2) The authors state in the abstract that a phylogenetic-driven approach led to their identification of Gyc76C as a candidate receptor for ITPa. However, there are weaknesses in this claim. Firstly, the hypothesis that Gyc76C may be involved in mediating effects of ITPa was first proposed ten years ago by Nagai et al. 2014, so this surely was the primary basis for investigating this protein. Nevertheless, investigating if there is correspondence in the phylogenetic distribution of ITP-type and Gyc76C-type genes/proteins is a valuable approach to addressing this issue. Unfortunately, the evidence presented is rather limited in scope. Essentially, the authors report that they only found ITP-type and Gyc76C-type genes/proteins in protostomes, but not in deuterostomes. What is needed is a more fine-grained analysis at the species level within the protostomes. Thus, are there protostome species in which both ITP-type and Gyc76C-type genes/proteins have been lost? Furthermore, are there any protostome species in which an ITP-type gene is present but a Gyc76C-type gene is absent, or vice versa? If there are protostome species in which an ITP-type gene is present but a Gyc76C-type gene is absent or vice versa, this would argue against Gyc76C being a receptor for ITPa. In this regard, it is noteworthy that in Figure 2A there are two ITP-type precursors in C. elegans, but there are no Gyc76C-type proteins shown in the tree in Figure 2B. Thus, what is needed is a more detailed analysis of protostomes to investigate if there really is correspondence in the phylogenetic distribution of Gyc76C-type and ITP-type genes at the species level.

      We thank the reviewer for this comment. While the previous study by Nagai et al had implicated Gyc76C in the ITP signaling pathway, how they narrowed down Gyc76C as a candidate was not reported. Therefore, our unbiased phylogenetic approach was necessary to ensure that we identified all suitable candidate receptors. Indeed, our phylogenetic analysis also identified Gyc32E as another candidate ITP receptor. However, we did not pursue this receptor further as our expression data (new Figure 4 Supplement 2) indicated that Gyc32E is not expressed in osmoregulatory tissues and therefore likely does not mediate the osmotic effects of ITPa. 

      We also appreciate the suggestion to perform a more detailed phylogenetic analysis for the peptide and receptor. We did not include C. elegans receptors in the phylogenetic analysis because they tend to be highly evolved and routinely cause long-branch attraction (see: Guerra and Zandawala 2024: https://doi.org/10.1093/gbe/evad108). We (specifically the senior author) have previously excluded C. elegans receptors in the phylogenetic analysis of GnRH and Corazonin receptors for similar reasons (see: Tian and Zandawala et al. 2016: 10.1038/srep28788). 

      Unfortunately, the absence of a gene from a genome is hard to prove, especially when the genome assembly is not as high-quality as those of model systems (e.g. Drosophila and mice). Moreover, given the concern of this reviewer that our physiological and behavioral data on ITPa and Gyc76C only provide correlative evidence, we decided against performing additional phylogenetic analysis, which would also provide only correlative evidence. Our only goal with this analysis was to identify a candidate ITPa receptor. Since we have now functionally characterized this receptor using a heterologous system, we feel that the current phylogenetic analysis has served its purpose.
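
      The species-level comparison requested in comment (2) amounts to a presence/absence cross-tabulation of the two gene types. A minimal sketch in Python follows, assuming a hypothetical table of per-species gene calls; the species labels and values are placeholders for illustration, not data from the manuscript.

```python
# Hypothetical presence/absence calls per species:
# (ITP-type gene detected, Gyc76C-type gene detected).
# Labels and values are illustrative placeholders, not results from the study.
calls = {
    "species_A": (True, True),
    "species_B": (True, False),
    "species_C": (False, True),
    "species_D": (False, False),
}

def cooccurrence(calls):
    """Group species by which of the two gene types were detected."""
    groups = {"both": [], "ligand_only": [], "receptor_only": [], "neither": []}
    for species, (has_ligand, has_receptor) in sorted(calls.items()):
        if has_ligand and has_receptor:
            groups["both"].append(species)
        elif has_ligand:
            groups["ligand_only"].append(species)
        elif has_receptor:
            groups["receptor_only"].append(species)
        else:
            groups["neither"].append(species)
    return groups

groups = cooccurrence(calls)
# Discordant species (one gene type detected without the other) are the
# cases the reviewer asks about.
discordant = groups["ligand_only"] + groups["receptor_only"]
print(discordant)
```

      As the response above points out, a discordant call is only informative when the assembly is complete enough to trust an absence, so any such table would need to be read alongside a measure of genome quality.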

      (3) The manuscript would benefit from a more comprehensive overview and discussion of published literature on Gyc76C in Drosophila, both as a basis for this study and for interpretation of the findings of this study.  

      We thank the reviewer for this comment. We have now included a broader discussion of Gyc76C based on published literature.  

      Reviewer #3 (Public Review):  

      Summary:  

      The goal of this paper is to characterize an anti-diuretic signaling system in insects using Drosophila melanogaster as a model. Specifically, the authors wished to characterize a role of ion transport peptide (ITP) and its isoforms in regulating diverse aspects of physiology and metabolism. The authors combined genetic and comparative genomic approaches with classical physiological techniques and biochemical assays to provide a comprehensive analysis of ITP and its role in regulating fluid balance and metabolic homeostasis in Drosophila. The authors further characterized a previously unrecognized role for Gyc76C as a receptor for ITPa, an amidated isoform of ITP, and in mediating the effects of ITPa on fluid balance and metabolism. The evidence presented in favor of this model is very strong as it combines multiple approaches and employs ideal controls. Taken together, these findings represent an important contribution to the field of insect neuropeptides and neurohormones and have strong relevance for other animals. 

      We thank this reviewer for the positive and thorough assessment of our manuscript.

      Strengths:  

      Many approaches are used to support their model. Experiments were well-controlled, used appropriate statistical analyses, and were interpreted properly and without exaggeration.

      Weaknesses:  

      No major weaknesses were identified by this reviewer. More evidence to support their model would be gained by using a loss-of-function approach with ITPa, and by providing more direct evidence that Gyc76C is the receptor that mediates the effects of ITPa on fat metabolism. However, these weaknesses do not detract from the overall quality of the evidence presented in this manuscript, which is very strong.  

      We agree with this reviewer regarding the need to provide additional evidence using a loss-of-function approach with ITPa. We now characterize the phenotypes following knockdown of ITP in ITP-producing cells (new Figure 9). Our results are in agreement with phenotypes observed following Gyc76C knockdown, lending further support that ITPa mediates its effects via Gyc76C. Unfortunately, we are not able to provide evidence that ITPa acts on Gyc76C in the fat body using the assay suggested by this reviewer (explained in detail below). Instead, we now provide direct evidence of Gyc76C activation by ITPa in a heterologous system (new Figure 7 and Figure 7 Supplement 1).

      Reviewer #1 (Recommendations For The Authors):  

      Here, I have several extra concerns about the work as below:  

      (1) The authors confirmed the function of ITPa in regulating both osmotic and metabolic homeostasis by specifically overexpressing ITPa driven by ITP-RC-Gal4 in adult flies (Figures 5 and 7). Have authors ever tried to knock down ITP in ITP-RC-Gal4 neurons? What was the phenotype? Especially regarding the impact on metabolic homeostasis, does knocking down ITP in ITP neurons mimic the phenotypes of Gyc76C fat body knockdown flies?

      We thank the reviewer for this suggestion. We now characterize the phenotypes following knockdown of ITP using ITP-RC-Gal4 (new Figure 9). Our results are in agreement with phenotypes observed following Gyc76C knockdown, lending further support that ITPa mediates its effects via Gyc76C.

      The authors mentioned that the existing ITP RNAi lines target all three isoforms. It would be interesting if the authors could overexpress ITPa in ITP-RC-Gal4>ITP-RNAi flies and confirm whether any phenotypes induced by ITP knockdown could be rescued. It will further confirm the role of ITPa in homeostasis regulation.

      We thank the reviewer for this suggestion. Unfortunately, this experiment is not straightforward because knockdown with ITP RNAi does not completely abolish ITP expression (see Figure 9A). Hence, the rescue experiment needs to be ideally performed in an ITP mutant background. However, ITP mutation leads to developmental lethality (unpublished observation) so we cannot generate all the flies necessary for this experiment. Therefore, we cannot perform the rescue experiments at this time. In future studies, we hope to perform knockdown of specific ITP isoforms using the transgenes generated here (Xu et al 2023: 10.1038/s41586-023-06833-8).   

      (2) In Figures 5A and B, the authors nicely show the increased release of ITPa under desiccation by quantifying the ITPa immunolabelling intensity in different neuronal populations. It may be induced by the increased neuronal activity of ITPa neurons under the desiccated condition. Have the authors confirmed whether the activity of ITPa-expressing neurons is impacted by desiccation?  

      The TRIC system may be able to detect the different activity of those neurons before and after desiccation. This may further explain the reduced ITPa peptide levels during desiccation.  

      We thank the reviewer for this suggestion. We have now monitored the activity of ITPa-expressing neurons using the CaLexA system (Masuyama et al 2012: 10.3109/01677063.2011.642910). Our results indicate that ITPa neurons are indeed active under desiccation (new Figure 8A and B). These results are also in agreement with ITPa immunolabelling showing increased peptide release during desiccation (new Figure 8C and D). Together, these results show that ITPa neurons are activated and release ITPa under desiccation.  

      (3) What about the intensity of ITPa immunolabelling in other ITPa-positive neurons (e.g., VNC) under desiccation? If there is no change in other ITPa neurons, it will be a good control. 

      We thank the reviewer for this suggestion. Unfortunately, ITPa immunostaining in VNC neurons is extremely weak, preventing accurate quantification of ITPa levels under different conditions. We did hypothesize that ITPa immunolabelling in clock neurons (5<sup>th</sup>-LN<sub>v</sub> and LN<sub>d</sub><sup>ITP</sup>) would not change depending on the osmotic state of the animal. However, our results (Figure 8C and D) indicate that ITPa from these neurons is also released under desiccation. Interestingly, LN<sub>d</sub><sup>ITP</sup>, which also coexpress Neuropeptide F (NPF), have recently been implicated in water seeking during thirst (Ramirez et al, 2025: 10.1101/2025.07.03.662850). Our new connectomic-driven analysis shows that these neurons can receive thermo/hygrosensory inputs (new Figure 13E). Hence, it is conceivable that other ITPa-expressing neurons also release ITPa during thirst/desiccation.

      (4) The adult stage, specifically overexpression of ITPa in ITP neurons, does show significant phenotypes compared to controls in both osmotic and metabolic homeostasis-related assays. It would be helpful if authors could show how much ITPa mRNA levels are increased in the fly heads with ITPa overexpression (under desiccation & starvation or not). 

      We thank the reviewer for this suggestion. We have now included immunohistochemical evidence showing an increase in ITPa peptide levels in flies with ITPa overexpression (new Figure 10A). We feel that this is a better indicator of ITPa signaling level than ITPa mRNA levels.

      (5) Another question concerns the bloated abdomens of ITPa-overexpressing flies. Are the bloated abdomens of ITPa OE female flies (Figure 5E) due to increased ovary size (Figure 7G)? Have the authors also detected similar bloated abdomens in male flies with ITPa overexpression, given that both male and female flies show more release of ITPa during desiccation?

      We thank the reviewer for this comment. The bloated abdomen phenotype seen in females can be attributed to increased water content since we see a similar phenotype in males (see Author response image 1 below).

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):  

      (1) Page 1 - change "Homeostasis is obtained by" to "Homeostasis is achieved by".  

      Changed

      (2) Page 1 - change "Physiological responses" to "Physiological processes". 

      Changed

      (3) Page 2 - Change "Recently, ITPL2 was also shown to mediate anti-diuretic effects via the tachykinin receptor" to "Recently, ITPL2 was also shown to exert anti-diuretic effects via the tachykinin receptor". 

      Changed

      (4) Page 9 - "(C) Adult-specific overexpression of ITPa using ITP-RC-GAL4TS (ITP-RC-T2A-GAL4 combined with temperature-sensitive tubulinGAL80) increases desiccation" Unless I am misunderstanding Fig 5C, I think what is shown is that overexpression of ITPa prolongs survival during a period of desiccation. I am not sure what the authors mean by "increases desiccation". In the text (page 9) the authors state "ITPa overexpression improves desiccation tolerance", which is a much clearer statement than what is in the figure legend.

      We thank the reviewer for identifying this oversight. We have now changed the caption to “increases desiccation tolerance”.  

      (5) Page 11 - The authors conclude that "increased ITPa signaling results in phenotypes that largely mirror those seen following Gyc76C knockdown in the fat body, providing further support that ITPa mediates its effects via Gyc76C." Use of the term "largely mirror" seems inappropriate here because there are opposing effects- e.g. decreased starvation resistance in Figure 6A versus increased starvation resistance in Figure 7A.  

      Perhaps there is a misunderstanding of what is meant by "mirroring" - it means the same, not the opposite. 

      We thank the reviewer for this comment. We agree that the use of the term “largely mirrors” to describe the effects of ITPa overexpression and Gyc76C knockdown is not appropriate and have changed this sentence as follows: “Taken together, the phenotypes seen following Gyc76C knockdown in the fat body largely mirror those seen following ITP knockdown in ITP-RC neurons, providing further support that ITPa mediates its effects via Gyc76C.”

      (6) Page 12 - There appear to be words missing between "neurons during desiccation, as well as their downstream" and "the recently completed FlyWire adult brain connectome" 

      We thank the reviewer for highlighting this mistake. We have changed the sentence as follows: “Having characterized the functions of ITP signaling to the renal tubules and the fat body, we wanted to identify the factors and mechanisms regulating the activity of ITP neurons during desiccation, as well as their downstream neuronal pathways. To address this, we took advantage of the recently completed FlyWire adult brain connectome (Dorkenwald et al., 2024, Schlegel et al., 2024) to identify pre- and post-synaptic partners of ITP neurons.”

      (7) Page 15 - "can release up to a staggering 8 neuropeptides" - I suggest that the word "staggering" is removed. The notion that individual neurons release many neuropeptides is now widely recognised (both in vertebrates and invertebrates) based on analysis of single-cell transcriptomic data. 

      Removed “staggering”.

      (8) Page 16 - "(Farwa and Jean-Paul, 2024)" - this citation needs to be added to the reference list and I think it needs to be changed to "Sajadi and Paluzzi, 2024". 

      We thank the reviewer for highlighting this oversight. The correct citation has now been added.

      (9) It is noteworthy that, based on a PubMed search, there are at least thirteen published papers that report on Gyc76C in Drosophila (PMIDs: 34988396, 32063902, 27642749, 26440503, 24284209, 23862019, 23213443,  21893139, 21350862, 16341244, 15485853, 15282266, 7706258). However, none of these papers are discussed/cited by the authors. This is surprising because the authors' hypothesis that Gyc76C acts as a receptor for ITPa surely needs to be evaluated and discussed with reference to all the published insights into the developmental/physiological roles of this protein. 

      We thank the reviewer for this comment. Some of the references mentioned above (21350862, 16341244, 15485853) mainly report on soluble guanylyl cyclases and not membrane guanylyl cyclase like Gyc76C. Based on other studies on Gyc76C and its role in immunity and development, we have now expanded the discussion on additional roles of ITPa.

      Reviewer #3 (Recommendations For The Authors):  

      I have only a few comments that will help the authors strengthen a couple of aspects of their model.  

      (1) The case for Gyc76C as a receptor for ITPa in regulating fluid homeostasis is clear, given the experiments the authors carried out where they applied ITPa to tubules and showed that the effects of ITPa on tubule secretion were blocked if Gyc76C was absent in tubules. This approach, or something similar, should be used to provide conclusive proof that ITPa's metabolic effects on the fat body go through Gyc76C.  

      At present (unless I missed it) the authors only show that gain of ITPa has the opposite phenotype to fat body-specific loss of Gyc76C. While this would be the expected result if ITPa/Gyc76C is a ligand-receptor pair, it is not quite sufficient to conclusively demonstrate that Gyc76C is definitely the fat body receptor. Ex vivo experiments such as soaking the adult fat body carcasses with and without Gyc76C in ITPa and monitoring fat content via Nile Red could be one way to address this lack of direct evidence. The authors could also make text changes to explicitly mention this lack of conclusive evidence and suggest it as a future direction.

      We thank the reviewer for this comment. We have now conclusively demonstrated that Gyc76C is activated by ITPa in a heterologous assay (new Figure 7 and Figure 7 Supplement 1). With this evidence, we can confidently claim that ITPa can mediate its actions via Gyc76C in various tissues including the Malpighian tubules and fat body. Nonetheless, we liked the suggestion by this reviewer to perform the ex vivo assay and test the effect of ITPa on the fat body. Unfortunately, it is challenging to do this because increased ITPa signaling (chronically using ITPa overexpression) results in increased lipid accumulation in the fat body in vivo. Therefore, we would likely not see the effect of ITPa addition in an ex vivo fat body preparation since lipogenesis will not occur in the absence of glucose. However, ITPa could counteract the effects of other lipolytic factors such as adipokinetic hormone (AKH). To test this hypothesis, we monitored fat content in the fat body incubated with and without AKH (see Author response image 2 below showing representative images from this experiment). Since we did not observe any differences in fat levels between these two conditions, we were unable to test the effects of ITPa on AKH-activity using this assay.

      Author response image 2.

      (2) I did not see any loss of function data for ITPa - is this possible? If so this would strengthen the case for a 1:1 relationship between loss of ligand and loss of receptor. Alternatively, the authors could suggest this as an important future direction. 

      We agree with this reviewer regarding the need to provide additional evidence using a loss-of-function approach with ITPa. We have now characterized the phenotypes following knockdown of ITP in ITP-producing cells (new Figure 9). Our results are in agreement with phenotypes observed following Gyc76C knockdown, lending further support that ITPa mediates its effects via Gyc76C.

      (3) For clarity, please include the sex of all animals in the figure legend. Even though the methods say 'females used unless otherwise indicated' it is still better for the reader to know within the figure legend what sex is displayed. 

      We thank the reviewer for this suggestion and have now included sex of the animals in the figure legends.  

      (4) Please state whether females are mated or not, as this is relevant for taste preferences and food intake. 

      We apologize for this oversight. We used mated females for all experiments. This has now been included in the methods.  

      (5) More discussion on the previous study on metabolic effects of ITP in this study compared with past studies would help readers appreciate any similarities and/or differences between this study and past work (Galikova 2018, 2022) 

      We thank the reviewer for this suggestion. Unfortunately, it is difficult to directly compare our phenotypes with the metabolic effects of ITP reported in Galikova and Klepsatel 2022 because the previous study used a ubiquitous driver (Da-GAL4) to manipulate ITP levels. Ectopically overexpressing ITPa in non-ITP producing cells can result in non-physiological phenotypes. This is evident in their metabolic measurements, where both global overexpression and knockdown of ITP result in reduced glycogen and fat levels, and reduced starvation tolerance. Moreover, ITP-RC-GAL4 used in our study to overexpress and knock down ITPa is more specific than the Da-GAL4 used previously. Da-GAL4 would include other ITP cells (e.g. ITP-RD producing cells). Since ITP is broadly expressed across the animal, it is difficult to parse out the phenotypes of ITPa and other isoforms using manipulations performed with Da-GAL4. We have mentioned this limitation in the results for ITP knockdown as follows: “A previous study employing ubiquitous ITP knockdown and overexpression suggests that Drosophila ITP also regulates feeding and metabolic homeostasis (Galikova and Klepsatel, 2022) in addition to osmotic homeostasis (Galikova et al., 2018). However, given the nature of the genetic manipulations (ectopic ITPa overexpression and knockdown of ITP in all tissues) utilized in those studies, it is difficult to parse the effects of ITP signaling from ITPa-producing neurons.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      What are the overarching principles by which prokaryotic genomes evolve? This fundamental question motivates the investigations in this excellent piece of work. While it is still very common in this field to simply assume that prokaryotic genome evolution can be described by a standard model from mathematical population genetics, and fit the genomic data to such a model, a smaller group of researchers rightly insists that we should not have such preconceived ideas and instead try to carefully look at what the genomic data tell us about how prokaryotic genomes evolve. This is the approach taken by the authors of this work. Lacking a tight theoretical framework, the challenge of such approaches is to devise analysis methods that are robust to all our uncertainties about what the underlying evolutionary dynamics might be.

      The authors here focus on a collection of ~300 single-cell genomes from a relatively well-isolated habitat with relatively simple species composition, i.e. cyanobacteria living in hot springs in Yellowstone National Park, and convincingly demonstrate that the relative simplicity of this habitat increases our ability to interpret what the genomic data tells us about the evolutionary dynamics.

      Using a very thorough and multi-faceted analysis of these data, the authors convincingly show that there are three main species of Synechococcus cyanobacteria living in this habitat, and that apart from very frequent recombination within each species (which is in line with insights from other recent studies) there is also a remarkably frequent occurrence of hybridization events between the different species, and with as yet unidentified other genomes. Moreover, these hybridization events drive much of the diversity within each species. The authors also show convincing evidence that these hybridization events are not neutral but are driven by natural selection.

      Strengths:

      The great strength of this paper is that, by not making any preconceived assumptions about what the evolutionary dynamics is expected to look like, but instead devising careful analysis methods to tease apart what the data tells us about what has happened in the evolution in these genomes, highly novel and unexpected results are obtained, i.e. the major role of hybridization across the 3 main species living in this habitat.

      The analysis is very thorough and reading the detailed supplementary material it is clear that these authors took a lot of care in devising these methods and avoiding the pitfalls that unfortunately affect many other studies in this research area.

      The picture of the evolutionary dynamics of these three Synechococcus species that emerges from this analysis is highly novel and surprising. I think this study is a major stepping stone toward the development of more realistic quantitative theories of genome evolution in prokaryotes.

      The analysis methods that the authors employ are also partially novel and will no doubt be very valuable for analysis of many other datasets.

      We thank the reviewer for their appreciation of our work.

      Weaknesses:

      I feel the main weakness of this paper is that the presentation is structured such that it is extremely difficult to read. I feel readers have essentially no chance to understand the main text without first fully reading the 50-page supplement with methods and 31 supplementary materials. I think this will unfortunately strongly narrow the audience for this paper and below in the recommendations for the authors I make some suggestions as to how this might be improved.

      A very interesting observation is that a lot of hybridization events (i.e. about half) originate from species other than the alpha, beta, and gamma Synechococcus species from which the genomes that are analyzed here derive. For this to occur, these other species must presumably also be living in the same habitat and must be relatively abundant. But if they are, why are they not being captured by the sampling? I did not see a clear explanation for this very common occurrence of hybridization events from outside of these Synechococcus species. The authors raise the possibility that these other species used to live in these hot springs but are now extinct. I'm not sure how plausible this is and wonder if there would be some way to find support for this in the data (e.g. that one does not observe recent events of import from one of these unknown other species). This was one major finding that I believe went without a clear interpretation.

      We agree with the reviewer that the extent of hybridization with other species is surprising. While we do feel that our metagenome data provide convincing evidence that “X” species are not present in MS or OS, we cannot currently rule out the presence of X in other springs. In the revision we explicitly mention the alternative hypothesis (Lines 239-242).

      The core entities in the paper are groups of orthologous genes that show clear evidence of hybridization. It is thus very frustrating that exactly the methods for identifying and classifying these hybridization events were really difficult to understand (sections I and V of the supplement). Even after several readings, I was unsure of exactly how orthogroups were classified, i.e. what the difference between M and X clusters is, what a `simple hybrid' corresponds to (as opposed to complex hybrids?), what precisely the definitions of singlet and non-singlet hybrids are, etcetera. It also seems that some numbers reported in the main text do not match what is shown in the supplement. For example, the main text talks about "around 80 genes with more than three clusters (SM, Sec. V; fig. S17).", but there is no group with around 80 genes shown in Fig S17! And similarly, it says "We found several dozen (100 in α and 84 in β) simple hybrid loci" and I also cannot match those numbers to what is shown in the supplement. I am convinced that what the authors did probably made sense. But as a reader, it is frustrating that when one tries to understand the results in detail, it is very difficult to understand what exactly is going on. I mention this example in detail because the hybrid classification is the core of this paper, but I had similar problems in other sections.

      We thank the reviewer for pointing out these issues with our original presentation. In the revision, we have redone most of the analysis to simplify the methods and check the consistency of the results. We did not find any qualitative differences in our results after reanalysis, but some of the numbers for different hybridization patterns have changed. The most notable difference is an increase in the number of alpha-gamma simple hybrids and a corresponding decrease in mixed-species clusters (now labeled mosaic hybrids). These transfers are difficult to assign because we only have access to a single gamma genome. We have added a short explanation of this point in Lines 219-222.

      To improve the presentation, we significantly expanded the “Results” section to better explain our analysis and the different steps we take. We included two additional figures (Figs. 3 and 4) that illustrate the different types of hybrids and the heterogeneity in the diversity of alpha which is discussed in the main text and is important for interpreting our results. We also included two additional figures (Figs. 2 and 6) that were previously in the Appendix but were mentioned in the main text. We believe these changes should address most of the issues raised by the reviewer and hopefully make the manuscript easier to read.

      Although I generally was quite convinced by the methods and it was clear that the authors were doing a very thorough job, there were some instances where I did not understand the analysis. For example, the way orthogroups were built is very much along the lines used by many in the field (i.e. orthoMCL on the graph of pairwise matchings, building phylogenies of connected components of the graph, splitting the phylogenies along long branches). But then to subdivide orthogroups into clusters of different species, the authors did not use the phylogenetic tree already built but instead used an ad hoc pairwise hierarchical average linkage clustering algorithm.

      The reviewer is correct that there is an unexplained discrepancy between the clustering methods we used at different steps in our pipeline. We followed previous work by using phylogenetic distances for the initial clustering of orthogroups. On these scales we expect hybridization to play a minor role and phylogenetic distances to correlate reasonably well with evolutionary divergence. However, because of the extensive hybridization we observed, the use of phylogenetic models for species clustering is more difficult to justify. We therefore chose to simply use pairwise nucleotide distances, which make fewer assumptions about the underlying evolutionary processes and should be more robust. We have briefly explained our reasoning and the details of our clustering method in the revision (Lines 182-190).
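
      As an illustration of the kind of clustering described in this reply, the sketch below groups aligned gene sequences by average linkage on pairwise nucleotide distances. It is a minimal example and not the pipeline used in the manuscript: the function names, the gap handling, and the distance cutoff are all assumptions made for the example.

```python
from itertools import combinations


def pairwise_distance(a, b):
    """Fraction of differing sites between two aligned sequences, ignoring gapped columns."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 1.0  # assumed convention: no shared sites means maximal distance
    return sum(x != y for x, y in compared) / len(compared)


def average_linkage(seqs, cutoff):
    """Merge the closest pair of clusters until the smallest average distance exceeds `cutoff`.

    `seqs` maps a (hypothetical) cell or gene name to its aligned sequence.
    """
    names = list(seqs)
    dist = {
        frozenset(pair): pairwise_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def avg(c1, c2):
        # Average of all between-cluster pairwise distances
        return sum(dist[frozenset((i, j))] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]),
        )
        if avg(clusters[i], clusters[j]) > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]
```

      Unlike a tree-based split, this procedure needs only the distance matrix, so it makes no assumption about a shared genealogy across sites.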

      Reviewer #2 (Public Review):

      Summary:

      Birzu et al. describe two sympatric hotspring cyanobacterial species ("alpha" and "beta") and infer recombination across the genome, including inter-species recombination events (hybridization) based on single-cell genome sequencing. The evidence for hybridization is strong and the authors took care to control for artefacts such as contamination during sequencing library preparation. Despite hybridization, the species remain genetically distinct from each other. The authors also present evidence for selective sweeps of genes across both species - a phenomenon which is widely observed for antibiotic resistance genes in pathogens, but rarely documented in environmental bacteria.

      Strengths:

      This manuscript describes some of the most thorough and convincing evidence to date of recombination happening within and between cohabitating bacteria in nature. Their single-cell sequencing approach allows them to sample the genetic diversity from two dominant species. Although single-cell genome sequences are incomplete, they contain much more information about genetic linkage than typical short-read shotgun metagenomes, enabling a reliable analysis of recombination. The authors also go to great lengths to quality-filter the single-cell sequencing data and to exclude contamination and read mismapping as major drivers of the signal of recombination.

      We thank the reviewer for their appreciation of our work.

      Weaknesses:

      Despite the very thorough and extensive analyses, many of the methods are bespoke and rely on reasonable but often arbitrary cutoffs (e.g. for defining gene sequence clusters etc.). Much of this is warranted, given the unique challenges of working with single-cell genome sequences, which are often quite fragmented and incomplete (30-70% of the genome covered). I think the challenges of working with this single-cell data should be addressed up-front in the main text, which would help justify the choices made for the analysis.

      We have significantly expanded the “Results” section to better justify and explain the choices we made during our analysis. We hope these changes address the reviewer’s concerns and make the manuscript more accessible to readers.

      The conclusions could also be strengthened by an analysis restricted to only a subset of the highest quality (>70% complete) genomes. Even if this results in a much smaller sample size, it could enable more standard phylogenetic methods to be applied, which could give meaningful support to the conclusions even if applied to just ~10 genomes or so from each species. By building phylogenetic trees, recombination events could be supported using bootstraps, which would add confidence to the gene sequence clustering-based analyses which rely on arbitrary cutoffs without explicit measures of support.

      It seems to us that the reviewer’s suggestion presupposes that the recombination events we find can be described as discrete events on an asexual phylogeny, similar to how rare mutations are treated in standard phylogenetic inference. Popular tools, such as ClonalFrame and its offshoots, have attempted to identify individual recombination events starting from these assumptions. But the main conclusion of both our linkage and SNP block analysis is that the ClonalFrame assumptions do not hold for our data. Under a clonal frame, the SNP blocks we observe should be perfectly linked, similar to mutations on an asexual tree. But our results in Fig. 7D show the opposite. Part of the issue may have been that in our original presentation, we only briefly discuss the results of our linkage analysis and refer readers to the Appendix for more details. To fix this issue we have added an extra figure (Fig. 2), showing rapid linkage decrease in both species and that at long distances the linkage values are essentially identical to the unlinked case, similar to sexual populations. We hope that this change will help clarify this point.
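
      To make the linkage argument concrete, the sketch below computes a standard linkage statistic (the squared correlation r² between pairs of biallelic sites) and averages it in bins of genomic distance. This is a generic illustration: the exact statistic used in the manuscript is not stated in this reply, and the 0/1 allele coding, the use of `None` for sites not covered in a cell, and the bin size are assumptions.

```python
def r_squared(site_a, site_b):
    """Squared correlation between two biallelic sites coded 0/1 across cells.

    `None` marks a cell in which the site was not covered; such cells are skipped.
    Returns None when either site is monomorphic among the shared cells.
    """
    pairs = [(a, b) for a, b in zip(site_a, site_b) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return None
    pa = sum(a for a, _ in pairs) / n
    pb = sum(b for _, b in pairs) / n
    pab = sum(a * b for a, b in pairs) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return None
    return (pab - pa * pb) ** 2 / denom


def mean_r2_by_distance(positions, sites, bin_size):
    """Average r^2 over all site pairs, grouped by genomic separation in units of `bin_size`."""
    sums = {}
    for i in range(len(sites)):
        for j in range(i + 1, len(sites)):
            r2 = r_squared(sites[i], sites[j])
            if r2 is None:
                continue
            b = abs(positions[j] - positions[i]) // bin_size
            total, count = sums.get(b, (0.0, 0))
            sums[b] = (total + r2, count + 1)
    return {b: total / count for b, (total, count) in sorted(sums.items())}
```

      Under a strictly clonal genealogy, blocks of SNPs that arose on the same branch would keep r² near one regardless of separation; a curve that falls to the unlinked expectation at long distances is what distinguishes the recombining case.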

      The manuscript closes with a cartoon (Figure 4) which outlines the broad evolutionary scenario supported by the data and analysis. I agree with the overall picture, but I do think that some of the temporal ordering of events, especially the timing of recombination events could be better supported by data. In particular, is there evidence that inter-species recombination events are increasing or decreasing over time? Are they currently at steady-state? This would help clarify whether a newly arrived species into the caldera experiences an initial burst of accepting DNA from already-present species (perhaps involving locally adaptive alleles), or whether recombination events are relatively constant over time.

      The reviewer raises some very interesting questions about the dynamics of recombination in the population, which we hope to pursue in future work. We have added this as an open question in the Discussion (Lines 365-382).

      These questions could be answered by counting recombination events that occur deeper or more recently in a phylogenetic tree.

      The reviewer here seems to presuppose that recombination is rare enough that a phylogenetic tree can reliably be inferred, which is contrary to our linkage analysis (see the response to an earlier comment). Perhaps the reviewer missed this point in our original manuscript since it was discussed primarily in the Appendix. See also our response to a previous comment by the reviewer.

      The cartoon also shows a 'purple' species that is initially present, then donates some DNA to the 'blue' species before going extinct. In this model, 'purple' DNA should also be donated to the more recently arrived 'orange' species, in proportion to its frequency in the 'blue' genome. This is a relatively subtle detail, but it could be tested in the real data, and this may actually help discern the order of the inferred recombination events.

      We have included an extra figure in the main text (Fig. 6) that addresses the question of timing of events. A quantitative test of our cartoon model along the lines the reviewer suggested would certainly be worthwhile and we hope to do that in future work.  

      The abstract also makes a bold claim that is not well-supported by the data: "This widespread mixing is contrary to the prevailing view that ecological barriers can maintain cohesive bacterial species..." In fact, the two species are cohesive in the sense that they are identifiable based on clustering of genome-wide genetic diversity (as shown in Fig 1A). I agree that the mixing is 'widespread' in the sense that it occurs across the genome (as shown in Figure 2A) but it is clearly not sufficient to erode species boundaries. So I believe the data is consistent with a Biological Species Concept (sensu Bobay & Ochman, Genome Biology & Evolution 2017) that remains 'fuzzy' - such that there are still inter-species recombination events, just not sufficient to erode the cohesion of genomic clusters. Therefore, I think the data supports the emerging picture of most bacteria abiding by some version of a BSC, and is not particularly 'contrary' to the prevailing view.

      We have revised the phrase mentioned by the reviewer to “prevent genetic mixture between bacterial species,” which more accurately represents our conclusions. 

      The final Results paragraph begins by posing a question about epistatic interactions, but fails to provide a definitive answer to the extent of epistasis in these genomes. Quantifying epistatic effects in bacterial genomes is certainly of interest, but might be beyond the scope of this paper. This could be a Discussion point rather than an underdeveloped section of the Results.

      We agree with the reviewer that an exhaustive analysis of epistasis in the population is beyond the scope of the manuscript. Our original intention was to answer whether SNP blocks we discovered showed evidence of strong linkage, as might be expected if only a small number of strains are present in the population. In light of the previous comments by the reviewer regarding the consistency with the clonal frame hypothesis, we believe this is especially relevant for our results. Moreover, the results we found, especially for the beta population, were quite conclusive: SNP block linkages in beta are indistinguishable from an unlinked model. To avoid misdirecting the reader about the significance of our results, we have revised the relevant paragraph (Lines 316-319).

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      Although I am entirely convinced of the validity of the results, methodology, and interpretations presented in this work, I must say I found the paper very hard to read. And I think I am really quite familiar with these kinds of approaches. I fear that for people other than experts on these kinds of comparative genomic analyses, this paper will be almost impossible to read. With the aim of expanding the audience for this compelling work, I think the authors might want to consider ways to improve the presentation.

      At the end of a long project, the obtained results typically form a web of mutual interconnections and dependencies and one of the key challenges in presenting the results in a paper is having to untangle this web of connected results and analysis into a linear ordered narrative so that, at any point in the narrative, understanding the next point only depends on previous points in the narrative. I frankly feel that this paper fails at this.

      The paper reads to me as if one author put together the supplement by essentially writing a report of all the analyses that were done together with supplementary figures summarizing all those analyses, and that another author then wrote the main text by using the materials in the supplement almost in the way a cook uses ingredients for a dish. Almost every other sentence in the main text refers to results in the (31!) supplementary figures and can only be understood by reading the appropriate corresponding sections in the supplementary materials. I found it essentially impossible to read the main text without having first read the entire 50-page supplement.

      I think the paper could be hugely improved by trying to restructure the presentation so as to make it more linear. The main text can be expanded to include a summary of the crucial methods and analysis results from the supplement needed to understand the narrative in the main text. For example, as it currently stands it is really challenging to understand what is shown in figures 2 and 3 of the main text without having to first read a very substantial part of the supplement. Figure 3, even after having read the relevant sections in the supplement, took me quite a while to understand and almost felt like a puzzle to decipher. Rethinking which parts of the supplement are really necessary would also help. Finally, it would also help if the terminology was kept as simple, transparent, and consistent as possible.

      I understand that my suggestion to thoroughly reorganize the presentation may feel like a big hassle, but I am afraid that in its current form, these important results are essentially rendered inaccessible to all but a small group of experts in this area. This paper deserves a wider readership.

      We thank the reviewer for these valuable suggestions. In the revision, we have significantly expanded and restructured the “Results” section to make the presentation more linear, as the reviewer suggested (see our reply to the public comment by the reviewer for details). We hope these changes will make the manuscript easier to read.

      Reviewer #2 (Recommendations For The Authors):

      I found this paper challenging to follow since the main text was so condensed and the supplementary material so extensive. Given that eLife does not impose strong limits on the length of the main text, I suggest moving some key sections from the supplement into the main text to make it easier for the reader to follow rather than flipping back and forth. Adding to the confusion, supplementary figures were referenced out of order in the main text (e.g. S23 is referenced before S1). Please check the numbering and ensure figures are mentioned in the main text in the correct order.

      We thank the reviewer for their feedback on the presentation of the results. In response to similar comments from Reviewer #1, we have significantly expanded and restructured the “Results” section to make it easier to read (see also our responses to Reviewer #1).

      Page 2: The term 'coevolution' is typically reserved for two species that mutually impose selective pressures on one another (e.g. predator-prey interactions; see Janzen, Evolution 1980). In the context of these two cyanobacterial species, it's not clear that this is the case so I would simply refer to them as 'cohabitating' or as being sympatric in the same environment.

      It is true that the term “coevolution” has become associated with predator-prey interactions, as the reviewer said. However, we feel that in our case “coevolution” fairly accurately describes the continual hybridization over long time scales we observe. We have therefore chosen to keep the term.

      Page 3: The authors mention that the gamma SAG is ~70% complete, which turns out to be quite high. It would be useful to mention early in the Results the mean/median completeness across SAGs, and how this leads to some challenges in analysing the data. Some of the material from the Supplement could be moved into the Results here.

      We have added a short note on the completeness in the Results (Lines 153-154). We have also added an extra figure in Appendix 1 with the completeness of all the SAGs for interested readers.

      I was left puzzled by the sentence: "Alternatively, high rates of recombination could generate different genotypes within each genome cluster that are adapted to different temperatures, with the relative frequencies of each cluster being only a correlated and not a causal driver of temperature adaptation." This is suggesting that individual genes or alleles, rather than entire genomes, could be adapted to temperature. But figure 1B seems to imply that the entire genome is adapted to different temperatures. Anyway, this does not seem to be a key point and could probably be removed (or clarified if the authors deem this an important point, which I failed to understand).

      We have revised this section to clarify the alternative hypothesis mentioned by the reviewer (Lines 100-103).

      Page 4. 'Several dozen' hybrid genes were found, but please also specify how many genes were tested. In general, it would be good to briefly outline the sample size (SAGs or genes) considered for each analysis.

      We have added the total numbers of genes we analyzed at each step of our analysis.

      'Mosaic hybrid loci' are mentioned alongside the issue of poor alignment. Presumably, the mosaic hybrid loci are first filtered to remove the poor alignments? This should be specified, and please mention how many loci are retained before/after this filter.

      We thank the reviewer for highlighting this important point. In the revision, we have implemented a more aggressive filtering of genes with poor alignments. We have added an extra paragraph to Appendix 1 (step 5 in the pipeline analysis) briefly explaining the issue.

      Page 5. "By contrast, the diversity of mosaic loci was typical of other loci within beta, suggesting most of the beta genome has undergone hybridization." Please point to the data (figure) to support this statement.

      We have restructured our discussion of the different hybrid loci so this comment is no longer relevant. In case the reviewer is interested, the synonymous diversity within beta was 0.047, while in mosaic hybrids it was 0.064.

      Page 6. "The largest diversity trough contained 28 genes." Since this trough is discussed in detail and seems to be of interest, it would be nice to illustrate it, perhaps as an inset in Figure 2 or as a separate figure. If I understood correctly, this trough includes genes (in a nitrogen-fixation pathway) that are present in all genomes, but are exchanged by homologous recombination. So I don't think it's correct to say that the "ancestors acquired the ability to fix nitrogen." Rather, the different alleles of these same genes were present in the ancestor. So perhaps there was a selective sweep involving alleles in this region that provided adaptation to local nitrogen sources or concentrations, but not a gain of new genes. Perhaps I misunderstood, in which case clarification would be appreciated.

      The reviewer raises an interesting possibility. We agree that it is in principle possible that the ancestor contained the nitrogen fixation genes and the selective sweep simply replaced the ancestral alleles. In this particular case, there is additional evidence from gene order that the entire pathway was acquired at roughly the same time. The gene order between alpha and beta is almost entirely different, with only a few segments containing more than 2-3 genes in the same order, as shown by Bhaya et al. 2007 and confirmed by additional unpublished analysis of the SAGs. One of the few exceptions is the nitrogen fixation pathway, which has essentially the same gene order over more than 20 kbp. Thus, if the ancestor of both alpha and beta contained the nitrogen-fixation pathway, we would expect these genes to be scattered across the genome. We have revised the sentences in question to clarify this point (Lines 260-271).

      Page 6. Last paragraph on epistasis references Fig 3C, but I believe it should be Fig 3D.

      Fixed.

      Page 7. Figure 3 legend. "Note that alpha-2 is identical to gamma here." I believe it should be beta, not gamma.

      The reviewer is correct. We have fixed this error.

      Page 8. What is the evidence for "at least six independent colonizers"? I could not find the data supporting this claim.

      The statement mentioned by the reviewer was based on the maximum number of species clusters we identified in different core genes. However, during the revision, we found that only a handful of genes contained five or more clusters. We did find several tens of genes with four clusters. In addition, Rosen et al. (2018) also found additional 16S clusters at low frequency in the same springs. Based on these results we conservatively estimate that at least four independent strains colonized the caldera, but the number could be much greater. We have revised the text in question accordingly (Lines 336-339) and added Fig. 2 in Appendix 1 to support the conclusion.

      Page 9. Line 200: "acting to homogenize the population." It should be specified that the population is only homogenized at these introgressed loci, not genome-wide. Otherwise, the genome-wide species clusters seen in Fig 1 would not be maintained.

      It is true that the selective sweeps that lead to diversity troughs only homogenize the introgressed loci. But other hybrid segments could also rise to high frequency in the population during the sweep through hitchhiking. The fact that we observe SNP blocks generated through secondary recombination events of introgressed segments throughout the genome supports this view. While we do not fully understand the dynamics of this process currently, we do feel that the current evidence supports the statement that mixing is occurring throughout the genome and not just at a few loci, so we have kept the original statement.

      The final sentence (lines 221-222) is vague and uninformative. On the one hand, "investigating whether hybridization plays a major role" is what the current manuscript has already done - depending on what is meant by 'major' (how much of the genome? Or whether there are ecological implications?). It is also not clear what is meant by a predictive theory and 'possible evolutionary scenarios'. This should be elaborated upon, otherwise it is not clear what the authors mean. Alternatively, this sentence could be cut.

      We thank the reviewer for their feedback. One possible source of confusion could be that in this sentence we were referring to detecting hybridization in other communities. We have changed “these communities” to “other communities” to make this clearer.

      Supplement.

      Broadly speaking, I appreciate the thorough and careful analysis of the single cell data. On the other hand, it is hard to evaluate whether these custom analyses are doing what is intended in many cases. Would it be possible to consider an analysis using more established methods, e.g. taking a subset of genomes with 'good' completeness and using Panaroo to find the core and accessory genome, then ClonalFrameML or Gubbins to infer a phylogeny and recombination events? Such analyses could probably be applied to a subset of the sample with relatively complete genomes. I don't want to suggest an overly time-consuming analysis, but the authors could consider what would be feasible.

      We have added a comparison between our analysis and that from two other methods, including ClonalFrameML mentioned by the reviewer. One important point that we feel might have been lost in the first version is that our linkage results imply that recombination is not rare enough to be mapped onto an asexual tree, as ClonalFrameML assumes. Note that this is not simply a technical limitation caused by incomplete coverage and is instead a consequence of the evolutionary dynamics of the population. Consistent with this, we found several inconsistencies in how recombination events were assigned by ClonalFrameML. We have summarized these conclusions in Appendix 7 of the revised manuscript.

      Page 8. Line 190. What is meant by 'minimal compositional bias'?

      We mean that the sample is not biased towards strains that grow in the lab. We have revised the sentence to clarify.

      Page 25. Figure S14 is not referenced in the text.

      We have added part of this figure to the main text since it illustrates one of our main results, namely that sites at long genomic distances are essentially unlinked.

      Page 26. The 'unlinked controls' (line 530) are very useful, but it would be even more informative to see if these controls also show the same decline in linkage with distance in the genome as observed in the real data. In particular, it would be good to know if the observed rapid decline in linkage with distance in the low-diversity regions is also observed in controls. Currently, it is unclear if this observation might be due to higher uncertainty in inferring linkage in low-diversity regions, which by definition have less polymorphism to include in the linkage calculation.

      We thank the reviewer for the suggestion. After further consideration, we have decided to remove the subsection on linkage decrease in the low-diversity regions. We feel such detailed quantitative analysis would be better suited for a more technical paper, which we hope to do at a later time.

      Page 26. There are some sections with missing identifiers (Sec ??).

      Fixed.

      Page 27. The information about the typical breadth of SAG coverage (~30%) would be better to include earlier in the Supplement, and also mentioned in the main text so the reader can more easily understand the nature of the dataset.

      We have added an extra figure with the SAG coverages to Appendix 1.

      Page 29. Any sensitivity analysis around the S = 0.9 value? Even if arbitrary, could the authors provide justification why they think this value is reasonable?

      We have significantly revised this section in response to earlier comments by one of the reviewers. We hope that this will clarify the details of our methods for interested readers. To answer the reviewer’s specific question, we chose this heuristic after examining the fraction of cells of each species in different species clusters. For the clusters assigned to alpha and beta, we found a sharp peak near one, and a cutoff of 0.9 captured most clusters while still being high enough to be inconsistent with a mixed cluster.
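
      The heuristic described in this reply can be sketched as a purity test on each sequence cluster. The example is illustrative only: the function name, the label `"mixed"`, and the input format (one species label per cell in the cluster) are assumptions, and only the 0.9 cutoff comes from the text.

```python
from collections import Counter


def assign_cluster(cell_species, cutoff=0.9):
    """Assign a cluster to its majority species if that species reaches `cutoff`.

    `cell_species` lists the species label of every cell in the cluster.
    Returns (assignment, majority_fraction); assignment is "mixed" below the cutoff.
    """
    if not cell_species:
        raise ValueError("cluster contains no cells")
    species, count = Counter(cell_species).most_common(1)[0]
    fraction = count / len(cell_species)
    return (species if fraction >= cutoff else "mixed", fraction)
```

      A simple sensitivity check in this framing is to repeat the assignment over a range of cutoffs and count how many clusters change label.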

      Page 30. I could not see where Fig. S17 was mentioned in the text. Also, how are 'simple hybrid genes' defined?

      We have removed this figure in the revision. The definitions of the different types of hybrid genes have been added to the main text in response to a comment from the other reviewer.

      Page 36. It is hard to see that divergence is 'high' relative to what reference. Would it be possible to include the expected value (from ref. 12) in the plot, or at least explicitly mentioned in the text?

      We have added the mean synonymous and non-synonymous divergences between alpha and beta to the figures for reference.

      Page 38. Line 770 "would be comparable to that of beta." This is not necessarily the case since beta could have a different time to its most recent common ancestor. It could have a different time to the last bottleneck or selective sweep, etc.

      We thank the reviewer for pointing out this misleading statement. Our point here was that in the first scenario the TMRCA of alpha and beta would be similar since the diversity in the high-diversity alpha genes is similar to beta. We have clarified this statement in the revision.

      Page 39. Line 793. The use of the term 'genomic backbone' implies the presence of a clonal frame, which is not what the data seems to support. Perhaps another term such as 'genetic diversity' would more appropriately capture the intended meaning here.

      We agree with the reviewer that the low-diversity regions may not be asexual. We used “genomic backbone” to distinguish from the “clonal frame,” which is usually used to mean that the backbone is asexual. We have added a note in the revision to clarify this point.

      Page 39. Lines 802-805. I found this explanation hard to follow. Could the logic be clarified?

      We simply meant that although the beta distribution is unimodal, it is not consistent with a simple Poisson distribution, unlike in alpha. We have added an extra sentence to clarify this.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #2 (Public review):

      In this valuable manuscript, Lin et al attempt to examine the role of long non coding RNAs (lncRNAs) in human evolution, through a set of population genetics and functional genomics analyses that leverage existing datasets and tools. Although the methods are incomplete and at times inadequate, the results nonetheless point towards a possible contribution of long non coding RNAs to shaping humans, and suggest clear directions for future, more rigorous study.

      Comments on revisions:

      I thank the authors for their revision and changes in response to previous rounds of comments. As it had been nearly two years since I last saw the manuscript, I reread the full text to familiarise myself again with the findings presented. While I appreciate the changes made and think they have strengthened the manuscript, I still find parts of it a bit too speculative or hyperbolic. In particular, I think claims of evolutionary acceleration and adaptation require more careful integration with existing human/chimpanzee genetics and functional genomics literature.

      We sincerely thank the reviewer for their great patience and valuable comments, which have helped us further improve the manuscript. Before responding to comments point by point, we provide a summary here.

      (1) On parameters and cutoffs.

      Parameters and cutoffs influence data analysis. The large number of Supplementary Notes, Supplementary Figures, and Supplementary Tables indicates that we paid great attention to the influence of parameters and robustness of analyses. Specifically, here we explain the DBS sequence distance cutoff of 0.034, which determines the top 20% genes that most differentiate humans from chimpanzees and influences the gene set enrichment analysis (Figure 2). As described in the revised manuscript, we estimated this cutoff based on Song et al., verified its rationality based on Prufer et al. (Song et al. 2021; Prufer et al. 2017), and measured its influence by examining slightly different cutoff values (e.g., 0.035).

      (2) Analyses of HS TFs and HS TF DBSs.

      It is desirable to compare the contribution of HS lncRNAs and HS TFs to human evolution. Identifying HS TFs faces the challenges that different institutions (e.g., NCBI and Ensembl) annotate orthologous genes using different criteria, and that multiple human TF lists have been published by different research groups. Recently, Kirilenko et al. identified orthologous genes in hundreds of placental mammals and birds and organized different types of genes into datasets of pairwise comparison (e.g., hg38-panTro6) using humans and mice as references (Kirilenko et al. Integrating gene annotation with orthology inference at scale. Science 2023). Based on (a) the many2zero and one2zero gene lists in the “hg38-panTro6” dataset, (b) three human TF lists reported by two studies (Bahram et al. 2015; Lambert et al. 2018) and used in the SCENIC package, we identified HS TFs. The number of HS TFs and HS lncRNAs (5 vs 66) alone strongly suggests that HS lncRNAs have contributed more significantly to human evolution than HS TFs (note that 5 is the union of three intersections between <many2zero + one2zero> and the three <human TF list>).
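
      The counting rule in the closing note (a union of three intersections) can be written out as set operations. The sketch below uses placeholder gene names, not real data, and the function name is hypothetical.

```python
def human_specific_tfs(many2zero, one2zero, tf_lists):
    """Union over TF lists of (human-specific genes ∩ TF list).

    `many2zero` and `one2zero` are gene collections taken from a pairwise
    orthology dataset; `tf_lists` is an iterable of published TF gene lists.
    """
    hs_genes = set(many2zero) | set(one2zero)
    result = set()
    for tf_list in tf_lists:
        result |= hs_genes & set(tf_list)
    return result


# Placeholder identifiers for illustration only
example = human_specific_tfs(
    many2zero=["GENE_A", "GENE_B"],
    one2zero=["GENE_C"],
    tf_lists=[["GENE_A", "GENE_X"], ["GENE_C", "GENE_Y"], ["GENE_A", "GENE_C"]],
)
```

      Because the result is a union, a gene counts as an HS TF if it appears in any one of the TF lists; requiring membership in all three lists would give a smaller or equal set.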

      TF DBS (i.e., TFBS) prediction has also been challenging because they are very short (mostly about 10 bp) and TF-DNA binding involves many cofactors (Bianchi et al. Zincore, an atypical coregulator, binds zinc finger transcription factors to control gene expression. Science 2025). We used two TF DBS prediction programs to predict HS TF DBSs, including the well-established FIMO program (whose results have been incorporated into the JASPAR database) (Rauluseviciute et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. NAR 2023) and the recently reported CellOracle program (Kamimoto et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 2023). Then, we performed downstream analyses and obtained two major results. One is that on average (per base), fewer selection signals are detected in HS TF DBSs (although caution is needed because TF DBSs are very short); the other is that HS TFs and HS lncRNAs contribute to human evolution in quite different ways (Supplementary Figs. 25 and 26).

      (3) On genes with more transcripts may appear as spurious targets of HS lncRNAs.

      Now, the results of HS TF DBSs allow us to address the question of whether genes with more transcripts may appear as spurious targets of HS lncRNAs. We note that (a) we predicted HS lncRNA DBSs and HS TF DBSs in the same promoter regions before the same 179128 Ensembl-annotated transcripts (release 79), (b) we used the same GTEx transcript expression matrices in the analyses of HS TF DBSs and HS lncRNA DBSs (the GTEx database includes gene expression matrices and transcript expression matrices, the latter includes multiple transcripts of a gene). Thus, the analyses of HS TF DBSs provide an effective control for examining the question of whether genes with more transcripts may appear as spurious targets of HS lncRNAs, and consequently, cause the high percentages of HS lncRNA-target transcript pairs that show correlated expression in the brain (Figure 3). We find that the percentages of HS TF-target transcript pairs that show correlated expression are also high in the brain, but the whole profile in GTEx tissues is significantly different from that of HS lncRNA DBSs (Figure 3A; Supplementary Figure 25). On the other hand, on the distribution of significantly changed DBSs in GTEx tissues, the difference between HS lncRNA DBSs and HS TF DBSs is more apparent (Figure 3B; Supplementary Figure 26). Together, these suggest that the brain-enriched distribution of co-expressed HS lncRNA-target transcript pairs must arise from HS lncRNA-mediated transcriptional regulation rather than from the transcript number difference.

      (4) Additional notes on HS TFs and HS TF DBSs.

      First, the “many2zero” and “one2zero” gene lists in the “hg38-panTro6” dataset of Kirilenko et al. provide the most up-to-date, but not the most complete, data on human-specific genes because “hg38-panTro6” is a pairwise comparison. On the other hand, the Ensembl database also annotates orthologous genes, but lacks pairwise comparisons such as “hg38-panTro6”. Therefore, not all HS genes based on “hg38-panTro6” agree with orthologous genes in the Ensembl database. Second, if HS genes are identified based on both Ensembl and Kirilenko et al., HS TFs will be fewer.

      (5) On speculative or hyperbolic claims.

      First, the title “Human-specific lncRNAs contributed critically to human evolution by distinctly regulating gene expression” is now further supported by the analyses of HS TF DBSs. Second, we have carefully revised the entire manuscript, trying to make it more readable, accurate, logically reasonable, and biologically acceptable. Third, specifically, in the revision, we avoid speculative or hyperbolic claims in results, interpretations, and discussions as far as we can. This includes toning down statements and claims, for example, using “reshape” to replace “rewire” and using “suggest” to replace “indicate”. Since the revisions are pervasive, we do not mark all of them, except those that are directly relevant to the reviewer’s comments.

      (1) Line 155: "About 5% of genes have significant sequence differences in humans and chimpanzees," This statement needs a citation, and a definition of what is meant by 'significant', especially as multiple lines below instead mention how it's not clear how many differences matter, or which of them, etc.

      Different studies give different estimates, from 1.24% (Ebersberger et al. Genomewide Comparison of DNA Sequences between Humans and Chimpanzees. Am J Hum Genet. 2002) to 5% (Britten RJ. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. PNAS 2002). The 5% for significant gene sequence differences arises when considering a broader range of genetic variations, particularly insertions and deletions of genetic material (indels). To provide more accurate information, we have replaced this simple statement with a more comprehensive one and cited the above two papers.

      (2) line 187: "Notably, 97.81% of the 105141 strong DBSs have counterparts in chimpanzees, suggesting that these DBSs are similar to HARs in evolution and have undergone human-specific evolution." I do not see any support for the inference here. Identifying HARs and acceleration relies on a far more thorough methodology than what's being presented here. Even generously, pairwise comparison between two taxa only cannot polarise the direction of differences; inferring human-specific change requires outgroups beyond chimpanzee.

      Here, we actually made an analogy rather than an inference; therefore, we used words such as “suggesting” and “similar” instead of more confirmatory words. We have revised the latter half of the sentence to read “raising the possibility that these sequences have evolved considerably during human evolution”.

      (3) line 210: "Based on a recent study that identified 5,984 genes differentially expressed between human-only and chimpanzee-only iPSC lines (Song et al., 2021), we estimated that the top 20% (4248) genes in chimpanzees may well characterize the human-chimpanzee differences". I do not agree with the rationale for this claim, and do not agree that it supports the cutoff of 0.034 used below. I also find that my previous concerns with the very disparate numbers of results across the three archaics have not been suitably addressed.

      (1) Indeed, “we estimated that the top 20% (4248) genes in chimpanzees may well characterize the human-chimpanzee differences” is an improper claim; we made this mistake due to the flawed use of English.

      (2) What we need is a gene number, which (a) indicates genes that effectively differentiate humans from chimpanzees, (b) can be used to set a DBS sequence distance cutoff. Since this study is the first to systematically examine DBSs in humans and chimpanzees, we must estimate this gene number based on studies that identify differentially expressed genes in humans and chimpanzees. We chose Song et al. 2021 (Song et al. Genetic studies of human–chimpanzee divergence using stem cell fusions. PNAS 2021), which identified 5984 differentially expressed genes, including 4377 genes whose differential expression is due to trans-acting differences between humans and chimpanzees. To the best of our knowledge, this is the only published data on trans-acting differences between humans and chimpanzees, and most HS lncRNAs and their DBSs/targets have trans-acting relationships (see Supplementary Table 2). Based on these numbers, we chose a DBS sequence distance cutoff of 0.034, which corresponds to 4248 genes (the top 20%), slightly fewer than 4377.

      (3) If we chose DBS sequence distance cutoff=0.033 or 0.035, slightly more or fewer genes would be determined, raising the question of whether they would significantly influence the downstream gene set enrichment analysis (Figure 2). We found that 91 genes have a DBS sequence distance of 0.034. Thus, if cutoff=0.035, 4248-91=4157 genes were determined, and the influence on gene set enrichment analysis was very limited.

      (4) On the disparate numbers of results across the three archaics. Figure 1A is based on Figure 2 in Prufer et al. 2017. At first glance, our Figure 1A indicates that Altai Neanderthal is older than Denisovan (upon kya), making our result “identified 1256, 2514, and 134 genes in Altai Neanderthals, Denisovans, and Vindija Neanderthals” unreasonable. However, Prufer et al. (2017) reported that “It has been suggested that Denisovans received gene flow from a hominin lineage that diverged prior to the common ancestor of modern humans, Neandertals, and Denisovans……In agreement with these studies, we find that the Denisovan genome carries fewer derived alleles that are fixed in Africans, and thus tend to be older, than the Altai Neandertal genome”. This note by Prufer et al. provides an explanation for our result, which is that more genes with large DBS sequence distances were identified in Denisovans than in Altai Neanderthals. Of course, the 1256, 2514, and 134 depend on the cutoff of 0.034. If cutoff=0.035, these numbers change slightly, but their relationships remain (i.e., more genes in Denisovans). We examined multiple cutoff values and found that more genes in Denisovans have large DBS sequence distances than in Altai Neanderthals.
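
      The cutoff-sensitivity check described in points (2) to (4) above can be sketched as follows. This is an illustration only, not the authors' code: the gene names and distance values are made up, and the assumption that a gene is selected when its DBS sequence distance is greater than or equal to the cutoff is inferred from the reported arithmetic (4248 - 91 = 4157).

```python
# Minimal sketch with made-up data; `genes_at_or_above` is a hypothetical helper.
def genes_at_or_above(distances, cutoff):
    """Return gene IDs whose DBS sequence distance is >= cutoff (assumed rule)."""
    return sorted(gene for gene, d in distances.items() if d >= cutoff)

# Toy distances, assumed to be reported to three decimal places.
toy_distances = {
    "geneA": 0.051,
    "geneB": 0.034,
    "geneC": 0.034,
    "geneD": 0.036,
    "geneE": 0.012,
}

at_034 = genes_at_or_above(toy_distances, 0.034)
at_035 = genes_at_or_above(toy_distances, 0.035)

# Genes sitting exactly on the 0.034 boundary are the ones lost at 0.035.
on_boundary = [g for g, d in toy_distances.items() if d == 0.034]

print(len(at_034), len(at_035), len(on_boundary))
```

      Raising the cutoff by one step removes exactly the genes on the boundary, which is the same bookkeeping as the 91 genes reported at a distance of 0.034.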

      (4) I also think that there is still too much of a tendency to assume that adaptive evolutionary change is the only driving force behind the observed results in the results. As I've stated before, I do not doubt that lncRNAs contribute in some way to evolutionary divergence between these species, as do other gene regulatory mechanisms; the manuscript leans down on it being the sole, or primary force, however, and that requires much stronger supporting evidence. Examples include, but are not limited to:

      (1) Indeed, the observed results are also caused by other genomic elements and mechanisms (but it is hardly feasible to identify and differentiate them in a single study), and we do not assume that adaptive evolutionary change is the only driving force. Careful revisions have been made to avoid leaving readers the impression that we have this tendency or hold the simple assumption.

      (2) Comparing HS lncRNAs to HS TFs is critical, and we have done this.

      (5) line 230: "These results reveal when and how HS lncRNA-mediated epigenetic regulation influences human evolution." This statement is too speculative.

      We have toned down the statement, just saying “These results provide valuable insights into when and how HS lncRNA-mediated epigenetic regulation impacts human evolution”.

      Line 268: "yet the overall results agree well with features of human evolution." What does this mean? This section is too short and unclear.

      (1) First, the sentence “Selection signals in YRI may be underestimated due to fewer samples and smaller sample sizes (than CEU and CHB), yet the overall results agree well with features of human evolution” has been deleted, because CEU, CHB, and YRI samples are comparable (100, 99, and 97, respectively).

      (2) Now the sentence has been changed to “These results agree well with findings reported in previous studies, including that fewer selection signals are detected in YRI (Sabeti et al., 2007; Voight et al., 2006)”.

      (3) On “This section is too short and unclear” - To make the manuscript more readable, we adopt short sections instead of long ones. This section expresses that (a) our finding that more selection signals were detected in CEU and CHB than in YRI agrees with well-established findings (Voight et al. A Map of Recent Positive Selection in the Human Genome. PLoS Biology 2006; Sabeti et al. Genome-wide detection and characterization of positive selection in human populations. Nature 2007), (b) in a considerable number of DBSs, selection signals were detected by multiple tests.

      Line 325: "and form 198876 HS lncRNA-DBS pairs with target transcripts in all tissues." This has not been shown in this paper - sequence based analyses simply identify the “potential” to form pairs.

      This section describes transcriptomic analysis using the GTEx data. Indeed, target transcripts of HS lncRNAs are results of sequence-based analysis, and a predicted target is not necessarily regulated by the HS lncRNA in a tissue. Here, “pair” means an HS lncRNA-target transcript pair whose expression shows significant Pearson correlation in a GTEx tissue (we do not mean that correlation equals regulation; rather, we identified HS lncRNA-mediated transcriptional regulation based on both the DBS-targeting relationship and the correlation relationship).

      Line 423: "Our analyses of these lncRNAs, DBSs, and target genes, including their evolution and interaction, indicate that HS lncRNAs have greatly promoted human evolution by distinctly rewiring gene expression." I do not agree that this conclusion is supported by the findings presented - this would require significant additional evidence in the form of orthogonal datasets.

      (1) As mentioned above, we have used “reshape” to replace “rewire” and used “suggest” to replace “indicate”. In addition, we have substantially revised the Discussion, in which this sentence is replaced by “our results suggest that HS lncRNAs have greatly reshaped (or even rewired) gene expression in humans”.

      (2) Multiple citations have been added, including Voight et al. 2006 (Voight et al. A Map of Recent Positive Selection in the Human Genome. PLoS Biology 2006) and Sabeti et al. 2007 (Sabeti et al. Genome-wide detection and characterization of positive selection in human populations. Nature 2007).

      (3) We have analyzed HS TF DBSs, and the obtained results also support the critical contribution of HS lncRNAs.

      I also return briefly to some of my comments before, in particular on the confounding effects of gene length and transcript/isoform number. In their rebuttal the authors argued that there was no need to control for this, but this does in fact matter. A gene with 10 transcripts that differ in the 5' end has 10 times as many chances of having a DBS than a gene with only 1 transcript, or a gene with 10 transcripts but a single annotated TSS. When the analyses are then performed at the gene level, without taking into account the number of transcripts, this could introduce a bias towards genes with more annotated isoforms. Similarly, line 246 focuses on genes with "SNP numbers in CEU, CHB, YRI are 5 times larger than the average." Is this controlled for length of the DBS? All else being equal a longer DBS will have more SNPs than a shorter one. It is therefore not surprising that the same genes that were highlighted above as having 'strong' DBS, where strength is impacted by length, show up here too.

      (1) In gene set enrichment analysis (Figure 2, which is a gene-level analysis), when determining genes differentiating humans from chimpanzees based on DBS sequence distance, if a gene has multiple transcripts/DBSs, we choose the DBS with the largest distance. That is, the input to g:Profiler is a non-redundant gene list.

      (2) In GTEx data analysis (Figure 3, which is a transcriptome-level analysis), the analyses of HS TF DBSs using the GTEx data provide evidence suggesting that different DBS/transcript numbers of genes are unlikely to cause confounding effects. As explained above, we predicted HS TF DBSs in the same promoter regions of 179128 Ensembl-annotated transcripts (release 79), but Supplementary Figures 25 and 26 are distinctly different from Figure 3AB.

      (3) In evolutionary analysis, a gene with 10 DBSs has a higher chance of having selection signals than a gene with 1 DBS. This is biologically plausible, because many conserved genes have novel transcripts whose expression is species-, tissue-, or developmental period-specific, and DBSs before these novel transcripts may differ from DBSs before conserved transcripts.

      (4) “line 246 focuses on genes with "SNP numbers in CEU, CHB, YRI are 5 times larger than the average." Is this controlled for the length of the DBS?” - This is a defect. We have now computed SNP numbers per base and used the new table to replace the old Supplementary Table 8. After examining the new table, we find that the major results of SNP analysis remain.

      (5) On “Is this controlled for length of the DBS? All else being equal a longer DBS will have more SNPs than a shorter one” - We do not think there are reasons to control for the length of DBSs; also, what “All else being equal” means matters. First, DBS sequences have specific features; thus, the feature of a long DBS is stronger than the feature of a short one, making a long DBS less likely to be generated by chance in the genome and less likely to be predicted wrongly than a short one. This means that longer DBSs are less likely to be false ones (note our explanation that the chance of a DBS of 147 bp, the mean length of DBSs, to be wrongly predicted is extremely low, p<8.2e-19 to 1.5e-48). Second, the difference in length suggests a difference in binding affinity, which in turn influences the regulation of the specific transcripts and influences the analysis of GTEx data. Third, it cannot be excluded that some SNPs may be selection signals (detecting selection signal is challenging, and many selection signals cannot be detected by statistical tests, see Grossman et al. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science 2010).

      (6) On “It is therefore not surprising that the same genes that were highlighted above as having 'strong' DBS, where strength is impacted by length” - Indeed, strength is influenced by length, see the above response.
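
      Two of the bookkeeping steps in points (1) and (4) above, collapsing multiple DBSs to one value per gene and normalizing SNP counts by DBS length, can be sketched as follows. The function names and the toy records are hypothetical and only illustrate the arithmetic.

```python
# Minimal sketch with made-up records; not the authors' pipeline.
def max_distance_per_gene(dbs_records):
    """Collapse (gene, distance) records to one entry per gene: the largest distance."""
    best = {}
    for gene, distance in dbs_records:
        if gene not in best or distance > best[gene]:
            best[gene] = distance
    return best

def snps_per_base(snp_count, dbs_length_bp):
    """SNP density of a DBS, so that long and short DBSs are comparable."""
    if dbs_length_bp <= 0:
        raise ValueError("DBS length must be positive")
    return snp_count / dbs_length_bp

records = [("gene1", 0.010), ("gene1", 0.040), ("gene2", 0.020)]
per_gene = max_distance_per_gene(records)

# A 300 bp DBS with 6 SNPs and a 150 bp DBS with 3 SNPs have the same density.
long_dbs = snps_per_base(6, 300)
short_dbs = snps_per_base(3, 150)
print(per_gene, long_dbs, short_dbs)
```

      With raw counts the longer DBS would look twice as variable; per-base density removes that length effect, which is the concern the reviewer raised.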

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Finally, figure 1 panels D and F are not legible - the font is tiny! There's also a typo in panel A, where "Homo Sapien" should be "Homo sapiens".

      (1) “Homo sapien” is changed to “Homo sapiens”.

      (2) Even if we double the font size, they are still too small. Inserting a very large panel D into Figure 1 will make Figure 1 ugly, and converting Figure 1D into an independent figure is unnecessary. Actually, panels 1D and F are illustrative figures; the full Fig.1D is Supplementary Figure 6, and the full Fig.1F is Figure 3. We have revised Fig.1’s legend to explain these.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This unique study reports original and extensive behavioral data collected by the authors on 21 living mammal taxa in zoo conditions (primates, tree shrew, rodents, carnivorans, and marsupials) on how descent along a vertical substrate can be done effectively and securely using gait variables. Ten morphological variables reflecting head size and limb proportions are examined in relationship to vertical descent strategies and then applied to reconstruct modes of vertical descent in fossil mammals.

      Strengths:

      This is a broad and data-rich comparative study, which requires a good understanding of the mammal groups being compared and how they are interrelated, the kinematic variables that underlie the locomotion used by the animals during vertical descent, and the morphological variables that are associated with vertical descent styles. Thankfully, the study presents data in a cogent way with clear hypotheses at the beginning, followed by results and a discussion that addresses each of those hypotheses using the relevant behavioral and morphological variables, always keeping in mind the relationships of the mammal groups under investigation. As pointed out in the study, there is a clear phylogenetic signal associated with vertical descent style. Strepsirrhine primates much prefer descending tail first, platyrrhine primates descend sideways when given a choice, whereas all other mammals (with the exception of the raccoon) descend head first. Not surprisingly, all mammals descending a vertical substrate do so in a more deliberate way, by reducing speed, and by keeping the limbs in contact for a longer period (i.e., higher duty factors).
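
      For readers unfamiliar with the term, duty factor is the fraction of a stride during which a limb is in contact with the substrate. A minimal sketch with invented timings (not data from the study):

```python
def duty_factor(stance_duration_s, stride_duration_s):
    """Fraction of the stride cycle during which the limb is in contact."""
    if stride_duration_s <= 0:
        raise ValueError("stride duration must be positive")
    if not 0 <= stance_duration_s <= stride_duration_s:
        raise ValueError("stance duration must lie within the stride")
    return stance_duration_s / stride_duration_s

# Made-up timings: a longer contact phase yields a higher duty factor.
horizontal = duty_factor(0.30, 0.60)
descent = duty_factor(0.60, 0.80)
print(horizontal, descent)
```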

      Weaknesses:

      The different gait patterns used by mammals during vertical descent are a bit more difficult to interpret. It is somewhat paradoxical that asymmetrical gaits such as bounds, half bounds, and gallops are more common during descent since they are associated with higher speeds and lower duty factors. Also, the arguments about the limb support polygons provided by DSDC vs. LSDC gaits apply for horizontal substrates, but perhaps not as much for vertical substrates.

      We analyzed gait patterns using methods commonly found in the literature and discussed our results accordingly. However, the study of limb support polygons was indeed developed specifically for studying locomotion on horizontal supports, and may not be applicable for studying vertical locomotion, which is in fact a type of locomotion shared by all arboreal species. In the future, it would be interesting to consider new methods for analyzing vertical gaits.

      The importance of body mass cannot be overemphasized as it affects all aspects of an animal's biology. In this case, larger mammals with larger heads avoid descending head-first. Variation in trunk/tail and limb proportions also covaries with different vertical descent strategies. For example, a lower intermembral index is associated with tail-first descent. That said, the authors are quick to acknowledge that the five lemur species of their sample are driving this correlation. There is a wide range of intermembral indices among primates, and this simple measure of forelimb over hindlimb has vital functional implications for locomotion: primates with relatively long hindlimbs tend to emphasize leaping, primates with more even limb proportions are typically pronograde quadrupeds, and primates with relatively long forelimbs tend to emphasize suspensory locomotion and brachiation. Equally important is the fact that the intermembral index has been shown to increase with body mass in many primate families as a way to keep functional equivalence for (ascending) climbing behavior (see Jungers, 1985). Therefore, the manner in which a primate descends a vertical substrate may just be a by-product of limb proportions that evolved for different locomotor purposes. Clearly, more vertical descent data within a wider array of primate intermembral indices would clarify these relationships. Similarly, vertical descent data for other primate groups with longer tails, such as arboreal cercopithecoids, and particularly atelines with very long and prehensile tails, should provide more insights into the relationship between longer tail length and tail-first descent observed in the five lemurs. The relatively longer hallux of lemurs correlates with tail-first descent, whereas the more evenly grasping autopods of platyrrhines allow for all four limbs to be used for sideways descent. In that context, the pygmy loris offers a striking contrast. 
Here is a small primate equipped with four pincer-like, highly grasping autopods and a tail reduced to a short stub. Interestingly, this primate is unique within the sample in showing the strongest preference for head-first descent, just like other non-primate mammals. Again, a wider sample of primates should go a long way in clarifying the morphological and behavioral relationships reported in this study.

      We agree with this statement. In the future, we plan to study other species, particularly large-bodied ones with varied intermembral indexes.
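
      The intermembral index discussed above is conventionally computed as forelimb length (humerus plus radius) over hindlimb length (femur plus tibia), multiplied by 100. A minimal sketch, with made-up bone lengths rather than measurements from the study:

```python
def intermembral_index(humerus, radius, femur, tibia):
    """Intermembral index: 100 * (humerus + radius) / (femur + tibia)."""
    hindlimb = femur + tibia
    if hindlimb <= 0:
        raise ValueError("hindlimb length must be positive")
    return 100.0 * (humerus + radius) / hindlimb

# Made-up lengths in mm: equal limbs give 100; relatively long hindlimbs give < 100.
even_limbs = intermembral_index(50, 50, 50, 50)
long_hindlimbs = intermembral_index(40, 40, 50, 50)
print(even_limbs, long_hindlimbs)
```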

      Reconstruction of the ancient lifestyles, including preferred locomotor behaviors, is a formidable task that requires careful documentation of strong form-function relationships from extant species that can be used as analogs to infer behavior in extinct species. The fossil record offers challenges of its own, as complete and undistorted skulls and postcranial skeletons are rare occurrences. When more complete remains are available, the entire evidence should be considered to reconstruct the adaptive profile of a fossil species rather than a single ("magic") trait.

      We completely agree with this, and we would like to emphasize that our intention here was simply to conduct a modest inference test, the purpose of which is to provide food for thought for future studies, and whose results should be considered in light of a comprehensive evolutionary model.

      Reviewer #2 (Public review):

      Summary:

      This paper contains kinematic analyses of a large comparative sample of small to medium-sized arboreal mammals (n = 21 species) traveling on near-vertical arboreal supports of varying diameter. These data are paired with morphological measures from the extant sample to reconstruct potential behaviors in a selection of fossil euarchontoglires. This research is valuable to anyone working in mammal locomotion and primate evolution.

      Strengths:

      The experimental data collection methods align with best research practices in this field and are presented with enough detail to allow for reproducibility of the study as well as comparison with similar datasets. The four predictions in the introduction are well aligned with the design of the study to allow for hypothesis testing. Behaviors are well described and documented, and Figure 1 does an excellent job in conveying the variety of locomotor behaviors observed in this sample. I think the authors took an interesting and unique angle by considering the influence of encephalization quotient on descent and the experience of forward pitch in animals with very large heads.

      Weaknesses:

      The authors acknowledge the challenges that are inherent with working with captive animals in enclosures and how that might influence observed behaviors compared to these species' wild counterparts. The number of individuals per species in this sample is low; however, this is consistent with the majority of experimental papers in this area of research because of the difficulties in attaining larger sample sizes.

      Yes, that is indeed the main cost/benefit trade-off with this type of study. Working with captive animals allows for large comparative studies, but there is a risk of variations in locomotor behavior among individuals in the natural environment, as well as few individuals per species in the dataset. That is why we plan and encourage colleagues to conduct studies in the natural environment to compare with these results. However, this type of study is very time-consuming and requires focusing on a single species at a time, which limits the comparative aspect.

      Figure 2 is difficult to interpret because of the large amount of information it is trying to convey.

      We agree that this figure is dense. One possible solution would be to combine species by phylogenetic groups to reduce the amount of information, as we did with Fig. 3 on the dataset relating to gaits. However, we believe that this would be unfortunate in the case of speed and duty factor because we would have to provide the complete figure in SI anyway, as the species-level information is valuable. We therefore prefer to keep this comprehensive figure here and we will enlarge the data points to improve their visibility, and provide the figure with a sufficiently high resolution to allow zooming in on the details.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 had several remaining suggestions:

      In some instances, the authors face well-known limitations, for example, bath application of drugs. Blockers of glycine and GABA receptors are likely problematic when studying a network that includes a diverse set of inhibitory interneurons. Likewise, application of AMPAR and KAR blockers should impact horizontal cell (HC) function and, presumably, inner retina interneuron networks. In the Discussion the authors are encouraged to address more of these concerns (e.g., Discussion line 709).

      Rather than concluding that the bath application of drugs is without complications, they can conclude that under the experimental conditions, glutamate release from these On-bipolars continues to exhibit Transient and Sustained release. This is really the key point of their study.

      This is a good suggestion. We have added a discussion of the complications of the pharmacology starting on line 754.

      If indeed sustained release is a reflection of higher release rates, ribbon size is what the authors point to, but there are many other possibilities, such as SV recycling, recruitment of reserve pools of SVs, fusion machinery, or Cav channel behavior. The authors could cite more literature in the Discussion.

      We added a sentence to this effect in the discussion, starting on line 866.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      In the retina, parallel processing of cone photoreceptor output under bright light conditions dissects critical features of our visual environment and is fundamental to visual function. Cone photoreceptor signals are sampled by several types of bipolar cells and passed on to the ganglion cells. At the output of retinal processing, retinal ganglion cells send about 40 different codes of the visual scene to the brain for further processing. In this study, the authors focus on whether subtype-specific differences in the size of synaptic ribbon-associated vesicle pools of bipolar cells contribute to different retinal ganglion cell (RGC) responses. Specifically, inputs to ON alpha RGCs producing transient versus sustained kinetics (ON-T vs. ON-S, respectively) are compared. The authors first demonstrate that ON-S vs. ON-T RGCs are readily identifiable in a whole mount preparation and respond differently to both a static and a spatially uniform, randomly fluctuating (Gaussian noise) light stimulus. Linear-nonlinear (LN) models were used to estimate the transformation between visual input and excitatory synaptic input for each RGC; these models suggested the presence of transient versus sustained kinetics already in the excitatory inputs to ON-T and ON-S RGCs. Indeed, the authors show that (glutamatergic) excitatory inputs to ON-S vs. ON-T RGCs are of distinct kinetics. The subtypes of bipolar cells providing input to ON-S are known (i.e., type 6 and 7), but the source of excitatory bipolar inputs to ON-T RGCs needed to be determined. In a painstaking process, it is elegantly shown here that ON-T RGCs receive most of their excitatory inputs from type 5 and 6 bipolars.
Interestingly, the temporal properties of light-evoked responses of type 5, 6, and 7 bipolars recorded from the somas were indistinguishable and rather sustained, suggesting that the origin of transient kinetics of excitatory inputs to ON-T RGCs suggested by the LN model might be found in the processing of visual signals at the bipolar cell axon terminal. Blocking GABA- or glycinergic inhibitory inputs did not alter the light-evoked excitatory input kinetics to ON-T and ON-S RGCs. Two-photon glutamate sensor imaging revealed significantly faster kinetics of light-evoked glutamate signals at ON-T versus ON-S RGCs. Detailed EM analysis of bipolar cell ribbon synapses onto ON-T and ON-S RGCs revealed fewer ribbon-associated vesicles at ON-T synapses, which is consistent with stronger paired-flash depression of light-evoked excitatory currents in ON-T RGCs versus ON-S RGCs. This study suggests that bipolar subtype-specific differences in the size of synaptic ribbon-associated vesicle pools contribute to transient versus sustained kinetics in RGCs.
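
      As background on the linear-nonlinear (LN) framework mentioned in this summary, the sketch below shows how the shape of the linear filter alone can produce a sustained or a transient response to a light step. It is a toy illustration: the filter coefficients, the rectifying nonlinearity, and the step stimulus are all assumptions chosen for clarity, whereas the study estimated its LN models from responses to Gaussian noise.

```python
def ln_response(stimulus, linear_filter, nonlinearity):
    """LN model: causal linear filtering of the stimulus, then a static nonlinearity."""
    response = []
    for t in range(len(stimulus)):
        drive = sum(
            linear_filter[k] * stimulus[t - k]
            for k in range(len(linear_filter))
            if t - k >= 0
        )
        response.append(nonlinearity(drive))
    return response

def rectify(x):
    """Half-wave rectification, a common simple choice of static nonlinearity."""
    return max(0.0, x)

# Light step: dark for 5 time bins, then constant light for 15 bins.
step = [0.0] * 5 + [1.0] * 15

# Made-up filters: monophasic (integrates to 1) versus biphasic (integrates to 0).
sustained_filter = [0.5, 0.3, 0.2]
transient_filter = [0.6, 0.4, -0.5, -0.5]

sustained = ln_response(step, sustained_filter, rectify)
transient = ln_response(step, transient_filter, rectify)
print(sustained[-1], max(transient), transient[-1])
```

      The monophasic filter keeps responding for as long as the light stays on, while the biphasic filter responds at light onset and then returns to baseline, which is the qualitative distinction between sustained and transient kinetics.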

      Strengths: 

      The use of multiple, state-of-the-art tools and approaches to address the kinetics of bipolar to ganglion cell synapse in an identified circuit. 

      Weaknesses: 

      For the most part, the data in the paper support the conclusions, and the authors were careful to try to address questions in multiple ways. Two-photon glutamate sensor imaging experiment showing that blocking GABA- and glycinergic inhibition does not change the kinetics of light-evoked glutamate signals at ON-T RGCs would strengthen the conclusion that bipolar subtype-specific differences in the size of synaptic ribbon-associated vesicle pools contribute to transient versus sustained kinetics in RGCs. 

      Thank you for this suggestion. We have revised the text throughout to be careful not to imply that amacrine cells have no role in shaping EPSCs and spike output, but instead that the transience of the On-T responses persists without amacrine cells (see for example lines 91, 450-453, 514-518, 696-714). We have also added additional iGluSnFR experiments to the paper to further test this conclusion (new Figure 7). The new data show that the transience of glutamate release from the On-T cells is retained when 1) spiking amacrine cell activity is suppressed by blocking voltage-gated Na⁺ channels with TTX or 2) all amacrine cell activity is suppressed by blocking AMPA receptors with NBQX. This provides additional evidence that amacrine cells are not necessary for the sustained/transient distinction.

      Reviewer #2 (Public Review): 

      Summary: 

      Goal of the study. The authors tried to pinpoint the origins of transient and sustained responses measured at retinal ganglion cells (rgcs), which is the output layer of the retina. Response characteristics of rgcs are used to group them into different types. The diversity of rgc types represents the ability of the retina to transform visual inputs into distinct output channels. They find that the physical dimensions of bipolar cell's synaptic ribbons (specialized release sites/active zones) vary across the different types of cone on-bpcs, in ways that they argue could facilitate transient or sustained release. This diversity of release output is what they argue underlies the differences in on-rgcs response characteristics, and ultimately represents a mechanism for creating parallel cone-driven channels. 

      Strengths: 

      The major strengths of the study are the anatomical approaches employed and the use of the "glutamate sniffer" to assay synaptic glutamate levels. The outline of the study is elegant and reflects the strengths of the authors. 

      Weaknesses: 

      The major weakness is that the ambitious outline is not matched with a complete set of results, and the set of physiological protocols is disjointed, not sufficient to bridge the systems-level question with the presynaptic release question. 

      Thank you for this comment as it provides an opportunity (here and in the paper) for us to clarify our main goal. We wanted to link the well-established distinction between transient and sustained retinal responses to anatomy. This required locating where this difference arises within the circuitry – which we show to be at least largely the bipolar output synapse – and then examining the structure of this synapse in detail. While we would certainly be interested in connecting our results to a biophysical description of the synapse, that was not the primary focus of our study and was not something we could add without substantial additional work.  

      Major comments on the results and suggestions. 

      The ribbon model of release has been explored for decades and needs to be further adapted to systems-level work. The study under consideration by Kuo et al. takes on this task. Unfortunately, the experimental design does not permit a level of control over presynaptic/bpc behavior that is comparable to earlier studies, nor do they manipulate release in ways that test the ribbon model (i.e., paired recordings or Ribeye-ko). Furthermore, the data needs additional evaluation, and the presentation and interpretations should draw on published biophysical and molecular studies. 

      As described above, our goal was to test several possible explanations for the difference between transient and sustained responses in OnT and OnS ganglion cells: (1) differences in the light responses of the bipolar cells that convey photoreceptor signals to the relevant ganglion cells; (2) shaping of bipolar transmitter release by presynaptic inhibition; (3) shaping of ganglion cell responses by postsynaptic inhibition or spike generation; (4) differences in feedforward bipolar synapses. We were surprised to find that the feedforward bipolar synapses play a central role in this difference, and your comment nicely prompts us to relate this to the large literature on biophysical studies of release from ribbon synapses. We have made substantial revisions in the text to do this. This includes anticipating the importance of feedforward synaptic properties in the abstract and introduction (lines 36-37 and 61-64), pointers in the results (lines 539-548), and several new paragraphs in the discussion (starting on lines 751, 773 and 787). By showing that the transient/sustained difference originates largely at feedforward bipolar synapses, we set the stage for future work that shows how biophysical properties of the synapse shape physiological signals that traverse it.

      To build a ribbon-centric context, consider recent literature that supports the assertion that ribbons play a role in forming AZ release sites and facilitating exocytosis. Reference Ribeye-ko studies. For example, ribbonless bpcs show an 80% reduction in release (Maxeiner et al EMBO J 2016), the ribbonless retina exhibits signaling deficits at the output layer (Okawa et al ...Rieke, ..Wong Nat Comm 2019), and ribbonless rods show an 80% reduction in the readily releasable pool (RRP) of SVs (Grabner Moser, eLife 2021). In addition, the authors could refer to whole-cell membrane capacitance studies on mammalian rods, cones, and bpcs, because the size of the RRP of SVs scales with the dimensions and numbers of ribbons (total ribbon footprint). For bipolars, see the review by Wan and Heidelberger 2011. For a comparison of mammalian rods and cones, see, rods: Grabner and Moser (2021 eLife), Mueller.. Regus Leidig et al. (2019; J Neurosci) and cones Grabner ...DeVries (Nat Comm 2023). A comparison of cell types shows that the extent of release is (1) proportional to the total size of the ribbon footprint, and (2) less release is witnessed when ribbons are deleted (also see photo ablation studies by Snellman.... And Mehta..Zenisek, Nat Neurosci and Neuron).

      Thank you for these pointers into the literature.  We have included much of this work in the revised Discussion (see three paragraphs starting on line 751). The revised text focuses on the evidence that larger and more numerous ribbons lead to increased release. The direct evidence from previous work for this relationship supports our (indirect) conclusions in the current paper about the role of ribbon size and associated vesicle pools in transient vs sustained responses.  

      Ribbon morphology may change in an activity-dependent manner. The rod ribbon AZ has been reported to lengthen in the dark (Dembla et al 2020), and deletion of the ribbon shortens the length of the AZ (defined by Cav1.4 or RIM2); in addition, the Ribeye-ko AZs fail to change in size with light and dark conditioning. Furthermore, EM studies on rod and cone AZs in light and dark argue that the number of SVs at the base of the ribbon increases in the dark, when PRs are depolarized (see Figure 10, Babai et al 2016 JNeurosci). Lastly, using goldfish Mb1 on-bipolars, Hull et al (2006, J Neurophysio) correlated an increase in release efficiency with an increase in ribbon numbers, which accompanied daylight. >> When release activity is high, ribbon AZ length increases (Dembla, rods), the number of docked SVs increases (Babai, rods cones), and the number of ribbons increases (Hull, diurnal Mb1s). 

      We have extensively revised the discussion section to include more discussion of ribbons, particularly emphasizing evidence supporting the general argument that larger ribbons support higher release rates. We focused on studies that provided direct links between release rates and ribbon size or number of ribbon-associated vesicles.  This includes studies that pair electrophysiology and anatomy and those that measure the consequences of ablating ribbons.

      The results under review, Kuo et al., were attained with SBF-SEM, which has the benefit of addressing large-volume questions as required here, yet it achieves lower spatial resolution than what is attained with TEM tomography and FIB-EM. Ideally, the EM description would include SV size, and the density of ribbon-tethered SVs that are docked at the plasma membrane, because this is where the SVs fuse (additional non-ribbon release sites may also exist? Mehta ... Singer 2014 J Neurosci). Studies by Graydon et al 2011 and 2014 (both in J Neurosci), and Jean ... Moser et al 2018 (eLife) are good examples of quantitative estimates of SV docking sites at ribbons. SBF-SEM does not allow for an assessment of SVs within 5 nm of the PM, but if the authors can identify the number of SVs that appear within the limit of resolution (10 to 15 nm) from the PM, then this data would be useful. Also, what dimension(s) of the large ribbons make them larger? Typically, ribbons are fixed in height (at least in the outer retina, 200 to 250 nm), but their length varies and the number of ribbons per terminal varies. Is the larger ribbon size observed in type 6 bpcs due to longer ribbons, or taller ribbons? A longer ribbon likely has more docked SVs. An additional possibility is that more SVs are about the ribbon-PM footprint, either more densely packed and/or expanding laterally (see definitions in Jean....Moser, eLife 2018). 

      We have included an additional analysis of ribbon surface area from our 3D SBF-SEM reconstructions. As with the volume measurements included in the original submission, ribbon surface areas are distinct between type 5i and type 6 bipolar cells (Fig. S10A), ON-T RGCs on average receive input from ribbons with smaller surface area than ON-S RGCs (Fig. S10B), and ribbon surface area predicts the number of adjacent vesicles across bipolar cell types (Fig. S10C).  We agree that a higher resolution view of presynaptic structures would be very helpful, but the resolution of our SBF-SEM data is limited (e.g. each pixel is 40 nm on a side).  This resolution does not allow us to distinguish between vesicles at the membrane and those near it. 

      In our observations, both the length and the height of the ribbons varied across individual bipolar cells, and ribbons in type 6 bipolar cells tended to be longer and/or taller than those in type 5 cells. We agree that a longer ribbon may accommodate more docked SVs. A more definitive analysis would benefit from higher-resolution, isotropic 3D reconstructions of ribbons, which would allow more precise shape analysis together with a detailed assessment of docked SVs at the ribbons.

      The ribbon literature given above makes the argument that ribbons increase exocytotic output, and morphological studies suggest that release activity enhances 1) ribbon length (Dembla) and 2) the density of SVs near the PM (Babai). These findings could lead one to propose that type 6 bpcs (inputs to On-sustained) are more active than type 5i (feed into On-transient). Here Kuo et al. show that the bpcs have similar Vm (measured from the soma) in response to light stimulation. Does Vm predict release? Not entirely as the authors acknowledge, because: Cav channel properties, SV availability, and negative feedback are all downstream of bpc Vm. The only experiment performed to test downstream factors focused on negative feedback from amacrines. The data presented in Figures 5C-F led me to conclude the opposite of what the authors concluded. My impression is that the T-ON rgc exhibits strong disinhibition when GABA-blockers are applied (the initial phase is greatly increased in amplitude and broadened with the drug), which contrasts with the S-On rgc responses that show a change in the amplitude of the initial phase but not its width (taus would be nice). Here and in many places the authors refer to changes in release kinetics, without implementing a useful description of kinetics. For instance, take the cumulative current (charge) in Figure 5C and fit the control and drug traces to arrive at taus, and their respective amplitudes, and use these values to describe kinetic phases. One final point, the summary in Figure 5D has a p: 0.06, very close to the cutoff for significance, which begs for more than an n = 5. Given that previous studies have shown that bpc output is shaped by immediate msec GABA feedback, in ways that influence kinetic phases of release (..Mb1 bipolars, see Vigh et al 2005 Neuron), more attention to this matter is needed before the authors rule out feedback inhibition in favor of ribbon size. 
      If by chance, type 5i bpcs are under uniquely strong feedback inhibition, then ribbon size may result from less activity, not less output resulting from smaller ribbons.

      The text surrounding Figure 5 led to some confusion, and we have revised that text and the figure for clarity.  First, the data in that figure is entirely from On-T cells (the upper and lower panels show block of GABA and glycine receptors separately).  Second, the observation that we make there is that block of inhibitory receptors increases the transience of the On-T excitatory input, rather than decreasing it as would be expected if the transience is created by presynaptic inhibition. We have added additional data and that increase in transience is now significant. Inhibitory block does substantially increase the amplitude of the postsynaptic response, and a likely origin of this change in response is inhibitory feedback to the bipolar synaptic terminal. We now indicate this in the text on page 13, lines 438-453. 

      The key result of this figure for our purposes here is that the transience of the excitatory input to the On-T cell remains with inhibitory input blocked. We have clarified throughout the text that our results indicate that inhibitory feedback is not necessary for the difference between transient release onto On-T and sustained release onto On-S. This does not mean that inhibitory feedback does not shape the responses in other ways or contribute to the transient/sustained difference - just that for the specific stimuli we use that difference is retained without presynaptic inhibition. We have also added citations to past work showing that activity of amacrine cells can modulate bipolar transmitter release. 

      Whether strong feedback inhibition limits activity and therefore limits ribbon size in an activity-dependent way is an intriguing possibility. Indeed, addressing why ribbons are larger in type 6 bipolar cells vs. other bipolar types will be an interesting avenue of further study. However, it would be surprising if ribbon sizes changed during the acute pharmacological block conditions (~10-15 minutes) we employed in our study. Our point here is that there is an interesting correlation between presynaptic ribbon size and the kinetics of glutamate release. We do not think that the two possibilities stated in the last sentence (“…ribbon size may result from less activity, not less output resulting from smaller ribbons”) are mutually exclusive.

      We have not further quantified the response kinetics in the experiments of Figure 5 as the large changes induced by the pharmacology (especially GABA receptor block) make it unclear how to interpret quantitative differences.  In other places we have quantified kinetics through the STA or specified that our focus was more qualitative (i.e. transient vs sustained kinetics). 

      As mentioned above, the behavior of Cav channels is important here. This is difficult to address with voltage clamps from the soma, especially in the Vm range relevant to this study. Given that it has previously been modeled that the rod bpc to AII pathway adapts to prolonged depolarization of rbcs through downregulating Cav channel-mediated Ca<sup>2+</sup> influx (Grimes ....Rieke 2014 Neuron), it seems important for Kuo et al to test if there is a difference in Cav regulation between type 6 and 5i bpcs. Ca<sup>2+</sup>  imaging with a GCaMP strategy (Baden....Lagnado Current Biology, 2011) or filling the presynapse with Ca dyes (see inner hair cells: Ozcete and Moser, EMBO J 2020) would allow for the correlation of [Ca]intra with GluSnf signals (both local readouts).

      This is a good suggestion but is outside the scope of our current paper. Our focus was on the circuit origin of the difference between the OnT and OnS responses rather than the specific biophysical mechanism.  We are of course interested in the mechanism, but the additional experiments needed to pin that down would need to be part of future work. The work here represents an important step in that direction as it greatly reduces the number of possible locations and mechanisms for the sustained/transient difference and hence serves to focus any future mechanistic investigations.

      Stimulation protocol and presentation of Glutamate Sniffer data in Figure 6. In all of your figures where you state steady state as a % of pk amplitude, please indicate in the figure where you estimate steady state. Alternatively, if you take the cumulative dF/F signal, then you can fit the different kinetic phases. From the appearance of the data, the Sustained Glu signals look like square waves (Figure 6B ROI1-4), without a transient at onset, which is not predicted in your ribbon model that assumes different kinetic phases (1. depletion of docked SVs, and 2. refilling and repriming). The Transient responses (Figure 6B ROI5-8) are transient and more compatible with a depressing ribbon scheme. If you take the cumulative, for all of the On-S and compare it to all of the On-T responses, my guess is the cumulative dF/F will be 10- to 20-fold larger for the S-On. Would you conclude that bpc inputs to On-S (type 6) release 20-fold more SVs per 4 seconds on a per ribbon basis, and does the surface area of the type 6 bpcs account for this difference? From Figures 8B and D, the volume of the ribbon is ~2 fold greater for type 6 vs 5i, but the Surface Area (both faces of ribbon) is more relevant to your model that claims ribbon size is the pivotal factor. If making cumulative traces, and comparisons on an absolute scale is unfounded, then we need to know how to compare different observations. The classic ribbon models always have a conversion factor such as the capacitance of an SV or q size that is used to derive SV numbers from total dCm or Qcontent. See Kim ....et al von Gersdorff, 2023, Cell Reports. Why not use the Gaussian noise stimulus in Fig 6 as in Figures 1 and 2? 

      For iGluSnFR recordings, steady-state responses were measured from the mean fluorescence over the last 1 sec of the light step (2 sec duration) response. We have included this information in the figure caption and in the Methods. 
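
      The steady-state measure described above can be sketched as follows. This is a minimal illustration in plain Python and not the authors' analysis code: the function name, the baseline subtraction from pre-step samples, and the synthetic trace are assumptions introduced here. Only the window definition (mean over the last 1 sec of a 2 sec step) comes from the response.

```python
from statistics import mean


def steady_state_percent_of_peak(trace, dt, step_onset, step_duration=2.0, window=1.0):
    """Return the steady-state response as a percentage of the peak response.

    trace         -- fluorescence samples (e.g. dF/F), one value per time step
    dt            -- sampling interval in seconds
    step_onset    -- time of light-step onset in seconds
    step_duration -- duration of the light step in seconds (2 s in the response)
    window        -- length of the steady-state window at the end of the step (1 s)

    Assumption: the baseline is the mean of all samples before step onset.
    """
    start = int(round(step_onset / dt))
    stop = int(round((step_onset + step_duration) / dt))
    n_win = int(round(window / dt))

    baseline = mean(trace[:start]) if start > 0 else 0.0
    response = [value - baseline for value in trace[start:stop]]

    peak = max(response)
    if peak <= 0:
        raise ValueError("no positive peak during the light step")

    # Steady state: mean over the final `window` seconds of the step.
    steady = mean(response[-n_win:])
    return 100.0 * steady / peak


# Hypothetical trace sampled at 10 Hz: 1 s baseline, a 2 s step with a brief
# peak that relaxes to a plateau at one quarter of the peak, then 0.5 s after.
dt = 0.1
trace = [0.0] * 10 + [1.0] * 5 + [0.25] * 15 + [0.0] * 5
print(steady_state_percent_of_peak(trace, dt, step_onset=1.0))
```

      For this synthetic trace the plateau is one quarter of the peak, so the function reports 25%. A real analysis would also need to handle noise in the peak estimate, for example by smoothing before taking the maximum.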

      There is a good deal of variability in the iGluSnFR responses from one ROI to another, and the ROIs shown in the original submission had a less prominent transient component than many other ROIs. We have replaced this figure with another that is more representative of the average behavior across ROIs. The full range of behavior is captured in Figure 6C; it is clear across ROIs that glutamate release near ON-S dendrites shows both sustained and transient components. The new experiments in which we block amacrine cell activity also include a few more example ROIs from ON-S cells, and those also show both transient and sustained components.

      Your suggestion to integrate the iGluSnFR signals to compare to our structural analysis of ribbons is interesting. However, we are hesitant to make a quantitative comparison between the two without further experiments to validate how the iGluSnFR signals we measure relate to release of single vesicles. For example, a quantitative measure of release based on the iGluSnFR experiments would require accounting for possible differences in the expression of the indicator - which could differ both in overall level and/or location relative to release sites. 

      This comment and one above highlight the importance of measures of ribbon surface area, which we now provide (Figure S10).

      Figure 7. What is the recovery time for mammalian cones derived from ribbon-based models? There are estimates from membrane capacitance studies. Ground squirrel cones take 0.7 to 1 sec to recover the ultrafast, primed pool of SVs when probed with a paired-pulse protocol (Grabner ...DeVries 2016, Neuron). Their off-bpcs take anywhere from under 0.2 sec to a second to recover, which is a combination of many synaptic factors (Grabner ...DeVries Nat Comm 2023). Rod On bpcs take over a second (Singer Diamond 2006, reviewed Wan and Heidelberger 2011). In Figure 7B, the recovery time is ~150 ms for the responses measured at rgcs. This brief recovery time is incompatible with existing ribbon models of release. Whole-cell membrane capacitance measurements would be helpful here.

      Thanks for drawing our attention to this issue. Indeed, we see a relatively rapid recovery in the paired-flash experiments. We now discuss this recovery time in the context of past measurements of recovery of responses in cones and bipolar cells (paragraph starting on line 773). There are many factors that could contribute to the relatively rapid recovery we observe - including synaptic factors such as those highlighted by Grabner et al. (2016) either at the cone-to-bipolar synapses or the bipolar-to-RGC synapses. We are certainly interested in a more detailed understanding of this issue, but the additional experiments are outside the scope of this paper.  

      Experimental Suggestion: Add GABA blockers and see if type 5i bpc responds with more release (GluSniff) and prolonged [Ca2+] intra (GCaMP). Compare this to type 6 bpc behavior with GABA/gly blockers. This will rule in or out whether feedback inhibition is involved. 

      Figure 7 in the revised manuscript includes two new experiments examining glutamate release (without the simultaneous measurement of bipolar cell intracellular calcium) while blocking (1) all/most amacrine cell-mediated inhibition via inclusion of NBQX in the bath solution, and (2) blocking spiking amacrine cells via inclusion of TTX in the bath solution. The transient vs sustained difference in light-evoked glutamate release around ON-T and ON-S RGC dendrites remained with amacrine activity suppressed. These new results are consistent with the anatomical and pharmacological data that were included in the initial submission of the manuscript (Fig. 5) that indicate presynaptic inhibition does not have a major role in shaping release kinetics at these synapses. 

      Reviewer #3 (Public Review): 

      Summary: 

      Different types of retinal ganglion cell (RGC) have different temporal properties - most prominently a distinction between sustained vs. transient responses to contrast. This has been well established in multiple species, including mice. In general, RGCs with dendrites that stratify close to the ganglion cell layer (GCL) are sustained; whereas those that stratify near the middle of the inner plexiform layer (IPL) are transient. This difference in RGC spiking responses aligns with similar differences in excitatory synaptic currents as well as with differences in glutamate release in the respective layers - shown previously and here, with a glutamate sensor (iGluSnFR) expressed in the RGCs of interest. Differences in glutamate release were not explained by differences in the distinct presynaptic bipolar cells' voltage responses, which were quite similar to one another. Rather, the difference in transient vs. sustained responses seems to emerge at the bipolar cell axon terminals in the form of glutamate release. This difference in the temporal pattern of glutamate release was correlated with differences in the size of synaptic ribbons (larger in the bipolar cells with more sustained responses), which also correlated with a greater number of vesicles in the vicinity of the larger ribbons. 

      The main conclusion of the study relates to a correlation (because it is difficult to manipulate ribbon size or vesicle density experimentally): the bipolar cells with increased ribbon size/vesicle number would have a greater possibility of sustained release, which would be reflected in the postsynaptic RGC synaptic currents and RGC firing rates. This model proposes a mechanism for temporal channels that is independent of synaptic inhibition. Indeed, some experiments in the paper suggest that inhibition cannot explain the transient nature of glutamate release onto one of the RGC types. Still, it is surprising that such a diverse set of inhibitory interneurons in the retina would not play some role in diversifying the temporal properties of RGC responses. 

      Strengths: 

      (1) The study uses a systematic approach to evaluating temporal properties of retinal ganglion cell (RGC) spiking outputs, excitatory synaptic inputs, presynaptic voltage responses, and presynaptic glutamate release. The combination of these experiments demonstrates an important step in the conversion from voltage to glutamate release in shaping response dynamics in RGCs. 

      (2) The study uses a combination of electrophysiology, two-photon imaging, and scanning block-face EM to build a quantitative and coherent story about specific retinal circuits and their functional properties. 

      Weaknesses: 

      (1) There were some interesting aspects of the study that were not completely resolved, and resolving some of these issues may go beyond the current study. For example, it was interesting that different extracellular media (Ames medium vs. ACSF) generated different degrees of transient vs. sustained responses in RGCs, but it was unclear how these media might have impacted ion channels at different levels of the circuit that could explain the effects on temporal tuning.

      We do not have an explanation for the quantitative differences in response kinetics we observed in Ames’ medium vs. ACSF. There are modest differences in calcium and magnesium concentration and a larger difference in potassium (2.5 mM in ACSF vs 3.6 mM in Ames). It would be interesting to test which of these (or other) differences accounts for the difference in response kinetics.

      (2) It was surprising that inhibition played such a small role in generating temporal tuning. At the same time, there were some gaps in the investigation of inhibition (e.g., IPSCs were not measured in either of the RGC types; pharmacology was used to investigate responses only in the transient RGCs).

      We were also surprised at this result. We have included additional data on inhibition in the revised manuscript. Figure S3 shows light-evoked IPSC data from both RGC types, and Fig. 7 shows additional iGluSnFR measurements around both ON-T and ON-S RGC dendrites with inhibition blocked via bath application of NBQX and, separately, with inhibition from spiking amacrine cells blocked with TTX. These experiments provide additional evidence for the small role of inhibition. We attempted to measure the kinetics of excitatory input to ON-S cells with inhibition blocked, but we found that the excitatory input showed strong spontaneous oscillations under these conditions and the light responses were changed so drastically that we did not feel we could make a clear comparison with control conditions.

      (3) There could be additional discussion and references to the literature describing several topics, including: temporal dynamics of glutamate release at different levels of the IPL; previous evidence that release sites from a single presynaptic neuron can differ in their temporal properties depending on the postsynaptic target; previous investigations of the role of inhibition in temporal tuning within retinal circuitry. 

      Thanks, we have included more discussion and references to the relevant literature as you have suggested in the recommendations to authors.

      Reviewer #1 (Recommendations For The Authors): 

      The presented raw data of the pharmacological experiments show that SR95531 and TPMPA robustly increased both the amplitude and duration of the transient component of the light step-evoked excitatory currents, with slight, if any enhancement of the sustained component in ON-T RGCs Figure 5C. Statistical analysis of the population data (n=5) with Wilcoxon signed rank test yielded no significant difference (ln 363). However, reanalyzing the data extracted from the graph (Figure 5D) revealed that the difference between the paired observations is normally distributed (Shapiro-Wilk normality test, P=0.48) allowing parametric statistics to be used, which provides higher statistical power. Accordingly, reanalyzing the presented data with paired Student's t-test data revealed significant differences (P=0.01) in the steady-state amplitude normalized to that of the peak, recorded in the presence of SR95531 and TPMPA. In other words, based on the (rough) analysis of the presented pharmacology data GABAergic feedback inhibition significantly contributes to shaping the transient portion of the light-evoked excitatory currents in ON-T RGCs, by making it more transient. I believe a similar analysis based on the actual data is necessary, and the results should be communicated either way. However, if warranted, two-photon glutamate sensor imaging experiments showing that blocking GABA- and glycinergic inhibition does not change the kinetics of light-evoked glutamate signals at ON-T RGCs should also be performed, as these would be critical in drawing a conclusion regarding the effect of feedback inhibition on glutamate release from bipolar cells.

      Thanks for this feedback. We have added another cell to the data set in Fig. 5D. With this addition, SR95531/TPMPA application significantly increases the response transience of excitatory currents measured in ON-T RGCs compared to control. This enhanced transience in GABA<sub>A/C</sub> receptor blockers is due to an increase in the amplitude of the initial peak component of the response (control peak amplitude: -833.7±103.3 pA; SR95531+TPMPA peak amplitude: -2023±372.7 pA; p=0.03, Wilcoxon signed rank test), with no significant change to the later sustained component (control plateau amplitude: -200.7±14.71 pA; SR95531+TPMPA plateau amplitude: -290.9±43.69 pA; p=0.15, Wilcoxon signed rank test).

      We should clarify that this result indicates that GABAergic inhibition makes the excitatory inputs to ON-T RGCs less transient. Block of GABA receptors increased transience, thus intact GABAergic transmission appears to limit the initial peak of the response and therefore make excitatory currents more sustained. We unfortunately were not able to examine whether sustained excitatory currents in ON-S RGCs would become more transient using the same approach. In our hands, bath application of SR95531+TPMPA led to the generation of large-amplitude (>1nA) oscillatory bursts of excitatory input that developed within 5 minutes and persisted for the duration of the incubation (up to ~30 min) in drugs. Further, presentation of light steps tended to induce variable amplitude responses, likely dependent on the presence of spontaneous bursts; when large amplitude responses were evoked, these typically oscillated for several seconds after the step.

      To examine a potential role for presynaptic inhibition in transient vs. sustained bipolar cell output, we therefore chose to eliminate amacrine cell-mediated inhibition by bath application of the AMPA/kainate receptor antagonist NBQX in additional iGluSnFR measurements. This manipulation should leave ON bipolar cell responses intact while eliminating most amacrine cell-mediated responses (and OFF bipolar cell driven responses). In separate experiments, we also eliminated inhibition from spiking amacrine cells by bath application of TTX. As shown in new Fig. 7, sustained and transient responses persisted in distal versus proximal RGC dendrites, respectively. Unlike SR95531/TPMPA, bath application of NBQX was not associated with spontaneous bursts of glutamate release around ON-S dendrites. These results show that amacrine cell-mediated inhibition is not required for either sustained or transient glutamate release from bipolar cells that provide input to ON-S and ON-T RGCs.

      Small points: 

      (1) The legend of Figure 1 (D) refers to shaded areas to show ± SEM, but no shade is visible (at least in my printout).

      The SEM shading is there in Fig. 1D but is mostly obscured by the mean lines for the respective RGC types. We have added this to the figure caption.

      (2) I found the reported Vrest for the ON bipolar cells somewhat depolarized. Perhaps due to the uncompensated junction potentials? 

      These measurements are indeed not corrected for the liquid junction potential (which is approximately -10.8 mV between K-gluconate internal and Ames’ solution). We did not apply this correction since the appropriate value is not clear in perforated patch recordings as the intracellular chloride concentration is unknown (and can differ from that in the pipette solution). We have clarified this in the results text where we describe the Vrest values (lines 335-338).

      (3) It is Wilcoxon signed rank test, not Wilcoxan. 

      Thanks for catching this. This has been corrected in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors): 

      Some amacrines express vesicular Glut-3 transporter and are reported to release glutamate (Marshak, Vis Neurosci 2016). Are Amacrine vGlut3 signals postsynaptic (within ~0.5 um) to cone bpc ribbons?

      We did not characterize VGluT3-expressing amacrine cells in our SEM datasets. A recent study by Friedrichson et al. (Nat. Comm. 2024; PMID 38580652) using 3D SEM reconstructions found that VGluT3 amacrine cells are postsynaptic to both type 5i and type 6 bipolar cells, as well as other type 5/xbc bipolar cells (and receive >50% of their input from type 3a OFF bipolar cells).

      How far apart are the postsynaptic targets from the ribbon release sites? The ribbons at type 5i bpc/On-T input appear separated from the dendrites of On-T rgcs (Figure 8C). At least further away than the type 6 bpc ribbons are from On-S rgc dendrites (Figure 8C). Distance may create a thresholding phenomenon, whereby only multivesicular bouts at the onset of depolarization are able to elevate synaptic Glu to levels needed to activate On-T GluRs. See Grabner et al Nat Comm 2023 for such scenarios in the outer retina.

      This is an intriguing possibility, but we should point out that the presynaptic ribbons in Fig. 9C (former Fig. 8C) are similar distances (within the resolution of our reconstructions) from the ON-T and ON-S dendrites. We have increased the brightness of the dendrite segments for both RGC types in the resubmission figure; note that ON-T RGCs have spine-like protrusions that may not have been as apparent in the previously submitted version of our manuscript.

      In Figures 1 and 2, Sustained responses look like the derivative of Transient responses, minus the negative going inflection. In addition, the sustained responses appear to have a lower threshold of activation than the transient On rgcs, because there are more bouts of action potentials (and membrane depol in V-clamp) with earlier onset in sustained than transients traces. It would be great if the GLuSniff data captured these differences. Take cumulative dF/F and see what the onset time is, or an initial tau if possible.

      This is a good suggestion. However, we are reluctant to make detailed quantitative comparisons such as this without further validation of how the kinetics of the iGluSnFR signals relate to kinetics of glutamate release.  A specific concern is that differences in the location and amount of iGluSnFR expression could impact any such comparisons.

      A recent study by Kim et al von Gersdorff (Cell Reports, 2023) presents interesting phases of release in response to light flashes, measured from AIIs, and complementary results from pairs of rbcs-AIIs. The findings highlight the complexity of SV pools under well-controlled experiments. Could their results be explained as variations in rbc ribbon size through development, and possibly between rbcs or within an rbc? 

      This certainly seems possible and would be consistent with the dependence of release on ribbon size that our results support.  It would be interesting to see if there are clear anatomical correlates of that change in release properties.  

      Figure 5 is a pivotal point in the study, but my review has identified numerous weaknesses. The feedback inhibition onto bipolar cell terminals is likely to sculpt glutamate release, and the results do not convincingly rule out this possibility. The suggestions for improvements range from the data needing to be reanalyzed with regard to statistical tests, and/or adding a few more data points (n = 5) before concluding a p: 0.06 is insignificant. 

      We have added an additional recording to this data set. With n = 6 cells, there is now a statistically significant difference between ON-T RGC excitatory currents measured in control conditions versus during GABA<sub>A/C</sub> receptor blockade. Please note that all the recordings shown in Figure 5C-F are from ON-T RGCs (the two panels show block of GABAergic and glycinergic receptors separately). We did not make it sufficiently clear that the original trend (now statistically significant) is opposite to that expected if presynaptic GABAergic inhibition contributes to response transience in ON-T RGCs. What we see is that excitatory synaptic inputs to ON-T RGCs become more transient (rather than more sustained) during GABA<sub>A/C</sub> receptor blockade. We have revised the text in that section to make this point more clearly.

      We have also included new data from iGluSnFR measurements showing that bath application of NBQX does not affect light step-evoked glutamate release kinetics at proximal (sustained) or distal (transient) RGC dendrites (control: steady-state amp. as % of peak amp. 13 ± 10; mean ± S.D.; n = 189 ROIs/4 FOVs for ON-T dendrites vs 40 ± 12; mean ± S.D.; n = 287 ROIs/8 FOVs for ON-S dendrites; NBQX: 6 ± 3; mean ± S.D.; n = 112 ROIs/1 FOV for ON-T dendrites vs 23 ± 9; mean ± S.D.; n = 97 ROIs/2 FOVs for ON-S dendrites; *p<0.001). By blocking glutamate receptors on amacrine cells, NBQX (AMPA/KAR antagonist) eliminates all/most amacrine cell-mediated signaling in the retina and should therefore abolish presynaptic inhibitory input to bipolar cell terminals across the IPL. Taken together, our results indicate that presynaptic inhibition does not play a critical role in establishing transient versus sustained kinetics for the stimulus conditions we employed in our study.
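
      The metric quoted above (steady-state amplitude as a percentage of peak amplitude) can be illustrated with a minimal sketch. The function name, the baseline and steady-state window lengths, and the two example traces are assumptions made for illustration only; they are not the analysis code or the data from the study.

```python
from statistics import mean

def steady_state_percent_of_peak(trace, baseline_n, steady_n):
    """Steady-state amplitude as a percentage of peak amplitude.

    trace      : dF/F samples covering a pre-stimulus baseline and the light step
    baseline_n : number of leading samples treated as baseline (assumed window)
    steady_n   : number of trailing samples treated as steady state (assumed window)
    """
    baseline = mean(trace[:baseline_n])
    # Baseline-subtracted response during the light step.
    response = [x - baseline for x in trace[baseline_n:]]
    peak = max(response)
    if peak <= 0:
        raise ValueError("no positive response above baseline")
    steady = mean(response[-steady_n:])
    return 100.0 * steady / peak

# Hypothetical traces (not recorded data): one transient, one sustained.
transient = [0.0, 0.0, 1.0, 0.6, 0.3, 0.1, 0.1, 0.1]
sustained = [0.0, 0.0, 1.0, 0.8, 0.5, 0.4, 0.4, 0.4]

print(f"transient: {steady_state_percent_of_peak(transient, 2, 3):.0f}% of peak")
print(f"sustained: {steady_state_percent_of_peak(sustained, 2, 3):.0f}% of peak")
```

      A lower percentage indicates a more transient response. The windows used to define baseline and steady state would need to match the actual stimulus timing.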

      There is a need to cite more recent literature on bipolar cell ribbons (e.g. see Wakeham et al., Front. Cell. Neurosci., 2023), in order to support experimental design and interpretation of the results. The authors should discuss their Ribeye-KO data from Okawa et al 2019 Nat Comm, Figure 7, in the context of their new iGluSnFR results. 

      Thank you for prompting us on this issue. We have expanded the discussion regarding ribbons and included more citations to the ribbon literature. That is largely in the three paragraphs starting on line 727.

      One point deserves emphasis because it is central to the authors' ribbon model but not consistent with their data. The ribbon model as they put it, and as commonly stated, holds that a transient phase of release at the onset of depolarization indicates the depletion of the primed SVs, and the subsequent slower rate of release (steady state release in the authors' terms) reflects recruiting, priming, and release of new SVs. The On-transient dendrite GluSnf responses agree with this multiphasic process, but the sustained responses show only an elevation in glutamate without a pronounced initial peak, creating a square-wave-shaped response (Figure 6B). This does not agree with the simple ribbon-based release model. I would expect the signals from the T- and S-on dendrites to have a comparable initial phase, while the sustained phase should be greater in amplitude for the S-on dendrites. More discussion may clarify possible mechanisms.

      Thanks for pointing this out. The example iGluSnFR traces we originally included in the manuscript were not entirely representative in that they did not show much initial transient phase. Note there is a distribution of steady-state amplitudes for proximal dendrites in Fig. 6C; the examples are from ROIs from the upper end of the distribution. In the new Figure 7, we have included some additional examples that show both a clear transient and sustained component. The summary data in Figure 6C shows the distribution of sustained/transient ratios across ROIs.  

      Reviewer #3 (Recommendations For The Authors): 

      (1) It would be interesting to understand the differences in IPSCs in the two RGC types. Perhaps they are small in both types, which would explain their apparent lack of impact on temporal tuning. The authors may already have these data.

      We did make measurements of noise-evoked IPSCs (as well as EPSCs) in a subset of ON-T and ON-S recordings. We have now included these data as Figure S3. There are slight differences in the kinetics of inhibition between RGC types (Fig. S3C) and there is a trend towards stronger inhibition (relative to excitation) in ON-T RGCs compared to ON-S RGCs (Fig. S3E), although the difference is not statistically significant. In both cases excitatory synaptic currents are as large as or larger than inhibitory currents, and this does not include the difference in driving force near spike threshold, which will favor excitatory input by a factor of 2-3. Hence our data suggest that postsynaptic inhibition does not play a major role in generating the differential temporal spiking responses of ON-T and ON-S RGCs. However, additional experiments examining the relative contribution of excitation and inhibition to spiking output in these RGCs would be needed to reach a firm conclusion.
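
      The driving-force argument can be made concrete with a short worked example. The membrane potential and reversal potentials below are illustrative assumptions (a spike threshold near -50 mV, an excitatory reversal near 0 mV, and an inhibitory reversal near -70 mV), not values measured in the study.

```python
def driving_force(v_m, e_rev):
    """Magnitude of the driving force (mV) on a synaptic conductance."""
    return abs(v_m - e_rev)

# Assumed, illustrative values in mV (not taken from the recordings).
V_THRESHOLD = -50.0  # membrane potential near spike threshold
E_EXC = 0.0          # reversal potential of excitatory input
E_INH = -70.0        # reversal potential of inhibitory input

ratio = driving_force(V_THRESHOLD, E_EXC) / driving_force(V_THRESHOLD, E_INH)
print(f"excitatory/inhibitory driving force ratio: {ratio:.1f}")  # 50 / 20 = 2.5
```

      Under these assumptions, equal excitatory and inhibitory conductances would produce an excitatory current about 2.5 times larger near threshold, which falls within the factor of 2-3 mentioned above.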

      The pharmacological experiments in which we blocked inhibition (Fig. 5C-F, new Fig. 7) were designed to test the effect of presynaptic inhibition on bipolar cell output (voltage-clamp isolation of excitatory currents in Fig. 5; iGluSnFR measurements of glutamate release in Fig. 7). We do not mean to suggest that postsynaptic inhibition does not have any role in shaping the spiking behavior of these RGC types, but that transient vs. sustained kinetics are already present in the bipolar cell output and that presynaptic inhibition of bipolar cell terminals does not appear to account for this difference.  We have revised the text throughout to be clearer on this point.

      (2) It could be convincing to show transient/sustained differences between RGC types in dim light, where the response would depend on the rod bipolar/AII circuit. In this case, any difference in temporal properties would presumably be explained by differences that localize to the cone bipolar cell axon terminals. Indeed, is that the result in Figure 1B? This seems to be a dim stimulus presented on darkness, which may be driven through the rod bipolar pathway. The authors could then discuss the interpretation of this data in terms of the rod bipolar circuit. 

      Yes, Figure 1B is a dim light step (~30 R*/rod/s) presented from darkness, and the distinction between cells remains clear at still lower light levels that more effectively isolate signaling through the rod bipolar pathway. Thank you for making the point that distinct temporal responses under scotopic conditions, where signals are carried by the rod bipolar pathway, suggest that these differences must arise at and/or downstream of cone bipolar cell output. We have included additional text (lines 361-365) that raises this point in the part of the Results describing bipolar cell responses.

      (3) Glutamate release was already measured across the full IPL depth by Borghuis et al. (2013) and Franke et al. (2017). It would be appropriate to better motivate the current study based on these existing measurements.

      We have clarified that these studies provided important motivation for measuring excitatory synaptic input to ON-T vs. ON-S RGCs (lines 165-169).

      (4) Line 212/213. It would be appropriate to add to the list of papers showing the different stratification of transient vs. sustained responses: Borghuis et al. (2013) and Beaudoin et al. (2019).

      Thank you - these references have been added.  

      (5) Line 635-638. It would be useful to discuss papers by Pottackal et al. (2020, 2021), which suggested that a single presynaptic cell (starburst) can signal with different temporal properties depending on the postsynaptic target (other starburst vs. DSGCs). The mechanism was not completely resolved (i.e., it was not explained by differences in presynaptic Ca channels at the two synapse types), but it at least shows that neurotransmitter release can show different filtering depending on the postsynaptic target from the same presynaptic neuron. (This could also be at play for the type 6 bipolar cell inputs to ON-S vs. ON-T RGCs in the present study.)

      We have added a reference to Pottackal et al 2021 in this section.

      (6) Line 714. Should describe the procedure for embedding the tissue in agarose. 

      We have added more detail regarding agarose embedding for preparation of retinal slices in the methods.

      (7) Line 775. Need a better description of the virus (not the construct), what serotype? Provide the Addgene number if available. 

      This has been added to the methods.

      (8) Line 808. Was the SD for the gaussian really 50%? That would cut off a lot of the distribution, i.e., it would get clipped at 0. 

      Yes, the SD for Gaussian noise was 50%. This high contrast stimulus was used in part to achieve measurable signals from bipolar cells. You are correct that some of the distribution was clipped at 0 (it was also clipped at twice the mean to make sure that the distribution remained symmetrical). The clipping was accounted for during our LN analyses.
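
      The clipping described above can be sketched as follows. This is a minimal illustration, assuming a normalized mean intensity of 1.0 and using a hypothetical function name; it is not the stimulus-generation code used in the experiments.

```python
import random
from statistics import NormalDist

def clipped_gaussian_noise(mean_intensity, contrast_sd, n_samples, seed=0):
    """Gaussian noise with SD = contrast_sd * mean, clipped to [0, 2 * mean]."""
    rng = random.Random(seed)
    sd = contrast_sd * mean_intensity
    upper = 2.0 * mean_intensity  # symmetric upper clip keeps the distribution symmetric
    return [min(upper, max(0.0, rng.gauss(mean_intensity, sd)))
            for _ in range(n_samples)]

stimulus = clipped_gaussian_noise(mean_intensity=1.0, contrast_sd=0.5, n_samples=10000)
clipped_fraction = sum(s in (0.0, 2.0) for s in stimulus) / len(stimulus)

# With SD = 50% of the mean, both clip points sit 2 SD from the mean,
# so the expected clipped fraction is 2 * Phi(-2), roughly 4.6%.
expected_fraction = 2.0 * NormalDist().cdf(-2.0)

print(f"clipped fraction (sample):   {clipped_fraction:.3f}")
print(f"clipped fraction (expected): {expected_fraction:.3f}")
```

      Because the clip is symmetric about the mean, the mean intensity is preserved, while the effective SD is slightly below the nominal 50%.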

      (9) The paper should discuss Swygart et al. (2024) results showing different spatial surround properties of neighboring synapses from a type 6 bipolar cell. Based on this result, it would seem very likely that amacrine cells could play a role in shaping the temporal processing of bipolar cell glutamate release as well. Indeed, spatial and temporal processing will not be completely independent in a typical experiment. For example, with the spot stimulus used in the present study, bipolar cells within the center versus the edge of the spot will have different balances of center/surround activation, which could potentially influence their temporal processing.

      We have included discussion of results from Swygart et al 2024 in the section of the Discussion in which we point out differences in surround inhibition between ON-S and ON-T RGCs (lines 710-714). We agree that spatial and temporal processing are not completely independent. Our results with SR95531/TPMPA indicate ON-T RGCs receive stronger GABAergic surround inhibition than ON-S RGCs (Fig. S8). However, our results in Fig. 5C-D show GABAergic surround inhibition makes ON-T excitation more sustained rather than more transient. So even though bipolar cells presynaptic to ON-T RGCs receive stronger surround inhibition (Fig. S8), this inhibition does not establish the transient kinetics of glutamate release from these bipolar cells (in fact, it works to make release more sustained). Additional iGluSnFR experiments where we used NBQX to block all/most amacrine cell-mediated responses also suggest presynaptic inhibition does not have an important role in establishing differential glutamate release kinetics onto ON-S vs. ON-T RGC dendrites (Fig. 7).

      (10) Cui et al. 2016 described ON-S Alpha as having a divisive suppression mechanism that explained the temporal properties of white-noise response better than a standard LN model. Do the authors think the divisive suppression reflects a property of the excitatory synapses independent of inhibition?

      This is an interesting question, but one for which we don’t have a good answer for now. As mentioned in some of the above responses and as we have tried to clarify in the manuscript, we do not mean to imply that there is no role for presynaptic inhibition in modulating bipolar cell output, including for the divisive suppression described by Cui et al. Rather, our point is that the distinction between transient and sustained excitatory input to ON-T and ON-S RGCs does not require presynaptic inhibition and is more likely an intrinsic property of the bipolar cell synapses. 

      (11) Do the authors mean to imply that the pool size at bipolar cell ribbon synapses could depend on the use of Ames vs. ACSF? 

      For now, we do not have a good answer as to why there are quantitative differences in response kinetics between Ames and ACSF. We have not done any experiments to investigate whether ribbon sizes or ribbon pools are different in the different solutions.

      (12) More generally, different mean luminance levels could drive different levels of baseline glutamate release, which could alter the available pool of vesicles at bipolar cell ribbon synapses. Can we explain varying degrees of transient/sustained in the same cell at different levels of mean luminance based on this mechanism (e.g., Grimes et al., 2014)?

      Yes, the emergence of a transient component of excitatory input to ON-S RGCs at ~100 R*/rod/s versus at scotopic levels (0.5 R*/rod/s) in Grimes et al. (2014) could be due to differences in the number of releasable vesicles (due to different type 6 bipolar cell axon terminal membrane potentials and hence differences in spontaneous release rates) at the different light levels.

      We should note that although ON-T and ON-S RGCs exhibit some changes in transient/sustained kinetics across different light levels, the relative differences between these RGC types are preserved across light levels. We have included a statement about this in the text (lines 361-367).

      (13) Figure 1. Have the authors considered performing the LN analysis of the firing responses, to compare the degree of rectification between the two RGC types?

      This is a good suggestion. From an LN analysis of spiking responses, we do not observe a clear difference between the static nonlinearity component of the model for ON-T and ON-S RGCs. Both RGC types are strongly rectified under our experimental conditions.
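
      For readers unfamiliar with this type of analysis, the following is a minimal sketch of a generic linear-nonlinear (LN) estimate from a white-noise stimulus: a linear filter obtained by reverse correlation, followed by a static nonlinearity obtained by binning the filtered stimulus. The function names, the simulated cell, and its filter are hypothetical, and this is not necessarily the procedure used in the study.

```python
import random

def estimate_linear_filter(stimulus, response, n_lags):
    """Response-weighted average of the recent stimulus (reverse correlation).

    For Gaussian white-noise input this is proportional to the linear filter.
    filt[lag] weights the stimulus sample `lag` bins before the response.
    """
    filt = [0.0] * n_lags
    total = 0.0
    for t in range(n_lags - 1, len(stimulus)):
        total += response[t]
        for lag in range(n_lags):
            filt[lag] += response[t] * stimulus[t - lag]
    return [f / total for f in filt]

def generator_signal(stimulus, filt):
    """Filtered stimulus, starting at the first bin with a full stimulus history."""
    n_lags = len(filt)
    return [sum(filt[lag] * stimulus[t - lag] for lag in range(n_lags))
            for t in range(n_lags - 1, len(stimulus))]

def static_nonlinearity(gen, response, n_bins):
    """Mean response in equal-count bins of the generator signal."""
    pairs = sorted(zip(gen, response))
    size = len(pairs) // n_bins
    curve = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        curve.append((sum(g for g, _ in chunk) / size,
                      sum(r for _, r in chunk) / size))
    return curve

# Simulated, fully rectified cell (hypothetical filter; not recorded data).
rng = random.Random(1)
true_filter = [0.0, 1.0, 0.5, -0.3]
stimulus = [rng.gauss(0.0, 1.0) for _ in range(5000)]
response = [0.0] * (len(true_filter) - 1) + [
    max(0.0, g) for g in generator_signal(stimulus, true_filter)]

est_filter = estimate_linear_filter(stimulus, response, n_lags=4)
gen = generator_signal(stimulus, est_filter)
curve = static_nonlinearity(gen, response[3:], n_bins=10)

for g, r in curve:
    print(f"generator signal {g:6.2f} -> mean response {r:5.2f}")
```

      In this representation, a strongly rectified cell shows a mean response near zero for negative generator-signal bins and a rising response for positive bins; comparing the shape of this curve between cell types is one way to compare their degree of rectification.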

      (14) Figure 5. Do the authors have the pharmacology data for the ON-S cells? There are examples of sustained EPSCs in amacrine cells that become more transient after blocking inhibition, which at least suggests that inhibition can play some role in the transient/sustained nature of glutamate release (Park et al., 2015, Figure 3). Perhaps ON-S cells likewise become more transient with inhibition blocked. 

      (The colored symbols in A were not visible in a printout. It would be useful to indicate the cell type (ON-T) in C and E). 

      As described above in the response to reviewer 1’s recommendation for authors, we were not able to use SR95531/TPMPA for recordings from ON-S RGCs. Bath application of these drugs led to oscillatory bursts of excitatory input to ON-S RGCs. However, the lack of effect of bath-applied NBQX on the kinetics of glutamate release around either ON-T or ON-S RGC dendrites (new Fig. 7) suggests that presynaptic inhibition does not contribute to generating sustained excitation to ON-S RGCs (or transient excitation to ON-T RGCs).  

      We have corrected Fig. 5A to include the referenced colored symbols and have also edited Fig 5C and E to clarify that measurements in Fig. 5C-F are from ON-T RGCs.

      (15) Figure 6 legend. Should be Kcng4-Cre, not KCNG-Cre. Also, it should make clear that this is cre-dependent expression of iGluSnFR. For C, were the statistics based on the number of FOVs? 

      Thanks for catching this; we have corrected the Figure 6 legend. The methods section includes a description of how we achieved iGluSnFR expression on alpha RGC dendrites via a cre-dependent viral strategy in Kcng4-Cre mice. We have also clarified that the statistics in Figure 6C are based on ROIs.

      (16) Figure 7, Flashes were apparently 400% contrast on a dim background. What was the background? Is there a rod component to the response in this case? 

      In Figure 7 (now Figure 8), the same background (~3300 R*/rod/s; 2000 P*/Scone/s) was used as in the Gaussian noise and step response experiments. At this light level, the response should be primarily mediated by cones.

      (17) Figure S1. The colors here differ from those in previous figures (Here, ON-T, magenta; ON-S, cyan). Is something mislabeled? 

      Thanks for catching this. We mistakenly swapped the labels in the legend for Fig. S1. The figure colors were correct, but we have corrected the legend in the revised manuscript.

      (18) Figure S2. For the LN model for RGC synaptic currents, the ON-S are more rectified than some previous recordings (Cui et al., 2016). Is this perhaps explained by different light levels?

      We aren’t sure why ON-S excitatory currents are more strongly rectified in our recordings compared to Cui et al., 2016. Cui et al. used an ~20-fold higher background light intensity (~40,000 P*/cone/s vs. ~2000 P*/cone/s in our study), so different light levels may be a factor, although we should point out that rectification increases in these RGCs from scotopic to low photopic light levels (see Grimes et al., 2014 and Kuo et al., 2016).

      (19) The study is apparently comparing PV1 and PV2 described in Farrow et al. (2013; see Supplementary information for stratification analysis), which should be cited.

      Thanks, we have corrected this oversight in the revised manuscript. We now cite Farrow et al and mention the connection to PV1 and PV2 in the first paragraph of Results (lines 104-108).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1

      Major comments:

      (comment #1)- It is interesting that TRF2 loss not only fails to increase γH2AX/53BP1 levels but may even slightly reduce them (e.g., Fig. S2c and the IF images). While the main hypothesis is that TRF2 loss does not trigger telomere dysfunction in NSCs, this observation raises the possibility that TRF2 itself contributes to DDR signaling (ATM-P, γH2AX, 53BP1) in these cells and that in its absence, cells are not able to form those foci. To exclude the possibility that telomere-specific DDR is being missed due to an overall dampened DDR response in the absence of TRF2, it would be informative to induce exogenous DSBs in TRF2-depleted cells and test DDR competence (e.g., IF for γH2AX/53BP1). In other words, are those NSC lacking TRF2 even able to form H2AX/53BP1 foci when damaged? In addition, it would be interesting to perform telomere fusion analysis in TRF2 silenced cells (and TRF1 silenced cells as a positive control).

      We acknowledge a slight reduction; however, this difference is not statistically significant (Fig S2c,e). We will quantify the levels of DDR markers upon TRF2 loss and exogenous DSBs and include it in the subsequent revision.

      (comment #2)-A TRF2 ChIP-seq should be performed in NSC as this list of genes (named TAN genes in the text) was determined using a ChIP performed in another cell line (HT1080). For the ChIP-qPCR in the various conditions, primers for negative control regions should be included to show the specific binding of TRF2 to the promoter of the genes associated with neuronal differentiation. For example, an intergenic region and/or promoters of genes that are not associated with neuronal differentiation (or don't contain a potential G4). The same comment goes true for the gene expression analysis: a few genes that are not bound by TRF2 should be included as negative controls to exclude a potential global effect of TRF2 loss on gene expression (ideally a RNA-seq would be performed instead).

      We have performed NSC-specific TRF2 ChIP-seq for an upcoming manuscript, which confirms TRF2 occupancy at multiple promoters of differentiation-associated genes. These data are provided solely for confidential evaluation by the designated reviewers.

      Regarding the ChIP-qPCR control experiments: We thank the reviewer for pointing this out. We did include controls in our ChIP-PCR assays: a telomeric locus as a positive control and TRF2-nonbinding loci (GAPDH, RPS18, and ACTB, based on HT1080 TRF2 ChIP-seq data) as negative controls. These results were omitted earlier for clarity, given that we were presenting several ChIP-PCR figures; in response to the comment, we have now included them in the revised version (Fig. S3d,e). Gene expression analyses show selective upregulation of the TAN genes upon TRF2 loss (data normalised to GAPDH), whereas negative control genes lacking TRF2 binding (RPS18, ACTB) remain unchanged, ruling out non-specific effects (Fig. S3f,g,j,k).
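
      As an illustration of what normalisation to a reference gene such as GAPDH involves, the sketch below uses the common 2^(-ΔΔCt) calculation. This is an assumption about the general approach, not a statement of the exact method used in the manuscript; the function name and all Ct values are hypothetical, and the formula assumes roughly 100% amplification efficiency.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method.

    Each target Ct is first normalised to the reference gene (dCt),
    then the treated sample is compared with the control sample (ddCt).
    """
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)

# Hypothetical Ct values (not data from the study); reference gene Ct = 18 in both samples.
tan_gene_fold = relative_expression(24.0, 18.0, 26.0, 18.0)      # target Ct drops by 2 cycles
negative_ctrl_fold = relative_expression(20.0, 18.0, 20.0, 18.0)  # target Ct unchanged

print(f"TAN gene fold change:         {tan_gene_fold:.1f}")   # 4.0
print(f"negative control fold change: {negative_ctrl_fold:.1f}")  # 1.0
```

      A selective effect would appear as a fold change well above 1 for the gene of interest while genes lacking TRF2 binding stay close to 1.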

      -(comment #3) A co-IP should be performed between the TRF2 PTM mutant K176R or WT TRF2 and REST and PRC2 components to directly show a defect of interaction between them when TRF2 is mutated (a co-IP with DNase/RNase treatment to exclude nucleic-acid bridging). The TRF2 PTM mutant T188N also seems to lead to an increased differentiation (Fig. S5a). Could the author repeat the measure of gene expression and co-IP with REST upon the overexpression of this mutant too?

      We confirm that DNase/RNase is routinely included in our pull-down experiments to exclude nucleic-acid bridging; the detailed methodology is now described in the Methods section. Omitting this from the Methods was an oversight on our part. Our data demonstrate that only REST directly interacts with TRF2, while TRF2 engages PRC2 indirectly via REST, as also previously shown by us and others (page 6; ref. [62]; Sharma et al., ref. [15]).

      We thank the reviewer for noting the apparent differentiation in Fig. S5a. However, this observation represents a rare spontaneous differentiation event and is not statistically significant (as shown in Fig. S5b). Consistently, gene expression analysis of the TRF2-T188N mutant shows no significant change in TRF2-associated neuronal differentiation (TAN) genes. Therefore, co-IP of TRF2-T188N with REST was not performed.

      (comment #4) - The authors show that the G4 ligands SMH14.6 and Bis-indole carboxamide upregulate TAN genes and promote neuronal differentiation, but the underlying mechanism remains unclear. Bis-indole carboxamide is generally considered a G4 stabilizer, while SMH14.6 is less characterized and should be better introduced. The authors should clarify how G4 stabilization would interfere with TRF2 binding, it seems that it would likely be by blocking access. A more detailed discussion, and ideally TRF2 ChIP after ligand treatment and/or G4 helicase treatment, would strengthen the model.

      We clarify that Bis-indole carboxamide acts as a G4 stabilizer, while SMH14.6 is also a noted G4-binding ligand that stabilizes G4s (ref. [15]). The exclusion of TRF2 from G4 motifs in gene promoters by G4-binding ligands has also been documented previously (ref. [18]). In line with these findings, ChIP experiments performed following ligand treatment revealed a decreased occupancy of TRF2 at TAN gene promoters, supporting the proposed mechanism (added Fig. 6h).

      Minor comments:

      • Supp Figures related to the scRNA-seq are difficult to read (blurry).

      Corrected

      • Fig S1h: The red box mentioned in the legend is not visible

      Corrected

      • In the text, the Figures 1 f-g are misannotated as Fig 1m and l

      Corrected

      • The symbol γ of γH2AX is missing in the text

      Corrected

      • Fig.3d, please indicate in the legend that it is done in SH-SY5Y.

      Added SH-SY5Y in the legend of Fig. 3d.

      • Fig. S3b: Please consider replotting this panel with an increased y-axis scale. As currently presented, the TRF2 ChIP-seq peaks at several promoters appear truncated by the scaling.

      Corrected

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      1. For most of the data graphs in the manuscript, there is no indication of the number of independent biological replicates carried out (which should ideally be plotted as individual dots overlaying the column graphs), or what the error bars represent, or what statistical test was used.

      All the figure legends and methods have now been updated with the corresponding biological replicates per experiment, with error bars as SD/SEM and the corresponding statistical test along with p values.

      Figure S1.1a: needs a marker to show that the tissue is dentate gyrus.

      We acknowledge the reviewers' concern that high-magnification images alone make it difficult to verify whether the fields are taken from the correct anatomical location. The dentate gyrus (DG) of the hippocampus is a well-defined structure. In the revised figure (Fig S1.1a), we now include a low-magnification image showing the entire hippocampus, including the CA fields, along with two high-magnification fields specifically from the DG region. Consistent with our claim, the co-immunostaining demonstrates that Sox2-positive neural stem cells in the DG are also positive for TRF2.

      Figure 1c (and all other flow cytometry panels throughout the manuscript): it is not clear if the expression of any of these proteins, except maybe MAP2, are significantly different in the presence or absence of TRF2. These differences need to be presented more quantitatively, with the results compiled from multiple biological replicates and analysed statistically. I am not sure that flow cytometry is the best way to determine differences in protein expression levels for non-surface proteins, because many of the reported differences are not at all convincing.

      To detect intracellular/nuclear proteins by flow cytometry, cells were permeabilized using pre-chilled 0.2% Triton X-100 for 10 minutes, as described in the Methods section.

      We have revised the figures (Fig 1c,e) and now include statistical analysis from three independent biological replicates for these experiments (Fig S1.4h-j, S2e, S6d).

      Fig 1d: has TRF2 been effectively silenced in this experiment? There appears to be just as many TRF2+ nuclei in the "TRF2 silenced" panel vs the control, including in the cells with neurite outgrowths.

      Quantification showing a decrease in nuclear TRF2 levels has been included in Supplementary Fig. S1g.

      Fig 2a-c: these experiments need a positive control, showing increased expression of these proteins in mNSC and SH-SY5Y cells in response to a DNA damaging agent. Again, flow cytometry may not be the best method for this; immunofluorescence combined with telomere FISH would be more convincing.

      We confirm that doxorubicin induces 53BP1 foci (IF-FISH, Sup Fig. S2b) and that TRF1 silencing elevates γH2AX (Sup Fig. S2c), validating DDR sensitivity. In contrast, upon TRF2 loss (Fig. 2a-c), no TIFs are detected by IF with telomere probes (Fig. 2d, Sup Fig. 2a), and without TIFs there is no telomeric fusion. Flow cytometry was performed after Triton X-100 permeabilization to detect nuclear proteins. These findings adequately address the concern; therefore, further IF-FISH experiments were not included in the present study.

      To conclude that telomere damage is not occurring, an independent marker of such damage, such as telomere fusions, should also be measured.

      In response to uncapped telomeres, ATM kinase activates the DNA damage response (DDR), recruiting γH2AX and 53BP1 to telomeres, which precedes end-to-end fusions (Takai et al., 2003; Maciejowski & de Lange, 2015; d'Adda di Fagagna et al., 2003; Cesare & Reddel, 2010; Hayashi et al., 2012; Sarek et al., 2015). We observe no DDR activation or foci (Fig. 2; Sup. Fig. 2). This absence of a DDR response and TIFs indicates no telomere uncapping, negating the need for direct telomere fusion analysis.

      Figure S2b is lacking a no-doxorubicin control.

      An untreated control has been included in Fig. S2b.

      Figures 3a and 3b need a positive control (e.g. TRF2 binding to telomeric DNA) and a negative control (e.g. a promoter that did not show any TRF2 binding in the HT1080 ChiP-seq experiment in Fig S3).

      We have included positive (telomere) and negative (GAPDH) controls (based on HT1080 TRF2 ChIP-seq data) for the TRF2 ChIP assay in Supplementary Fig. S3d,e. Additionally, positive and negative controls for all ChIP experiments conducted in this study are presented in Supplementary Figs. S3d, S3e, S3h, S3i, S4c-h, and S5c-e.

      The data in Figure 3 would be more compelling if all experiments were also performed in fibroblasts to confirm the cell-type specificity of the effect.

      Our HT1080 fibrosarcoma ChIP-seq data (ref. [18]; Sup. Fig. 3a,b) show TRF2 binding to TAN gene promoters in a fibroblast-derived model, with enrichment in neurogenesis-related genes (refs. [19,20]). In fibroblasts, TRF2 depletion, as expected, induces telomere dysfunction and DDR (Fig. 2d; Sup. Fig. 2a), and eventually cell-cycle arrest and cell death, as also reported earlier (van Steensel et al., 1998; Smogorzewska & de Lange, 2002). Therefore, the suggested experiments, which would require sustained TRF2 depletion, cannot be performed in fibroblasts. TRF2 occupancy at the promoters of the genes in question in cells other than NSCs was noted in HT1080 cells (ref. [18]; Sup. Fig. 3a,b).

      No references are provided for the TRF2 posttranslational modifications on R17, K176, K190 and T188. What is the evidence for these modifications, and is it known if they participate in the telomeric role of TRF2?

      These lines with references have been included in the manuscript (highlighted in blue).

      R17 methylation enhances telomere stability (66). K176/K190 acetylation stabilizes telomeres and is deacetylated by SIRT6 (67). T188 phosphorylation facilitates telomere repair after DSBs (68). These PTMs primarily support telomeric roles.

      The experiments in Fig 5 should also be performed with WT TRF2, to confirm that effects are not due to the overexpression of TRF2.

      WT TRF2 shows no differentiation phenotype and no change in TAN gene expression (Fig. 1f,g; 3h; Sup. Fig. 5a), confirming that the effects are not due to TRF2 overexpression.

      Fig 5c has not been described in the text, and there are multiple technical problems with the TRF2 WT experiment: i) There appears to be significant background binding of REST to the IgG beads, though this blot has such high background it is hard to tell (the REST blot in Fig S4b is also of poor quality), ii) TRF2 is migrating at two different positions in the Input and IP lanes, and the TRF2 band in the K176R blot is at a different position to either, and iii) the relative loading of the Input and IP lanes is not indicated, so it's not clear why K176R appears to be so enriched in the IP.

      We acknowledge the oversight in not citing Fig. 5c in the manuscript. This has been corrected and highlighted in blue in the revised manuscript.

      i) Multiple optimization attempts were made for the Co-IP experiments, and the presented figure reflects the best achievable result despite REST blot smearing, a pattern also reported previously (Ref. 65). The TRF2-REST interaction is well established, and a similar background was also observed in the cited study.

      ii) Variable migration patterns of TRF2 were also noted in the cited study (Ref. 65), consistent with our observations. Our primary emphasis, however, is on the TRF2 K176R mutant, which clearly disrupts its interaction with REST.

      iii) The input loading corresponds to 10% of the total lysate. As the experiments were conducted independently, variations in transfection and pull-down efficiencies may account for the observed differences.

      To rule out indirect effects of the G4 ligands on the results in Fig 6g, the binding of BG4 and TRF2 at the promoters of these genes should be measured by ChIP.

      To confirm that G4 ligand effects on TAN gene promoters are direct, TRF2 occupancy was assessed using ChIP. Significantly decreased occupancy of TRF2 was noted at TAN gene promoters (added Fig. 6h). This implies that ligand-induced changes in TRF2 binding are directly linked to promoter-level G4 stabilization.

      Minor comments:

      1. The size of all the size markers in western blots should be added to the figures.

      Size markers have been included in all the western blots.

      2. There are several figure panels that are incorrectly referenced in the text, e.g. Fig S1.1 (e-f) should be Fig S1.1 (e-h); Fig. 1m should be Fig. 1f; Figs 5e and 5f have been swapped.

      Corrected.

      1. Fig S1.4 is not referred to in the text. It is not clear what the purpose of Fig S1.4a is.

      The following line has been included in the manuscript highlighted in blue.

      Neurospheres were characterized using PAX6, a NSC marker (Fig S1.4a).

      Are the experiments in Figs 3e, 4a, 4c and 4e using 4-OHT treatment, or siRNA? If the latter, I don't think a control for the effectiveness of the knockdown in this cell type has been included anywhere in the manuscript.

      These experiments use siRNA; a western blot showing the effectiveness of the knockdown is presented in Supplementary Figure S4c (now S4a).

      The lanes of the western blots in Fig S4c are not labelled.

      Corrected.

      1. Given that the experiments in Fig 5 were carried out on a background of endogenous WT TRF2 expression, presumably the K176R mutant is having a dominant negative effect. To understand the mechanism of this effect (e.g, is it simply due to replacement of endogenous WT TRF2 at its genomic binding sites by a large excess of exogenous K176R, or is dimerisation with WT TRF2 needed?) it would be helpful to know the relative expression levels of endogenous and K176R TRF2.

      To address the query, qRT-PCR with 3′ UTR-specific primers showed no change in endogenous TRF2 mRNA upon K176R expression in SH-SY5Y cells, while primers detecting total TRF2 revealed ~10-fold higher expression of K176R compared to control (Figure below). This indicates the absence of suppression of endogenous TRF2 mRNA. Given that the mutant's DNA binding is intact (Fig. 5f), the dominant-negative effect of K176R likely arises from overexpression of the exogenous mutant.

      For the sentence "...and critical for transcription factor binding including epigenetic functions that are G4 dependent" (bottom of page 3 of the PDF), the authors cite only their own prior papers, but there are examples from others that could be cited.

      We have incorporated citations from other research groups, now included as references 23-26.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We thank the reviewers for their thoughtful and constructive feedback, which helped us strengthen the study on both the computational and biological sides. In response, we added substantial new analyses and results in a total of 26 new supplementary figures and a new supplementary note. Importantly, we demonstrated that our approach generalizes beyond tissue outcomes by predicting final-timepoint morphology clusters from early frames with good accuracy (new Figure 4C). Furthermore, we completely restructured and expanded the human expert panel: six experts now provided >30,000 annotations across evenly spaced time intervals, allowing us to benchmark human predictions against CNNs and classical models under comparable conditions. We verified that morphometric trajectories are robust: PCA-based reductions and nearest-neighbor checks confirmed that patterns seen in t-SNE/UMAP are genuine, not projection artifacts. To test whether z-stacks are required, we repeated all analyses with sum- and maximum-intensity projections across five slices; results were unchanged, showing that single-slice imaging is sufficient. From a bioinformatics perspective, we performed negative-label baselines, downsampling analyses to quantify dataset needs, and statistical tests confirming that CNNs significantly outperform classical models. Biologically, we clarified that each well contains one organoid, introduced the Latent Determination Horizon concept tied to expert visibility thresholds, and discussed limits in cross-experiment transfer alongside strategies for domain adaptation and adaptive interventions. Finally, we clarified methods, corrected terminology and a scaler leak, and made all code and raw data publicly available.

      Together, these revisions in our opinion provide an even clearer, more reproducible, and stronger case for the utility of predictive modeling in retinal organoid development.


      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This study presents predictive modeling for developmental outcome in retinal organoids based on high-content imaging. Specifically, it compares the predictive performance of an ensemble of deep learning models with classical machine learning based on morphometric image features and predictions from human experts for four different tasks: prediction of RPE presence and lens presence (at the end of development) as well as the respective sizes. It finds that the DL model outperforms the other approaches and is predictive from early timepoints on, strongly indicating a time-frame for important decision steps in the developmental trajectory.

      Response: We thank the reviewer for the constructive and thoughtful feedback. In response to the review as found below, we have made substantial revisions and additions to the manuscript. Specifically, we clarified key aspects of the experimental setup, changed terminology regarding training/validation/test sets, and restructured our human expert baseline analysis by collecting and integrating a substantially larger dataset of expert annotations according to suggestion. We introduced the Latent Determination Horizon concept with clearer rationale and grounding. Most importantly, we significantly expanded our interpretability analyses across three CNN architectures and eight attribution methods, providing comprehensive quantitative evaluations and supplementary figures that extend beyond the initial DenseNet121 examples (new Supplementary Figures S29-S37). We also ensured full reproducibility by making both code and raw data publicly available with documentation. While certain advanced interpretability methods (e.g., Discover) could not be integrated despite considerable effort, we believe the revised manuscript presents a robust, well-documented, and carefully qualified analysis of CNN predictions in retinal organoid development.

      Major comments: I find the paper over-all well written and easy to understand. The findings are relevant (see significance statement for details) and well supported. However, I have some remarks on the description and details of the experimental set-up, the data availability and reproducibility / re-usability of the data.

      1. Some details about the experimental set-up are unclear to me. In particular, it seems like there is a single organoid per well, as the manuscript does not mention any need for instance segmentation or tracking to distinguish organoids in the images and associate them over time. Is that correct? If yes, it should be explicitly stated so. Are there any specific steps in the organoid preparation necessary to avoid multiple organoids per well? Having multiple organoids per well would require the aforementioned image analysis steps (instance segmentation and tracking) and potentially add significant complexity to the analysis procedure, so this information is important to estimate the effort for setting up a similar approach in other organoid cultures (for example cancer organoids, where multiple organoids per well are common / may not be preventable in certain experimental settings).

      Response: We thank the reviewer for this question. We agree that these preprocessing steps would add more complexity to our presented preprocessing steps and would definitely be required in some organoid systems. In our experimental setup, there is only one organoid per well which forms spontaneously after cell seeding from (almost) all seeded cells. There are no additional steps necessary in order to ensure this behaviour in our setup. We amended the Methods section to now explicitly state this accordingly (paragraph ‘Organoid timelapse imaging’).

      The terminology used with respect to the test and validation set is contrary to the field, and reporting the results on the test set (should be called validation set), should be avoided since it is used to select models. In more detail: the terms "test set" and "validation set" (introduced in 213-221) are used with the opposite meaning to their typical use in the deep learning literature. Typically, the validation set refers to a separate split that is used to monitor convergence / avoid overfitting during training, and the test set refers to an external set that is used to evaluate the performance of trained models. The study uses these terms in an opposite manner, which becomes apparent from line 624: "best performing model ... judged by the loss of the test set.". Please exchange this terminology, it is confusing to a machine learning domain expert. Furthermore, the performance on the test set (should be called validation set) is typically not reported in graphs, as this data was used for model selection, and thus does not provide an unbiased estimate of model performance. I would remove the respective curves from Figures 3 and 4.

      Response: We are thankful for the reviewer's comments on this matter. Indeed, we were using terminology opposite to what is commonly used within the field. We have adjusted the Results, Discussion and Methods sections as well as the figures accordingly. Further, we added a corresponding disclaimer for the code base in the github repository. However, we prefer not to remove the respective curves from the figures. We think that this information is crucial to interpret the variability in accuracy between organoids from the same experiments and organoids acquired from a different, independent experiment. The results suggest that the accuracy for organoids within the same experiments is still higher, indicating to users the potential accuracy drop resulting from independent experiments. As we think that this is crucial information for the interpretability of our results, we would like to still include it side-by-side with the test data in the figures.

      The experimental set-up for the human expert baseline is quite different to the evaluation of the machine learning models. The former is based on the annotation of 4,000 images by seven experts, the latter on cross-validation experiments on a larger dataset. First of all, the details on the human expert labeling procedure are very sparse; I could only find a very short description in the paragraph 136-144, but did not find any further details in the methods section. Please add a methods section paragraph that explains in more detail how the images were chosen, how they were assigned to annotators, and if there was any redundancy in annotation, and if yes how this was resolved / evaluated. Second, the fact that the set-up for human experts and ML models is quite different means that these values are not quite comparable in a statistical sense. Ideally, human estimators would follow the same set-up as in ML (as in, evaluate the same test sets). However, this would likely be prohibitive in the required effort, so I think it's enough to state this fact clearly, for example by adding a comment on this to the captions of Figure 3 and 4.

      Response: We thank the reviewer for this constructive suggestion. We agree that the curves for human evaluations in the original draft were calculated differently from the curves for the classification algorithms, mostly because of the feasibility of dataset annotation at the time. To address this suggestion, we repeated the human expert annotation in full and substantially expanded the number of annotated images. Each of 6 human experts was asked to predict/interpret 6 images of each organoid within the full dataset. To select the images, we divided the time course (0-72h) into 6 evenly spaced intervals of 12 hours. For each interval, one image per organoid and human expert was randomly selected and assigned. This resulted in a total of 31,626 classified images (up from 4000 in the original version of the manuscript); the assigned images overlapped between experts with respect to the source interval but not with respect to the individual images. We then changed the calculation of the curves to be the same as for the classification analysis: F1 data were calculated for each experiment over 6 timeframes and all experts, and plotted within the respective figure. We have amended the Methods section accordingly and replaced the respective curves within Figures 3 and 4 and Supplementary Figures S1, S8 and S19.
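
      As an illustration only, the sampling scheme described in this response could be sketched as follows. This is not the authors' code: the function name `assign_images`, the 30-minute frame spacing, and the expert labels are hypothetical, and drawing frames without replacement within each interval is an assumption made here so that no two experts receive the same image.

```python
import random

def assign_images(frame_hours, experts, n_intervals=6, t_end=72.0, seed=0):
    """For one organoid, give each expert one random frame per time interval.

    frame_hours: acquisition time (in hours) of each image of this organoid.
    Returns {expert: [frame index for interval 0, interval 1, ...]}.
    """
    rng = random.Random(seed)
    width = t_end / n_intervals  # 12 h when 0-72 h is split into 6 intervals

    # Group frame indices by the interval their timestamp falls into.
    by_interval = [[] for _ in range(n_intervals)]
    for idx, t in enumerate(frame_hours):
        k = min(int(t // width), n_intervals - 1)  # t == t_end joins the last interval
        by_interval[k].append(idx)

    # Assumption: sample without replacement so experts never share an image.
    # This requires at least len(experts) frames in every interval.
    assignment = {expert: [] for expert in experts}
    for frames in by_interval:
        picks = rng.sample(frames, len(experts))
        for expert, idx in zip(experts, picks):
            assignment[expert].append(idx)
    return assignment

# Hypothetical acquisition schedule: one frame every 30 minutes from 0 to 72 h.
frame_hours = [0.5 * i for i in range(145)]
experts = ["E1", "E2", "E3", "E4", "E5", "E6"]
assignment = assign_images(frame_hours, experts)
```

      With six experts and six intervals, each organoid contributes 36 annotated images in this sketch, every expert sees each interval once, and the images within an interval differ between experts.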

      It is unclear to me where the theoretical time window for the Latent Determination Horizon in Figure 5 (also mentioned in line 350) comes from? Please explain this in more detail and provide a citation for it.

      Response: We thank the reviewer for this important point. The Latent Determination Horizon (LDH) is a conceptual framework we introduced in this study to describe the theoretical period during which the eventual presence of a tissue outcome of interest (TOI) is being determined but not yet detectable. It is derived from two main observations in our dataset: (i) the inherent intra- and inter-experimental heterogeneity of organoid outcomes despite standardized protocols, and (ii) the progressive increase in predictive performance of our deep learning models over time, which suggests that informative morphological features only emerge gradually. We have now clarified this rationale in the manuscript (Discussion section) further and explicitly stated that the LDH is a concept we introduce here, rather than a previously described or cited term.

      The time window is defined by the TOI visibility, which is defined empirically as indicated by the results of our human expert panel (compare also Supplementary Figure S1).

      The interpretability analysis (Figure 4, 634-639) based on relevance backpropagation was performed based on DenseNet121 only. Why did you choose this model and not the ResNet / MobileNet? I think it is quite crucial to see if there are any differences between these models, as this would show how much weight can be put on the evidence from this analysis, and I would suggest adding an additional experiment and supplementary figure on this.

      Response: We thank the reviewer for this important comment regarding the interpretability analysis and the choice of model. In the original submission, we restricted the attribution analyses shown in original Figure 4C to DenseNet121, which served as our main reference model throughout the study. This choice was made primarily for clarity and to avoid redundancy in the main figures, as all three convolutional neural network (CNN) architectures (DenseNet121, ResNet50, MobileNetV3_Large) achieved comparable classification performance on our tasks.

      In response to the reviewer’s concern, we have now extended the interpretability analyses to include all three CNN architectures and a total of eight attribution methods (new Supplementary Note 1). Specifically, we generated saliency maps for DenseNet121, ResNet50, and MobileNetV3_Large across multiple time points and evaluated them using a systematic set of metrics: pairwise method agreement within each model (new Supplementary Figure S29), cross-model consistency per method (new Supplementary Figure S34), entropy and diffusion of saliencies over time (new Supplementary Figure S35), regional voting overlap across methods (new Supplementary Figure S36), and spatial drift of saliency centers of mass (new Supplementary Figure S37).

      These pooled analyses consistently showed that attribution methods differ markedly in the regions they prioritize, but that their relative behaviors were mostly stable across the three CNN architectures. For example, Grad-CAM and Guided Grad-CAM exhibited strong internal agreement and progressively focused relevance into smaller regions, while gradient-based methods such as DeepLiftSHAP and Integrated Gradients maintained broader and more diffuse relevance patterns but were the most consistent across models. Perturbation-based methods like Feature Ablation and Kernel SHAP often showed decreasing entropy and higher spatial drift, again similarly across architectures.
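
      To make the saliency metrics mentioned above more concrete, the sketch below shows one plausible way to compute the entropy of a saliency map and the drift of its center of mass between two time points. The exact definitions used in the manuscript are not given in this response, so the normalization by absolute values, the base-2 logarithm, and the function name `saliency_stats` are assumptions for illustration.

```python
import math

def saliency_stats(saliency):
    """Return (Shannon entropy in bits, (row, col) center of mass) of a 2D map.

    The map is converted to a probability distribution by taking absolute
    values and dividing by their sum (an assumption of this sketch).
    """
    total = sum(abs(v) for row in saliency for v in row)
    if total == 0:
        raise ValueError("saliency map is all zeros")
    entropy, cy, cx = 0.0, 0.0, 0.0
    for y, row in enumerate(saliency):
        for x, v in enumerate(row):
            p = abs(v) / total
            if p > 0:
                entropy -= p * math.log2(p)  # diffuse maps give high entropy
            cy += p * y
            cx += p * x
    return entropy, (cy, cx)

# Toy maps: relevance spread evenly vs. concentrated on a single pixel.
diffuse = [[1.0, 1.0], [1.0, 1.0]]
focused = [[0.0, 0.0], [0.0, 3.0]]
h_diffuse, com_diffuse = saliency_stats(diffuse)
h_focused, com_focused = saliency_stats(focused)

# Spatial drift: Euclidean distance between the two centers of mass.
drift = math.dist(com_diffuse, com_focused)
```

      In this toy case the entropy falls from 2 bits to 0 bits as relevance concentrates on one pixel, which is the kind of trend described above for methods that progressively focus relevance into smaller regions.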

      To further address the reviewer’s point, we visualized the organoid depicted in original Figure 4C across all three CNNs and all eight attribution methods (new Supplementary Figures S30-S33). These comparisons confirm and extend analysis of the qualitative patterns described in original Figure 4C and show that they are not specific to DenseNet121, but are representative of the general behavior across architectures.

      In sum, we observed notable differences in how relevance was assigned and how consistently these assignments aligned. Highlighted organoid patterns were not consistent enough across attribution methods for us to be comfortable to base unequivocal biological interpretation on them. Nevertheless we believe that the analyses in response to the reviewer’s suggestions (new Supplementary Note 1 and new Supplementary Figures S29-S37) add valuable context to what can be expected from machine learning models in an organoid research setting.

      As we did not base further unequivocal biological claims on the relevance backpropagation, we decided to move the analyses to the Supporting Information and now show a new model predicting organoid morphology by morphometrics clustering at the final imaging timepoint in new Figure 4C in line with suggestions by Reviewer #3.

      The code referenced in the code availability statement is not yet present. Please make it available and ensure a good documentation for reproducibility. Similarly, it is unclear to me what is meant by "The data that supports the findings will be made available on HeiDoc". Does this only refer to the intermediate results used for statistical analysis? I would also recommend to make the image data of this study available. This could for example be done through a dedicated data deposition service such as BioImageArchive or BioStudies, or with less effort via zenodo. This would ensure both reproducibility as well as potential re-use of the data. I think the latter point is quite interesting in this context; as the authors state themselves it is unclear if prediction of the TOIs isn't even possible at an earlier point that could be achieved through model advances, which could be studied by making this data available.

      Response: We thank the reviewer for this comment. We have now made the repository and raw data public on the suggested platform (Zenodo) and apologize for this oversight. The links are contained within the github repository which is stated in the manuscript under “Data availability”.

      Minor comments:

      Line 315: Please add a citation for relevance backpropagation here.

      Response: We have included citations for all relevance backpropagation methods used in the paper.

      Line 591: There seems to be typo: "[...] classification of binary classification [...]"

      Response: Corrected as suggested.

      Line 608: "[...] where the images of individual organoids served as groups [...]" It is unclear to me what this means.

      Response: We wanted to express that organoid images belonging to one organoid were assigned in full to a training/validation set. We have now stated this more clearly in the Methods section.
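
      The idea of treating each organoid as a group can be illustrated with a small sketch. It is not taken from the authors' repository: the function `group_split`, the validation fraction, and the organoid identifiers are hypothetical, and only the principle (all images of one organoid end up in the same split) comes from the response.

```python
import random

def group_split(organoid_ids, val_fraction=0.2, seed=0):
    """Split image indices so that each organoid is entirely in one split.

    organoid_ids: one organoid identifier per image.
    Returns (train_indices, validation_indices).
    """
    groups = sorted(set(organoid_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train_idx = [i for i, g in enumerate(organoid_ids) if g not in val_groups]
    val_idx = [i for i, g in enumerate(organoid_ids) if g in val_groups]
    return train_idx, val_idx

# Hypothetical dataset: 10 organoids with 5 timelapse images each.
organoid_ids = [f"org{w:02d}" for w in range(10) for _ in range(5)]
train_idx, val_idx = group_split(organoid_ids)
```

      Splitting by organoid rather than by image prevents near-identical consecutive frames of the same organoid from appearing on both sides of the split.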

      Reviewer #1 (Significance (Required)):

      General assessment: This study demonstrates that (retinal) organoid development can be predicted from early timepoints with deep learning, where these cannot be discerned by human experts or simpler machine learning models. This fact is very interesting in itself due to its implication for organoid development, and could provide a valuable tool for molecular analysis of different organoid populations, as outlined by the authors. The contribution could be strengthened by providing a more thorough investigation of what features in the image are predictive at early timepoints, using a more sophisticated approach than relevance backprop, e.g. Discover (https://www.nature.com/articles/s41467-024-51136-9). This could provide further biological insight into the underlying developmental processes and enhance the understanding of retinal organoid development.

      Response: We thank the reviewer for this assessment and suggestion. We agree that identifying image features predictive at early timepoints would add important biological context. We therefore attempted to apply Discover to our dataset. However, we were unable to get the system to run successfully. After considerable effort, we concluded that this approach could not be integrated into our current analysis. Instead, we report our substantially expanded results obtained with relevance backpropagation, which provided the most interpretable and reproducible insights for our study as described above (New Supplementary Note 1, new Supplementary Figures S29-S37).

      Advance: similar studies that predict developmental outcome based on image data, for example cell proliferation or developmental outcome, exist. However, to the best of my knowledge, this study is the first to apply such a methodology to organoids, and it convincingly shows its efficacy and argues its potential practical benefits. It thus constitutes a solid technical advance that could be especially impactful if it could be translated to other organoid systems in the future.

      Response: We thank the reviewer for this positive assessment of our work and for highlighting its novelty and potential impact. We are encouraged that the reviewer recognizes the value of applying predictive modeling to organoids and the opportunities this creates for translation to other organoid systems.

      Audience: This research is of interest to a technical audience. It will be of immediate interest to researchers working on retinal organoids, who could adapt and use the proposed system to support experiments by better distinguishing organoids during development. To enable this application, code and data availability should be ensured (see above comments on reproducibility). It is also of interest to researchers in other organoid systems, who may be able to adapt the methodology to different developmental outcome predictions. Finally, it may also be of interest to image analysis / deep learning researchers as a dataset to improve architectures for predictive time series modeling.

      My research background: I am an expert in computer vision and deep learning for biomedical imaging, especially in microscopy. I have some experience developing image analysis for (cancer) organoids. I don't have any experience on the wet lab side of this work.

      Response: We thank the reviewer for this encouraging feedback and for recognizing the broad relevance of our work across retinal organoid research, other organoid systems, and the image analysis community. We are pleased that the potential utility of our dataset and methodology is appreciated by experts in computer vision and biomedical imaging. We have now made the repository and raw data public and apologize for this oversight. The links are provided in the manuscript under “Data availability”.

      Constantin Pape


      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary: Afting et al. present a computational pipeline for analyzing timelapse brightfield images of retinal organoids derived from Medaka fish. Their pipeline processes images along two paths: 1) morphometrics (based on computer vision features from skimage) and 2) deep learning. They discovered, through extensive manual annotation of ground truth, that their deep learning method could predict retinal pigmented epithelium and lens tissue emergence in time points earlier than either morphometrics or expert predictions. Our review is formatted based on the review commons recommendation.

      Response: We thank the reviewer for the detailed and constructive feedback, which has greatly improved the clarity and rigor of our manuscript. In response, we have corrected a potential data leakage issue, re-ran the affected analyses, and confirmed that results remain unchanged. We clarified the use of data augmentation in CNN training, tempered some claims throughout the text, and provided stronger justification for our discretization approach together with new supplementary analyses (New Supplementary Figures S26, S27). We substantially expanded our interpretability analyses across three CNN architectures and eight attribution methods, quantified their consistency and differences (new Supplementary Figures S29, S34-S37, new Supplementary Note 1), and added comprehensive visualizations (New S30-S33). We also addressed technical artifact controls, provided downsampling analyses to support our statement on sample size sufficiency (new Supplementary Figure S28), and included negative-control baselines with shuffled labels in Figures 3 and 4. Furthermore, we improved the clarity of terminology, figures, and methodological descriptions, and we have now made both code and raw data publicly available with documentation. Together, we believe these changes further strengthen the robustness, reproducibility, and interpretability of our study while carefully qualifying the claims.

      Major comments:

      Are the key conclusions convincing?

      Yes, the key conclusion that deep learning outperforms morphometric approaches is convincing. However, several methodological details require clarification. For instance, were the data splitting procedures conducted in the same manner for both approaches? Additionally, the authors note in the methods: "The validation data were scaled to the same range as the training data using the fitted scalers obtained from the training data." This represents a classic case of data leakage, which could artificially inflate performance metrics in traditional machine learning models. It is unclear whether the deep learning model was subject to the same issue. Furthermore, the convolutional neural network was trained with random augmentations, effectively increasing the diversity of the training data. Would the performance advantage still hold if the sample size had not been artificially expanded through augmentation?

      Response: We thank the reviewer for raising these important methodological points. As Reviewer #1 correctly noted, our use of the terms validation and test may have contributed to confusion. To clarify: in the original analysis the scalers were fitted on the training and validation data and then applied to the test data. This indeed constitutes a form of data leakage. We have corrected the respective code, re-ran all analyses that were potentially affected, and did not observe any meaningful change in the reported results. The Methods section has been amended to clarify this important detail.
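
      A leakage-free scaling step of the kind described here can be sketched as follows. This is a minimal illustration that assumes min-max scaling; the response does not state which scaler was used, and the helper names `fit_minmax` and `apply_minmax` as well as the toy feature values are hypothetical.

```python
def fit_minmax(rows):
    """Learn per-feature (min, max) from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_minmax(rows, params):
    """Scale rows with previously fitted parameters (no refitting)."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # guard constant features
            for x, (lo, hi) in zip(row, params)
        ])
    return scaled

# Toy morphometric features (two features per organoid image).
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test = [[7.0, 15.0]]

params = fit_minmax(train)                 # fitted on training data only
train_scaled = apply_minmax(train, params)
test_scaled = apply_minmax(test, params)   # held-out data never influences params
```

      Because the parameters come from the training rows alone, held-out values may fall outside the range [0, 1], as the first test feature does here; that is expected and indicates that no information from the held-out data entered the scaler.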

      For the neural networks, each image was normalized independently (per image), without using dataset-level statistics, thereby avoiding any risk of data leakage.
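
      Per-image normalization can be sketched as below. The response does not specify the normalization formula, so the zero-mean, unit-variance variant shown here and the function name `normalize_image` are assumptions; the point is only that each image is normalized with its own statistics.

```python
from statistics import fmean, pstdev

def normalize_image(pixels):
    """Standardize one image using only its own mean and standard deviation."""
    flat = [v for row in pixels for v in row]
    mu, sigma = fmean(flat), pstdev(flat)
    if sigma == 0:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in pixels]
    return [[(v - mu) / sigma for v in row] for row in pixels]

# Toy 2x2 grayscale image.
image = [[0.0, 2.0], [4.0, 6.0]]
normalized = normalize_image(image)
```

      Since no statistic is shared between images, the training, validation, and test splits cannot influence one another through this step.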

      Regarding data augmentation, the convolutional neural network was indeed trained with augmentations. Early experiments without augmentation led to severe overfitting, confirming that the performance advantage would not hold without artificially increasing the effective sample size. We have added a clarifying statement in the Methods section to make this explicit.

      Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether? Their claims are currently preliminary, pending increased clarity and additional computational experiments described below.

      Response: We believe our additionally performed computational experiments qualify all the claims we make in the revised version of the manuscript.

      Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      • The authors discretize continuous variables into four bins for classification. However, a regression framework may be more appropriate for preserving the full resolution of the data. At a minimum, the authors should provide a stronger justification for this binning strategy and include an analysis of bin performance. For example, do samples near bin boundaries perform comparably to those near the bin centers? This would help determine whether the discretization introduces artifacts or obscures signals.

      Response: We thank the reviewer for this thoughtful suggestion. We agree that regression frameworks can, in principle, preserve the full resolution of continuous outcome variables. However, in our setting we deliberately chose a discretization approach. First, the discretized outcome categories correspond to ranges of tissue sizes that are biologically meaningful and allow direct comparison to expert annotations. In practice, human experts also tend to judge tissue presence and size in categorical rather than strictly continuous terms, which was mirrored by our human expert annotation strategy. As we aimed to compare deep learning with classical machine learning models and with expert annotations across the same prediction tasks, a categorical outcome formulation provided the most consistent and fair framework. Secondly, the underlying outcome variables did not follow a normal distribution, but instead exhibited a skewed and heterogeneous spread. Regression models trained on such distributions often show biases toward the most frequent value ranges, which may obscure less common but biologically important outcomes. Discretization mitigated this issue by balancing the prediction task across defined size categories.

      In line with the reviewer’s request, we have now analyzed the performance in relation to the distance of each sample from the bin center. These results are provided as new Supplementary Figures S26 and S27. Interestingly, for the classical machine learning classifiers, F1 scores tended to be somewhat higher for samples close to bin edges. For the convolutional neural networks, however, F1 scores were more evenly distributed across distances from bin centers. While the reason for this difference remains unclear, the analysis demonstrates that the discretization did not obscure predictive signals in either framework. We have amended the results section accordingly.
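
      The bin-distance analysis described above can be illustrated with a minimal sketch. This is not the authors' code: the bin edges, the 0.5 cutoff separating "near center" from "near edge", and all function names are assumptions made for illustration. Each sample is assigned to a bin, its distance to the bin center is normalized to the half-width of the bin, and macro F1 is compared between the two groups.

```python
import numpy as np


def bin_and_center_distance(values, edges):
    """Assign values to bins and return the normalized distance to the bin center.

    edges: increasing boundaries of length n_bins + 1 (hypothetical values).
    The distance is 0 at the bin center and 1 at a bin edge.
    """
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges, dtype=float)
    # Using only the inner edges yields labels 0 .. n_bins - 1
    labels = np.digitize(values, edges[1:-1])
    lower, upper = edges[labels], edges[labels + 1]
    half_width = (upper - lower) / 2.0
    distance = np.abs(values - (lower + upper) / 2.0) / half_width
    return labels, distance


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; classes absent from both arrays are skipped."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        if denom > 0:
            scores.append(2 * tp / denom)
    return float(np.mean(scores)) if scores else float("nan")


def f1_by_distance(y_true, y_pred, distance, n_classes, cutoff=0.5):
    """Compare macro F1 for samples near the bin center versus near a bin edge."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    near_center = np.asarray(distance) <= cutoff
    return {
        "near_center": macro_f1(y_true[near_center], y_pred[near_center], n_classes),
        "near_edge": macro_f1(y_true[~near_center], y_pred[~near_center], n_classes),
    }
```

      A large gap between the two values would suggest that the binning itself shapes the reported performance; similar values would be consistent with the authors' statement that discretization did not obscure predictive signal.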

      • The relevance backpropagation interpretation analysis is not convincing. The authors argue that the model's use of pixels across the entire image (rather than just the RPE region) indicates that the deep learning approach captures holistic information. However, only three example images are shown out of hundreds, with no explanation for their selection, limiting the generalizability of the interpretation. Additionally, it is unclear how this interpretability approach would work at all in earlier time points, particularly before the model begins making confident predictions around the 8-hour mark. It is also not specified whether the input used for GradSHAP matches the input used during CNN training. The authors should consider expanding this analysis by quantifying pixel importance inside versus outside annotated regions over time. Lastly, Figure 4C is missing a scale bar, which would aid in interpretability.

      Response: We thank the reviewer for raising these important concerns. In the initial version we showed examples of relevance backpropagation that suggested CNNs rely on visible RPE or lens tissue for their predictions (original Figure 4C). Following the reviewer’s comment, we expanded the analysis extensively across all models and attribution methods (compare new Supplementary Note 1), and quantified agreement, consistency, entropy, regional overlap, and drift (new Supplementary Figures S29 and S34-S37), as well as providing comprehensive visualizations across models and methods (new Supplementary Figures S30-S33).

      This extended analysis showed that attribution methods behave very differently from each other, but consistently so across the three CNN architectures. Each method displayed characteristic patterns, for example in entropy or center-of-mass drift, but the overlap between methods was generally low. While integrated gradients and DeepLiftSHAP tended to concentrate on tissue regions, other methods produced broader or shifting relevance patterns, and overall we could not establish robust or interpretable signals from a biological point of view that would support stronger conclusions.

      We have therefore revised the text to focus on descriptive results only, without making claims about early structural information or tissue-specific cues being used by the networks. We also added missing scale bars and clarified methodological details. Together, the revised section now reflects the extensive work performed while remaining cautious about what can and cannot be inferred from saliency methods in this setting.
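
      For readers unfamiliar with the summary statistics named above, the following sketch shows one plausible way to compute entropy, center of mass, and regional overlap for a single attribution map. The exact definitions used in the manuscript are not given in this response, so the formulas and function names below are assumptions, not the authors' implementation.

```python
import numpy as np


def attribution_entropy(attribution):
    """Shannon entropy (bits) of the absolute attribution, normalized to sum to 1.

    Low values mean relevance is concentrated on few pixels; high values mean it is diffuse.
    """
    a = np.abs(np.asarray(attribution, dtype=float)).ravel()
    total = a.sum()
    if total == 0:
        return 0.0
    p = a / total
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def center_of_mass(attribution):
    """(row, col) center of mass of the absolute attribution; NaN for an all-zero map."""
    a = np.abs(np.asarray(attribution, dtype=float))
    total = a.sum()
    if total == 0:
        return np.array([np.nan, np.nan])
    rows, cols = np.indices(a.shape)
    return np.array([(rows * a).sum() / total, (cols * a).sum() / total])


def fraction_inside_mask(attribution, mask):
    """Share of total absolute attribution that falls inside an annotated region."""
    a = np.abs(np.asarray(attribution, dtype=float))
    total = a.sum()
    if total == 0:
        return float("nan")
    return float(a[np.asarray(mask, dtype=bool)].sum() / total)
```

      Under these assumed definitions, drift over time would be the Euclidean distance between `center_of_mass` values of consecutive frames, and `fraction_inside_mask` corresponds to the inside-versus-outside quantification the reviewer requested.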

      • The authors claim that they removed technical artifacts to the best of their ability, but it is unclear if the authors performed any adjustment beyond manual quality checks for contamination. Did the authors observe any illumination artifacts (either within a single image or over time)? Any other artifacts or procedures to adjust?

      Response: We thank the reviewer for this comment. We have not performed any adjustment beyond manual quality control post organoid seeding. The aforementioned removal of technical artifacts included, among others, seeding at the same time of day, seeding and cell processing by the same investigator according to a standardized protocol, use of consistent chemicals (same lot, frozen only once, etc.) and temperature control during image acquisition. We adhered strictly to internal, previously published workflows aimed at reducing variability due to technical variations during cell harvesting, organoid preparation and imaging. We have clarified this important point in the Methods section.

      • In line 434-436 the authors state "In this work, we used 1,000 organoids in total, to achieve the reported prediction accuracies. Yet, we suspect that as little as ~500 organoids are sufficient to reliably recapitulate our findings." It is unclear what evidence the authors use to support this claim? The authors could perform a downsampling analysis to determine tradeoff between performance and sample size.

      Response: We thank the reviewer for this important comment. To clarify, our statement regarding the sufficiency of ~500 organoids was based on a downsampling-style analysis we had already performed. In this analysis, we systematically reduced the number of experiments used for training and assessed predictive performance for both CNN- and classifier-based approaches (former Supplementary Figure S11, new Supplementary Figure S28). For CNNs, performance curves plateaued at approximately six experiments (corresponding to ~500 organoids), suggesting that increasing the sample size further only marginally improved prediction accuracy. In contrast, we did not observe a clear plateau for the machine learning classifiers, indicating that these models can achieve comparable performance with fewer training experiments. We have revised the manuscript text to clarify that this conclusion is derived from these analyses, and continue to include Supplementary Figure S11 as new Supplementary Figure S28 for transparency (compare Supplementary Note 1).
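
      A downsampling analysis of this kind can be sketched as follows. This is an illustrative outline only: the nearest-centroid classifier is a stand-in for the authors' CNNs and classifiers, accuracy replaces F1 for brevity, and the function and argument names are hypothetical. Training experiments are subsampled at increasing counts and each model is scored on one held-out experiment.

```python
import numpy as np


def nearest_centroid_fit_predict(X_train, y_train, X_test):
    """Stand-in classifier (not the authors' models): assign the class of the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]


def downsampling_curve(X, y, experiment_ids, test_experiment, fit_predict,
                       n_repeats=5, seed=0):
    """Mean held-out accuracy as a function of the number of training experiments."""
    rng = np.random.default_rng(seed)
    test = experiment_ids == test_experiment
    train_experiments = np.unique(experiment_ids[~test])
    curve = {}
    for k in range(1, len(train_experiments) + 1):
        accuracies = []
        for _ in range(n_repeats):
            # Draw k whole experiments so that organoids of one experiment stay together
            chosen = rng.choice(train_experiments, size=k, replace=False)
            train = np.isin(experiment_ids, chosen)
            y_pred = fit_predict(X[train], y[train], X[test])
            accuracies.append(np.mean(y_pred == y[test]))
        curve[k] = float(np.mean(accuracies))
    return curve
```

      Sampling whole experiments rather than individual organoids keeps the held-out experiment independent of the training data, which matches the experiment-wise reduction described in the response.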

      Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments. Yes, we believe all experiments are realistic in terms of time and resources. We estimate all experiments could be completed in 3-6 months.

      Response: We confirm that the suggested experiments are realistic in terms of time and resources and have been able to complete them within 6 months.

      Are the data and the methods presented in such a way that they can be reproduced? No, the code is not currently available. We were not able to review the source code.

      Response: We have now made the repository public. We apologize for this initial oversight. The links are provided in the revised version of the manuscript under “Data availability”.

      Are the experiments adequately replicated and statistical analysis adequate?

      • The experiments are adequately replicated.

      • The statistical analysis (deep learning) is lacking a negative control baseline, which would be helpful to observe if performance is inflated.

      Response: We thank the reviewer for this comment. We have calculated the respective curves with neural networks and machine learning classifiers that were trained on data with shuffled labels and have included these results as a separate curve in the respective Figures 3 and 4. We have also amended the Methods section accordingly.
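
      The shuffled-label control can be sketched in a few lines. The classifier below is a simple stand-in rather than the authors' networks, accuracy is used in place of F1, and all names are hypothetical. Training labels are permuted, which breaks the link between features and labels while keeping class frequencies, and the resulting scores give a chance-level reference.

```python
import numpy as np


def nearest_centroid_fit_predict(X_train, y_train, X_test):
    """Stand-in classifier (not the authors' models): assign the class of the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]


def shuffled_label_baseline(X_train, y_train, X_test, y_test, fit_predict,
                            n_permutations=20, seed=0):
    """Held-out accuracy of models trained on permuted labels (negative control)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_permutations):
        # Permuting labels keeps class frequencies but removes any real signal
        y_perm = rng.permutation(y_train)
        y_pred = fit_predict(X_train, y_perm, X_test)
        scores.append(np.mean(y_pred == y_test))
    return np.array(scores)
```

      If the score obtained with true labels lies well above the distribution returned by `shuffled_label_baseline`, inflated performance from leakage or class imbalance becomes less likely.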

      Minor comments:

      Specific experimental issues that are easily addressable.

      Are prior studies referenced appropriately?

      Yes.

      Are the text and figures clear and accurate?

      The authors must improve clarity on terminology. For example, they should define a comprehensive dataset, significant, and provide clarity on their morphometrics feature space. They should elaborate on what they mean by "confounding factor of heterogeneity".

      Response: We thank the reviewer for highlighting the need to clarify terminology. We have revised the manuscript accordingly. Specifically, we now explicitly define comprehensive dataset as longitudinal brightfield imaging of ~1,000 organoids from 11 independent experiments, imaged every 30 minutes over several days, covering a wide range of developmental outcomes at high temporal resolution. Furthermore, we replaced the term significantly with wording that avoids implying statistical significance, where appropriate. We have clarified the morphometrics feature space in the Methods section in a more detailed fashion, describing the custom parameters that we used to enhance the regionprops_table function of skimage.

      Do you have suggestions that would help the authors improve the presentation of their data and conclusions?

      • Figure 2C describes a distance between what? The y axis is likely too simple. Same confusion over Figure 2D. Was distance computed based on tSNE coordinates?

      Response: We thank the reviewer for pointing out this potential source of confusion. The distances shown in original Figures 2C and 2D were not calculated in tSNE space. Instead, morphometrics features were first Z-scaled, and then dimensionality reduction by PCA was applied, with the first 20 principal components retaining ~93% of the variance. Euclidean distances were subsequently computed in this 20-dimensional PC space. For inter-organoid distances (Figure 2C), we calculated mean pairwise Euclidean distances between all organoids at each imaging time point, capturing the global divergence of organoid morphologies over time in an experiment-specific manner. For intra-organoid distances (Figure 2D), we calculated Euclidean distances between consecutive time points (n vs. n+1) for each individual organoid, thereby quantifying the extent of morphological change within organoids over time. We have revised the Figure legend and Methods section to make these definitions clearer.
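
      The distance definitions above can be written down compactly. The following is a sketch under stated assumptions, not the authors' code: the array layout (organoids × time points × features), the SVD-based PCA, and the function names are illustrative choices, and the per-experiment grouping mentioned in the response would be applied by calling these functions once per experiment.

```python
import numpy as np


def pca_scores(features_2d, n_components=20):
    """Z-scale features, then project onto the first principal components.

    features_2d: (n_samples, n_features). Returns the scores and the retained variance fraction.
    """
    features_2d = np.asarray(features_2d, dtype=float)
    std = features_2d.std(axis=0)
    std[std == 0] = 1.0  # constant features stay at zero after scaling
    z = (features_2d - features_2d.mean(axis=0)) / std
    # PCA via SVD of the centered, scaled matrix
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return z @ vt[:n_components].T, float(explained[:n_components].sum())


def inter_organoid_distance(scores):
    """Mean pairwise Euclidean distance between organoids at each time point.

    scores: (n_organoids, n_timepoints, n_components). Returns shape (n_timepoints,).
    """
    n, t, _ = scores.shape
    upper = np.triu_indices(n, k=1)
    out = np.empty(t)
    for i in range(t):
        frame = scores[:, i, :]
        dist = np.linalg.norm(frame[:, None, :] - frame[None, :, :], axis=-1)
        out[i] = dist[upper].mean()
    return out


def intra_organoid_distance(scores):
    """Euclidean distance between consecutive time points (n vs. n+1) per organoid.

    Returns shape (n_organoids, n_timepoints - 1).
    """
    return np.linalg.norm(np.diff(scores, axis=1), axis=-1)
```

      The features would first be flattened to (organoids × time points, features) for `pca_scores` and then reshaped back before computing either distance.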

      • The authors perform a Herculean analysis comparing dozens of different machine learning classifiers. They select two, but they should provide justification for this decision.

      Response: We thank the reviewer for this comment. In our initial machine learning analyses, we systematically benchmarked a broad set of classifiers on the morphometrics feature space, using cross-validation and hyperparameter tuning where appropriate. The classifiers that we ultimately focused on were those that consistently achieved the best performance in these comparisons. This process is described in the Methods and summarized in Supplementary Figures S4 and S15 (for sum- and maximum-intensity z-projections, see new Supplementary Figures S5/S6 and S16/S17), which show the results of the benchmarking. We have clarified the text to state that the selected classifiers were chosen on the basis of their superior performance in these evaluations.

      • It would be good to get a sense for how these retinal organoids grow - are they moving all over the place? They are in Matrigel so maybe not, but are they rotating?

      Can the authors' approach predict an entire non-emergence experiment? The authors tried to standardize the protocol, but ultimately, if it still yields this much heterogeneity, then how well it will actually generalize to a different lab is a limitation.

      Response: We thank the reviewer for these thoughtful questions. The retinal organoids in our study were embedded in low concentrations of Matrigel and remained relatively stable in position throughout imaging. We did not observe substantial displacement or lateral movement of organoids, and no systematic rotation could be detected in our dataset. Small morphological rearrangements within organoids were observed, but the gross positioning of organoids within the wells remained consistent across time-lapse recordings.

      Regarding generalization across laboratories, we agree with the reviewer that this is an important limitation. While we minimized technical variability by adhering to a highly standardized, published protocol (see Methods), considerable heterogeneity remained at both intra- and inter-experimental levels. This variability likely reflects inherent properties of the system, similar to reports in the literature across organoid systems, rather than technical artifacts, and poses a potential challenge for applying our models to independently generated datasets. We therefore highlight the need for future work to test the robustness of our models across laboratories, which will be essential to determine the true generalizability of our approach. We have amended the Discussion accordingly.

      • The authors should dampen claims throughout. For example, in the abstract they state, "by combining expert annotations with advanced image analysis". The image analysis pipelines use common approaches.

      Response: We thank the reviewer for this comment. We agree that the individual image analysis steps we used, such as morphometric feature extraction, are based on well-established algorithms. By referring to “advanced image analysis,” we intended to highlight not the novelty of each single algorithm, but rather the way in which we systematically combined a large number of quantitative parameters and leveraged them through machine learning models to generate predictive insights into organoid development.

      • The authors state: "the presence of RPE and lenses were disagreed upon by the two independently annotating experts in a considerable fraction of organoids (3.9 % for RPE, 2.9% for lenses).", but it is unclear why there were two independently annotating experts. The supplements say images were split between nine experts for annotation.

      Response: We thank the reviewer for pointing out this ambiguity. To clarify, the ground truth definition at the final time point was established by two experts who annotated all organoids. These two annotators were part of the larger group of six experts who contributed to the earlier human expert annotation tasks. Thus, while six experts provided annotations for subsets of images during the expert prediction experiments, the final annotation for every single organoid at its last time frame was consistently performed by the same two experts to ensure a uniform ground truth. We have amended this in the revised manuscript to make this distinction clear.

      • Details on the image analysis pipeline would be helpful to clarify. For example, why did they choose to measure these 165 morphology features? Which descriptors were used to quantify blur? Did the authors apply blur metrics per FOV or per segmented organoid?

      Response: We thank the reviewer for this comment. To clarify, we extracted 165 morphometric features per segmented organoid, combining standard scikit-image region properties with custom implementations (e.g., blur quantified as the variance of the Laplace filter response within the organoid mask). All metrics, including blur, were calculated per segmented organoid rather than per full field of view. This broad feature space was deliberately chosen to capture size, shape, and intensity distributions in a comprehensive and unbiased manner. A more detailed description of the preprocessing steps, the full feature list, and the exact code implementations are now provided in the Methods section (“Large-scale time-lapse Image analysis”) of the revised version of the manuscript as well as in the source code GitHub repository.
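
      The blur descriptor mentioned above can be illustrated with a short NumPy sketch. It is an assumption-laden example rather than the authors' implementation: the 4-neighbour Laplace kernel, the edge-replicating padding, and the function names are illustrative choices. The `(regionmask, intensity_image)` signature is the form that scikit-image accepts for callables passed through the `extra_properties` argument of `regionprops_table`, so a function of this shape could be used as a per-organoid custom property.

```python
import numpy as np


def laplace_response(image):
    """4-neighbour discrete Laplacian with edge-replicating padding."""
    img = np.asarray(image, dtype=float)
    p = np.pad(img, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * img


def blur_laplace_variance(regionmask, intensity_image):
    """Variance of the Laplace response inside the organoid mask.

    Lower values indicate fewer sharp intensity transitions, i.e. a blurrier region.
    """
    response = laplace_response(intensity_image)
    return float(np.var(response[np.asarray(regionmask, dtype=bool)]))
```

      Restricting the variance to the mask is what makes the metric a per-organoid rather than per-field-of-view quantity.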

      • The description of the number of images is confusing and distracts from the number of organoids. The number of organoids and number of timepoints used would provide a better description of the data with more value. For example, does this image count include all five z slices?

      Response: We thank the reviewer for this comment. The reported image count includes slice 3 only, which we based our models on. The five z-slices that we used to create the MAX- and SUM-intensity z-projections would increase this number 5-fold. While we agree that the number of organoids and time points are highly informative metrics and have provided these details in the manuscript, we also believe that reporting the image count is valuable, as it directly reflects the size of the dataset processed by our analysis pipelines. For this reason, we prefer to keep the current description.

      • The authors should consider applying a maximum projection across the five z slices (rather than the middle z) as this is a common procedure in image analysis. Why not analyze three-dimensional morphometrics or deep learning features? Might this improve performance further?

      Response: We thank the reviewer for this valuable suggestion. To address this point, we repeated all analyses using both sum- and maximum-intensity z-projections and have included the results as new Supplementary Figures S8-S10, S13/S14 for TOI emergence and new Supplementary Figures S19-S21, S24/S25 for TOI sizes (classifier benchmarking and hyperparameter tuning in new Supplementary Figures S5/S6 and S16/S17). These additional analyses did not reveal a noticeable improvement in performance, suggesting that projections incorporating all slices are not strictly necessary in our setting. An analysis that included all five z-slices separately for classification would indeed be of interest, but was not feasible within the scope of this study, as it would substantially increase the computational demands beyond the available resources and timeframe.
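
      For reference, the three input variants compared here (middle slice, maximum-intensity projection, sum-intensity projection) differ only in how the z-axis is collapsed. The sketch below assumes a stack of shape (slices, height, width); the function name is hypothetical.

```python
import numpy as np


def z_projections(stack):
    """Return the middle slice, maximum-intensity and sum-intensity projections.

    stack: array of shape (n_slices, height, width), e.g. five brightfield z-slices.
    """
    stack = np.asarray(stack)
    middle = stack[stack.shape[0] // 2]  # slice 3 of 5 (zero-based index 2)
    max_projection = stack.max(axis=0)
    # Sum in a wider dtype so that 8- or 16-bit images do not overflow
    sum_projection = stack.sum(axis=0, dtype=np.float64)
    return middle, max_projection, sum_projection
```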

      • There is a lot of manual annotation performed in this work, the authors could speculate how this could be streamlined for future studies. How does the approach presented enable streamlining?

      Response: We thank the reviewer for raising this important point. The current study relied on expert visual review, which is time-intensive, but our findings suggest several ways to streamline future work. For instance, model-assisted prelabeling could be used to automatically accept high-confidence cases while routing only uncertain cases to experts. Active sampling strategies, focusing expert review on boundary cases or rare classes, as well as programmatic checks from morphometrics (e.g., blur or contrast to flag low-quality frames), could further reduce effort. Consensus annotation could be reserved only for cases where the model and expert disagree or confidence is low. Finally, new experiments could be bootstrapped with a small seed set of annotated organoids for fine-tuning before switching to such a model-assisted workflow. These possibilities are enabled by our approach, where organoids are imaged individually, morphometrics provide automated quality indicators, and the CNN achieves reliable performance at early developmental stages, making model-in-the-loop annotation a feasible and efficient strategy for future studies. We have added a clarifying paragraph to the Discussion accordingly.
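
      The model-assisted prelabeling idea can be made concrete with a small routing rule. This is a hypothetical sketch: the 0.95 threshold and the function name are assumptions, and in practice the threshold would need to be calibrated against expert agreement.

```python
import numpy as np


def route_for_annotation(probabilities, accept_threshold=0.95):
    """Split predictions into auto-accepted labels and cases sent to an expert.

    probabilities: (n_samples, n_classes) class probabilities from any model.
    Returns the predicted class per sample and a boolean mask of auto-accepted samples.
    """
    probs = np.asarray(probabilities, dtype=float)
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    auto_accept = confidence >= accept_threshold
    return predicted, auto_accept
```

      Samples where `auto_accept` is False would be queued for expert review, which corresponds to routing only uncertain cases to annotators.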

      Reviewer #2 (Significance (Required)):

      Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field. The paper's advance is technical (providing new methods for organoid quality control) and conceptual (providing proof of concept that earlier time points contain information to predict specific future outcomes in retinal organoids)

      Place the work in the context of the existing literature (provide references, where appropriate).

      • The authors do a good job of placing their work in context in the introduction.
      • The work presents a simple image analysis pipeline (using only the middle z slice) to process timelapse organoid images. So not a 4D pipeline (time and space), just 3D (time). It is likely that more and more of these approaches will be developed over time, and this article is one of the early attempts.

      • The work uses standard convolutional neural networks.

      Response: We thank the reviewer for this assessment. We agree that our work represents one of the early attempts in this direction, applying a straightforward pipeline with standard convolutional neural networks, and we appreciate the reviewer’s acknowledgment of how the study has been placed in context within the Introduction.

      State what audience might be interested in and influenced by the reported findings.

      • Data scientists performing image-based profiling for time lapse imaging of organoids.

      • Retinal organoid biologists

      • Other organoid biologists who may have long growth times with indeterminate outcomes.

      Response: We thank the reviewer for outlining the relevant audiences. We agree that the reported findings will be of interest to data scientists working on image-based profiling, retinal organoid biologists, and more broadly to organoid researchers facing long culture times with uncertain developmental outcomes.

      Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.

      • Image-based profiling/morphometrics

      • Organoid image analysis

      • Computational biology

      • Cell biology

      • Data science/machine learning

      • Software

      This is a signed review:

      Gregory P. Way, PhD

      Erik Serrano

      Jenna Tomkinson

      Michael J. Lippincott

      Cameron Mattson

      Department of Biomedical Informatics, University of Colorado


      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary:

      This manuscript by Afting et al. addresses the challenge of heterogeneity in retinal organoid development by using deep learning to predict eventual tissue outcomes from early-stage images. The central hypothesis is that deep learning can forecast which tissues an organoid will form (specifically retinal pigmented epithelium, RPE, and lens) well before those tissues become visibly apparent. To test this, the authors assembled a large-scale time-lapse imaging dataset of ~1,000 retinal organoids (~100,000 images) with expert annotations of tissue outcomes. They characterized the variability in organoid morphology and tissue formation over time, focusing on two tissues: RPE (which requires induction) and lens (which appears spontaneously). The core finding is that a deep learning model can accurately predict the emergence and size of RPE and lens in individual organoids at very early developmental stages. Notably, a convolutional neural network (CNN) ensemble achieved high predictive performance (F1-scores ~0.85-0.9) hours before the tissues were visible, significantly outperforming human experts and classical image-analysis-based classifiers. This approach effectively bypasses the issue of stochastic developmental heterogeneity and defines an early "determination window" for fate decisions. Overall, the study demonstrates a proof-of-concept that artificial intelligence can forecast organoid differentiation outcomes non-invasively, which could revolutionize how organoid experiments are analyzed and interpreted.

      Recommendation:

      While this manuscript addresses an important and timely scientific question using innovative deep learning methodologies, it currently cannot be recommended for acceptance in its present form. The authors must thoroughly address several critical limitations highlighted in this report. In particular, significant issues remain regarding the generalizability of the predictive models across different experimental conditions, the interpretability of deep learning predictions, and the use of Euclidean distance metrics in high-dimensional morphometric spaces, potentially leading to distorted interpretations of organoid heterogeneity. These revisions are essential for validating the general applicability of their approach and enhancing biological interpretability. After thoroughly addressing these concerns, the manuscript may become suitable for future consideration.

      Response: We thank the reviewer for the thoughtful and constructive comments. In response, we expanded our analyses in several key ways. We clarified limitations regarding external datasets. Interpretability analyses were greatly extended across three CNN architectures and eight attribution methods (new Supplementary Figures S29-S37, new Supplementary Note 1), showing consistent but method-specific behaviors; as no reproducible biologically interpretable signals emerged, we now present these results descriptively and clearly state their limitations. We further demonstrated the flexibility of our framework by predicting morphometric clusters in addition to tissue outcomes (new Figure 4C), confirmed robustness of the morphometrics space using PCA and nearest-neighbor analyses (new Supplementary Figure S3), and added statistical tests confirming CNNs significantly outperform classical classifiers (Supplementary File 1). Finally, we made all code and raw data publicly available, clarified species context, and added forward-looking discussion on adaptive interventions. We believe these revisions now further improve the rigor and clarity of our work.

      Major Issues (with Suggestions):

      1. Generalization to Other Batches or Protocols: The drop in performance on independent validation experiments suggests the model may partially overfit to specific experimental conditions. A major concern is how well this approach would work on organoids from a different batch or produced by a slightly different differentiation protocol. Suggestion: The authors should clarify the extent of variability between their "independent experiment" and training data (e.g., were these done months apart, with different cell lines or minor protocol tweaks?). To strengthen confidence in the model's robustness, I recommend testing the trained model on one or more truly external datasets, if available (for instance, organoids generated in a separate lab or under a modified protocol). Even a modest analysis showing the model can be adapted (via transfer learning or re-training) to another dataset would be valuable. If new data cannot be added, the authors should explicitly discuss this limitation and perhaps propose strategies (like domain adaptation techniques or more robust training with diverse conditions) to handle batch effects in future applications.

      Response: We thank the reviewer for this important comment. We fully agree with the reviewer that this would be an amazing addition to the manuscript. Unfortunately, we are not able to obtain the requested external dataset. Although retinal organoid systems exist and are widely used across different species lines, to the best of our knowledge our laboratory is the only one currently raising retinal organoids from primary embryonic pluripotent stem cells of Oryzias latipes, and there is currently only one known (and published) differentiation protocol that allows the successful generation of these organoids. We note that our datasets were collected over the course of nine months, which already introduces variability across time and thus partially addresses concerns regarding batch effects. While we did not have access to truly external datasets (e.g., from other laboratories), we have clarified this limitation as suggested in the revised version of the manuscript and outlined strategies such as domain adaptation and training on more diverse conditions as promising future directions to improve robustness.

      Biological Interpretation of Early Predictive Features: The study currently concludes that the CNN picks up on complex, non-intuitive features that neither human experts nor conventional analysis could identify. However, from a biological perspective, it would be highly insightful to know what these features are (e.g., subtle texture, cell distribution patterns, etc.). Suggestion: I encourage the authors to delve deeper into interpretability. They might try complementary explainability techniques (for example, occlusion tests where parts of the image are masked to see if predictions change, or activation visualization to see what patterns neurons detect) beyond GradientSHAP. Additionally, analyzing false predictions might provide clues: if the model is confident but wrong for certain organoids, what visual traits did those have? If possible, correlating the model's prediction confidence with measured morphometrics or known markers (if any early marker data exist) could hint at what the network sees. Even if definitive features remain unidentified, providing the reader with any hypothesis (for instance, "the network may be sensing a subtle rim of pigmentation or differences in tissue opacity") would add value. This would connect the AI predictions back to biology more strongly.

      Response: We thank the reviewer for this thoughtful suggestion. We agree that linking CNN predictions to specific biological features would be highly valuable. In response, we expanded our interpretability analyses beyond GradientSHAP to a broad set of attribution methods and quantified their behavior across models and timepoints (new Supplementary Figures S29-S37, new Supplementary Note 1). While some methods (e.g., Integrated Gradients, DeepLiftSHAP) occasionally highlighted visible tissue regions, others produced diffuse or shifting relevance, and overall overlap was low. Therefore, our results did not yield reproducible, interpretable biological signals.

      Given these results, we have refrained from speculating about specific early image features and now present the interpretability analyses descriptively. We agree that future studies integrating imaging with molecular markers will be required to directly link early predictive cues to defined biological processes.

      Expansion to Other Outcomes or Multi-Outcome Prediction: The focus on RPE and lens is well-justified, but these are two outcomes within retinal organoids. A major question is whether the approach could be extended to predict other cell types or structures (e.g., presence of certain retinal neurons, or malformations) or even multiple outcomes at once. Suggestion: The authors should discuss the generality of their approach. Could the same pipeline be trained to predict, say, photoreceptor layer formation or other features if annotated? Are there limitations (like needing binary outcomes vs. multi-class)? Even if outside the scope of this study, a brief discussion would reassure readers that the method is not intrinsically limited to these two tissues. If data were available, it would be interesting to see a multi-label classification (predict both RPE and lens presence simultaneously) or an extension to other organoid systems in future. Including such commentary would highlight the broad applicability of this platform.

      Response: We thank the reviewer for this helpful and important suggestion. While our study focused on RPE and lens as the most readily accessible tissues of interest in retinal organoids, our new analyses demonstrate that the pipeline is not limited to these outcomes. In addition to tissue-specific predictions, we trained both a convolutional neural network (on image data) and a decision tree classifier (on morphometrics features) to predict more abstract morphological clusters defined at the final timepoint using the morphometrics features, showing that both approaches could successfully capture non-tissue features from early frames (new Figure 4C). This illustrates that the framework can be extended beyond binary tissue outcomes to multi-class problems, and predict relevant outcomes like the overall organoid morphology. Given appropriate annotations, the framework could in principle be trained to detect additional structures such as photoreceptor layers or malformations. Furthermore, the CNN architecture we employed and the morphometrics feature space are compatible with multi-label classification, meaning simultaneous prediction of several outcomes would also be feasible. We have clarified this point in the discussion to highlight the methodological flexibility and potential generality of our approach and are excited to share this very interesting, additional model with the readership.

      Curse of high dimensionality: Using Euclidean distance in a 165-dimensional morphometric space likely suffers from the curse of dimensionality, which diminishes the meaning of distances as dimensionality increases. In such high-dimensional settings, the range of pairwise distances tends to collapse, undermining the ability to discern meaningful intra- vs. inter-organoid differences. Suggestion: To address this, I would encourage the authors to apply principal component analysis (PCA) in place of (or prior to) tSNE. PCA would reduce the data to a few dominant axes of variation that capture most of the morphometric variance, directly revealing which features drive differences between organoids. These principal components are linear combinations of the original 165 parameters, so one can examine their loadings to identify which morphometric traits carry the most information - yielding interpretable axes of biological variation (e.g., organoid size, shape complexity, etc.). In addition, I would like to mention an important cautionary remark regarding tSNE embeddings. tSNE does not preserve global geometry of the data. Distances and cluster separations in a tSNE map are therefore not faithful to the original high-dimensional distances and should be interpreted with caution. See Chari T, Pachter L (2023), The specious art of single-cell genomics, PLoS Comput Biol 19(8): e1011288, for an enlightening discussion in the context of single cell genomics. The authors have shown that extreme dimensionality reduction to 2D can introduce significant distortions in the data's structure, meaning the apparent proximity or separation of points in a tSNE plot may be an artifact of the algorithm rather than a true reflection of morphometric similarity. Implementing PCA would mitigate high-dimensional distance issues by focusing on the most informative dimensions, while also providing clear, quantitative axes that summarize organoid heterogeneity. 
This change would strengthen the analysis by making the results more robust (avoiding distance artifacts) and biologically interpretable, as each principal component can be traced back to specific morphometric features of interest.

      Response: We thank the reviewer for raising this point. Indeed, high dimensionality and dimensionality reduction can lead to false interpretations. We approached this issue as follows: First, we calculated the same tSNE projections and distances using the first 20 PCs and supplied these data as the new Figure 2 and new Supplementary Figure 2. While the scale of the data shifted slightly, there were no differences in the data distribution that would contradict our prior conclusions.

      In order to confirm the findings and further emphasize the validity of our dimensionality reduction, we calculated the overlap between the 30 nearest neighbors of each data point in raw data space (or PCA space) and its 30 nearest neighbors in reduced space (tSNE or UMAP; we included UMAP to show that the effect is not specific to tSNE projections and also holds for a dimensionality reduction that is better known for preserving global structure). As shown in the new Supplementary Figure S3 (A-D), the high Jaccard index confirmed that our projections accurately reflect the data’s structure obtained from raw distance measurements. Moreover, the Jaccard index generally increased over time, which is best explained by a stronger morphological similarity of organoids at timepoint 0, reflected by the dense point cloud in the tSNE projections at that timepoint. The described effects were independent of whether the data were derived from 20 PCs or from all 165 dimensions.
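
      The neighborhood comparison described above can be illustrated with a minimal sketch. It assumes Euclidean distances and a brute-force neighbor search, and the function names (knn_indices, mean_knn_jaccard) are hypothetical; it is not the code used for Supplementary Figure S3, only one plausible way to compute such a score.

```python
import math


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbours.

    Brute-force Euclidean search; ties are broken by index order.
    """
    if k < 1 or k >= len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def mean_knn_jaccard(high_dim, low_dim, k=30):
    """Mean Jaccard index between k-NN sets in the original and reduced space.

    high_dim and low_dim must list the same samples in the same order.
    A value of 1.0 means every neighbourhood is preserved exactly.
    """
    if len(high_dim) != len(low_dim):
        raise ValueError("both embeddings must contain the same samples")
    nn_high = knn_indices(high_dim, k)
    nn_low = knn_indices(low_dim, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(nn_high, nn_low)]
    return sum(scores) / len(scores)
```

      Here high_dim would hold the morphometric (or PCA) coordinates and low_dim the tSNE or UMAP coordinates of the same samples; computing the score per timepoint would give a curve comparable to the trend described above.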

      We next wanted to confirm the conclusion that data points obtained from the same organoid at later timepoints were more closely related to each other than to data points from different organoids. We therefore identified the 30 nearest neighbor data points of each data point, showing that at later timepoints these 30 nearest neighbors were almost all attributable to the same organoid (new Supplementary Figure S3 E/F). The only exceptions were experiments that lacked intermediate timepoints (E007 and E002), which misaligned the organoids in the reduced space and confounded the nearest neighbor analysis.

      We have included the respective new Figures and new Supplementary Figures and linked them in the main manuscript.

      Statistical Reporting and Significance: The manuscript focuses on F1-score as the metric to report accuracy over time, which is appropriate. However, it's not explicitly stated whether any statistical significance tests were performed on the differences between methods (e.g., CNN vs human, CNN vs classical ML). Suggestion: The authors could report statistical significance of the performance differences, perhaps using a permutation test or McNemar's test on predictions. For example, is the improvement of the CNN ensemble over the Random Forest/QDA classifier statistically significant across experiments? Given the n of organoids, this should be assessable. Demonstrating significance would add rigor to the analysis.

      Response: We thank the reviewer for this helpful suggestion. Following the recommendation, we quantified per-experiment differences in predictive performance by calculating the area under the F1-score curves (AUC) for each classifier and experiment. We then compared methods using paired Wilcoxon signed-rank tests across experiments, with Holm-Bonferroni correction for multiple comparisons. This analysis confirmed that the CNN consistently and significantly outperformed the baseline models and classical machine learning classifiers in validation and test organoids. For RPE area and lens size in test organoids, the CNNs performed notably, but not significantly, better than the machine learning classifiers. In summary, these tests add the requested statistical rigor to our findings. The results are now provided in the Supplementary Material as Supplementary File 1.
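
      Two steps of the procedure described above, the area under an F1-score curve and the Holm-Bonferroni adjustment, can be sketched as follows. This is a minimal illustration under stated assumptions: the AUC uses the trapezoidal rule, the function names (f1_curve_auc, holm_bonferroni) are hypothetical, and the paired signed-rank test that would produce the raw p-values is not shown. It is not the analysis code behind Supplementary File 1.

```python
def f1_curve_auc(timepoints, f1_scores):
    """Area under an F1-score-versus-time curve (trapezoidal rule).

    timepoints must be sorted in increasing order and match f1_scores
    in length.
    """
    if len(timepoints) != len(f1_scores):
        raise ValueError("timepoints and f1_scores must have equal length")
    pairs = list(zip(timepoints, f1_scores))
    return sum(
        (t1 - t0) * (y0 + y1) / 2.0
        for (t0, y0), (t1, y1) in zip(pairs, pairs[1:])
    )


def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order.

    The i-th smallest raw p-value (i = 0, 1, ...) is multiplied by (m - i),
    capped at 1, and the adjusted values are forced to be non-decreasing
    along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

      In such a workflow, one AUC per classifier and experiment would feed the paired test, and the resulting raw p-values (one per comparison) would be passed to holm_bonferroni before being compared against the significance threshold.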

      Minor Issues (with Suggestions):

      1. Data Availability: Given the resource-intensive nature of the work, the value to the community will be highest if the data is made publicly available. I understand that this is of course at the behest of the authors and they do mention that they will make the data available upon publication of the manuscript. For the time being, the authors can consider sharing at least a representative subset of the data or the trained model weights. This will allow others to build on their work and test the method in other contexts, amplifying the impact of the study.

      Response: We have now made the repository and raw data public and apologize for this oversight. The link to the GitHub repository is now provided in the manuscript under “Data availability”, while the links to the datasets are contained within the GitHub repository.

      Discussion - Future Directions: The Discussion does a good job of highlighting applications (like guiding molecular analysis). One minor addition could be speculation on using this approach to actively intervene: for example, could one imagine altering culture conditions mid-course for organoids predicted not to form RPE, to see if their fate can be changed? The authors touch on reducing variability by focusing on the window of determination; extending that thought to an experimental test (though not done here) would inspire readers. This is entirely optional, but a sentence or two envisioning how predictive models enable dynamic experimental designs (not just passive prediction) would be a forward-looking note to end on.

      Response: We thank the reviewer for this constructive suggestion. We have expanded the discussion to briefly address how predictive modeling could go beyond passive observation. Specifically, we now discuss that predictive models may enable dynamic interventions, such as altering culture conditions mid-course for organoids predicted not to form RPE, to test whether their developmental trajectory can be redirected. While outside the scope of the present work, this forward-looking perspective emphasizes how predictive modeling could inspire adaptive experimental strategies in future studies.

      I believe with the above clarifications and enhancements - especially regarding generalizability and interpretability - the paper will be suitable for broad readership. The work represents an exciting intersection of developmental biology and AI, and I commend the authors for this contribution.

      Response: We thank the reviewer for the positive assessment and their encouraging remarks regarding the contribution of our work to these fields.

      Novelty and Impact:

      This work fills an important gap in organoid biology and imaging. Previous studies have used deep learning to link imaging with molecular profiles or spatial patterns in organoids, but there remained a "notable gap" in predicting whether and to what extent specific tissues will form in organoids. The authors' approach is novel in applying deep learning to prospectively predict organoid tissue outcomes (RPE and lens) on a per-organoid basis, something not previously demonstrated in retinal organoids. Conceptually, this is a significant advance: it shows that fate decisions in a complex 3D culture model can be predicted well in advance, suggesting the existence of subtle early morphogenetic cues that only a sophisticated model can discern. The findings will be of broad interest to researchers in organoid technology, developmental biology, and biomedical AI.

      Response: We thank the reviewer for this thoughtful and encouraging assessment. We agree that our study addresses an important gap by prospectively predicting tissue outcomes at the single-organoid level, and we appreciate the recognition that this represents a conceptual advance with relevance not only for retinal organoids but also for broader applications in organoid biology, developmental biology, and biomedical AI.

      Methodological Rigor and Technical Quality:

      The study is methodologically solid and carefully executed. The authors gathered a uniquely large dataset under consistent conditions, which lends statistical power to their analyses. They employ rigorous controls: an expert panel provided human predictions as a baseline, and a classical machine learning pipeline using quantitative image-derived features was implemented for comparison. The deep learning approach is well-chosen and technically sound. They use an ensemble of CNN architectures (DenseNet121, ResNet50, and MobileNetV3) pre-trained on large image databases, fine-tuning them on organoid images. The use of image segmentation (DeepLabV3) to isolate the organoid from background is appropriate to ensure the models focus on the relevant morphology. Model training procedures (data augmentation, cross-entropy loss with class balancing, learning rate scheduling, and cross-validation) are thorough and follow best practices. The evaluation metrics (primarily F1-score) are suitable for the imbalanced outcomes and emphasize prediction accuracy in a biologically relevant way. Importantly, the authors separate training, test, and validation sets in a meaningful manner: images of each organoid are grouped to avoid information leakage, and an independent experiment serves as a validation to test generalization. The observation that performance is slightly lower on independent validation experiments underscores both the realism of their evaluation and the inherent heterogeneity between experimental batches. In addition, the study integrates interpretability (using GradientSHAP-based relevance backpropagation) to probe what image features the network uses. Although the relevance maps did not reveal obvious human-interpretable features, the attempt reflects a commendable thoroughness in analysis. Overall, the experimental design, data analysis, and reporting are of high quality, supporting the credibility of the conclusions.

      Response: We thank the reviewer for their very positive and detailed assessment. We appreciate the recognition of our efforts to ensure methodological rigor and reproducibility, and we agree that interpretability remains an important but challenging area for future work.

      Reviewer #3 (Significance (Required)):

      Scientific Significance and Conceptual Advances:

      Biologically, the ability to predict organoid outcomes early is quite significant. It means researchers can potentially identify when and which organoids will form a given tissue, allowing them to harvest samples at the right moment for molecular assays or to exclude organoids that will not form the desired structure. The manuscript's results indicate that RPE and lens fate decisions in retinal organoids are made much earlier than visible differentiation, with predictive signals detectable as early as ~11 hours for RPE and ~4-5 hours for lens. This suggests a surprising synchronization or early commitment in organoid development that was not previously appreciated. The authors' introduction of deep learning-derived determination windows refines the concept of a developmental "point of no return" for cell fate in organoids. Focusing on these windows could help in pinpointing the molecular triggers of these fate decisions. Another conceptual advance is demonstrating that non-invasive imaging data can serve a predictive role akin to (or better than) destructive molecular assays. The study highlights that classical morphology metrics and even expert eyes capture mainly recognition of emerging tissues, whereas the CNN detects subtler, non-intuitive features predictive of future development. This underlines the power of deep learning to uncover complex phenotypic patterns that elude human analysis, a concept that could be extended to other organoid systems and developmental biology contexts. In sum, the work not only provides a tool for prediction but also contributes conceptual insights into the timing of cell fate determination in organoids.

      Response: We thank the reviewer for this thoughtful and positive assessment. We agree that the determination windows provide a valuable framework to study early fate decisions in organoids, and we have emphasized this point in the discussion to highlight the biological significance of our findings.

      Strengths:

      The combination of high-resolution time-lapse imaging with advanced deep learning is innovative. The authors effectively leverage AI to solve a biological uncertainty problem, moving beyond qualitative observations to quantitative predictions. The study uses a remarkably large dataset (1,000 organoids, >100k images), which is a strength as it captures variability and provides robust training data. This scale lends confidence that the model isn't overfit to a small sample. By comparing deep learning with classical machine learning and human predictions, the authors provide context for the model's performance. The CNN ensemble consistently outperforms both the classical algorithms and human experts, highlighting the value added by the new method. The deep learning model achieves high accuracy (F1 > 0.85) at impressively early time points. The fact that it can predict lens formation just ~4.5 hours into development with confidence is striking. Performance remained strong and exceeded human capability at all assessed times. Key experimental and analytical steps (segmentation, cross-validation between experiments, model calibration, use of appropriate metrics) are executed carefully. The manuscript is transparent about training procedures and even provides source code references, enhancing reproducibility. The manuscript is generally well-written with a logical flow from the problem (organoid heterogeneity) to the solution (predictive modeling) and clear figures referenced.

      Response: We thank the reviewer for this very positive and encouraging assessment of our study, particularly regarding the scale of our dataset, the methodological rigor, and the reproducibility of our approach.

      Weaknesses and Limitations:

      Generalizability Across Batches/Conditions: One limitation is the variability in model performance on organoids from independent experiments. The CNN did slightly worse on a validation set from a separate experiment, indicating that differences in the experimental batch (e.g., slight protocol or environmental variations) can affect accuracy. This raises the question of how well the model would generalize to organoids generated under different protocols or by other labs. While the authors do employ an experiment-wise cross-validation, true external validation (on a totally independent dataset or a different organoid system) would further strengthen the claim of general applicability.

      Response: We thank the reviewer for this important point. We agree that generalizability across batches and experimental conditions is a key consideration. We have carefully revised the discussion to explicitly address this limitation and to highlight the variability observed between independent experiments.

      Interpretability of the Predictions: Despite using relevance backpropagation, the authors were unable to pinpoint clear human-interpretable image features that drive the predictions. In other words, the deep learning model remains somewhat of a "black box" in terms of what subtle cues it uses at early time points. This limits the biological insight that can be directly extracted regarding early morphological indicators of RPE or lens fate. It would be ideal if the study could highlight specific morphological differences (even if minor) correlated with fate outcomes, but currently those remain elusive.

      Response: We thank the reviewer for raising this important point. Indeed, while our models achieved robust predictive performance, the underlying morphological cues remained difficult to interpret using relevance backpropagation. We believe this limitation reflects both the subtlety of the early predictive signals and the complexity of the features captured by deep learning models, which may not correspond to human-intuitive descriptors. We have clarified this limitation in the Discussion and Supplementary Note 1 and emphasize that further methodological advances in interpretability, or integration with complementary molecular readouts, will be essential to uncover the precise morphological correlates of fate determination.

      Scope of Outcomes: The study focuses on two particular tissues (RPE and lens) as the outcomes of interest. These were well-chosen as examples (one induced, one spontaneous), but they do not encompass the full range of retinal organoid fates (e.g., neural retina layers). It's not a flaw per se, but it means the platform as presented is specialized. The method might need adaptation to predict more complex or multiple tissue outcomes simultaneously.

      Response: We agree with the reviewer that our study focuses on two specific tissues, RPE and lens, which served as proof-of-concept outcomes representing both induced and spontaneous differentiation events. While this scope is necessarily limited, we believe it demonstrates the general feasibility of our approach. We have clarified in the Discussion that the same framework could, in principle, be extended to additional retinal fates such as neural retina layers, or even to multi-label prediction tasks, provided appropriate annotations are available. Such an extension will be an important next step to broaden the applicability of our platform. We also now provide additional experiments showing that even abstract morphological classes can be predicted well.

      Requirement of Large Data and Annotations: Practically, the approach required a very large imaging dataset and extensive manual annotation; each organoid's RPE and lens outcome, plus manual masking for training the segmentation model. This is a substantial effort that may be challenging to reproduce widely. The authors suggest that perhaps ~500 organoids might suffice to achieve similar results, but the data requirement is still high. Smaller labs or studies with fewer organoids might not immediately reap the full benefits of this approach without access to such imaging throughput.

      Response: We thank the reviewer for highlighting this important point. We agree that the generation of a large imaging dataset and the associated annotations represent a substantial investment of time and resources. At the same time, we consider this effort highly relevant, as it reflects the intrinsic heterogeneity of organoid systems rather than technical artifacts, and therefore ensures robust model training. We have clarified this limitation in the discussion. While our full dataset included ~1,000 organoids, our downsampling analysis suggests that as few as ~500 organoids may already be sufficient to reproduce the key findings, which we believe makes the approach feasible for many organoid systems (compare new Supplementary Note 1). Moreover, as we outline in the Discussion, future refinements such as combining image- and tabular-based features or incorporating fluorescence data could further enhance predictive power and reduce annotation effort.

      Medaka Fish vs. Other Systems: The retinal organoids in this study appear to be from medaka fish, whereas much organoid research uses human iPSC-derived organoids. It's not fully clear in the manuscript as to how the findings translate to mammalian or human organoids. If there are species-specific differences, the applicability to human retinal organoids (which are important for disease modeling) might need discussion. This is a minor point if the biology is conserved, but worth noting as a potential limitation.

      Response: We thank the reviewer for pointing out this important consideration. We have now explicitly clarified in the Discussion that our proof-of-concept study was performed in medaka organoids, which offer high reproducibility and rapid development. While species-specific differences may exist, the predictive framework is not inherently restricted to medaka and should, in principle, be transferable to mammalian or human iPSC/ESC-derived organoids, provided sufficiently annotated datasets are available. We have amended the Discussion accordingly.

      Predicting Tissue Size is Harder: The model's accuracy in predicting how much tissue (relative area) an organoid will form, while good, is notably lower than for simply predicting presence/absence. Final F1 scores for size classes (~0.7) indicate moderate success. This implies that quantitatively predicting organoid phenotypic severity or extent is more challenging, perhaps due to more continuous variation in size. The authors do acknowledge the lower accuracy for size and treat it carefully.

      Response: We thank the reviewer for this observation and agree with their interpretation. We have already acknowledged in the manuscript that predicting tissue size is more challenging than predicting tissue presence/absence, and we believe we have treated these results with appropriate caution in the revised version of the manuscript.

      Latency vs. Determination: While the authors narrow down the time window of fate determination, it remains somewhat unclear whether the times at which the model reaches high confidence truly correspond to the biological "decision point" or are just the earliest detection of its consequences. The manuscript discusses this caveat, but it's an inherent limitation that the predictive time point might lag the actual internal commitment event. Further work might be needed to link these predictions to molecular events of commitment.

      Response: We agree with the reviewer. As noted in the Discussion, the time points identified by our models likely reflect the earliest detectable morphological consequences of fate determination, rather than the exact molecular commitment events themselves. Establishing a direct link between predictive signals and underlying molecular mechanisms will require future experimental work.

    1. And crawled head downward down a blackened wall And upside down in air were towers Tolling reminiscent bells, that kept the hours And voices singing out of empty cisterns and exhausted wells

      Last year, Addie annotated this exact section and described how Eliot purposefully confuses the reader's sense of right-side-up and upside-down. In an especially insightful section of analysis she claims that if the reader were to orient herself with respect to Dracula (who "crawled head downward down a blackened wall"), the tower down which he crawls becomes inverted - and the corresponding Tarot Card, the Dark Tower, is similarly flipped. Nested in this idea is a broader understanding: that in the chaos and turbulence of the modern world, the only form of agency we truly have is our perspective. When Dracula is flipped upside down, the world appears to him inverted; and though in fact it remains exactly the same as it always was, in his mind's eye all has been reoriented. That's precisely Eliot's point. Though the world itself may be a wasteland, there exists a copy of this world - a world of shadows, of impressions, of perspectives and opinions - which is completely up to interpretation. I think he invokes Tarot as a way of imbuing this doppelganger realm with purpose and value: Tarot is all about perspective. Your interpretation of the card, and what it tells you about your life in this theoretical duplicate of reality, informs the way you act in the real physical world - and so perhaps our agency, though constrained to our own perspectives, is more powerful than we think. The following two lines are relevant insofar as they condense several central thematic discussions: the voices, time, familiarity and remembrance, and water. All of these strands weave together a picture of reality IN FACT: that is, a world in which people are consigned to make the same mistakes over and over, a world where several voices overlap but never really hear one another, a world analogous to a dry rock. 
I think Eliot piles up all these images to drive home the fact that though our perspectives may change (though the Dark Tower may become inverted, or vice versa), objective reality is constant. In this way he DOES put a pessimistic constraint on the extent to which our conception of life can actually influence the events occurring around us; but nevertheless I do think there are some shards of positivity embedded in there.

    2. But dry sterile thunder without rain

      This line stood out to me due to its connection with the title "What the Thunder Said" and its similar connotation to the Gospel of John. The line appears after a somewhat odd, repeated insistence on the lack of water in the land, which leaves the speaker in a desolate landscape of "only rock." One might think that this barren image would also prompt a stillness or silence in nature. However, Eliot is quick to point out the presence of loud booms of thunder in my highlighted line. In particular, the thunder is "dry" and "sterile," which connects it to the state of the land: the rocky terrain is also dry due to the emphasized absence of water, and sterile as a result. In the Gospel of John (verse 29), thunder holds a contrasting purpose: "The people therefore, that stood by, and heard it, said that it thundered: others said, An angel spake to him." The voice of God in John is thus expressed through thunder, showing the great force of divinity over the world. Eliot's descriptions of the thunder in his wasteland could not be more different. The thunder is "dry" and "sterile" and, in my opinion, lacks the religious importance evident in John. In connection with the title (What the Thunder Said), my reading of this line suggests that Eliot does not believe the thunder is saying anything. Instead, we are trapped in a dry and sterile land mass with no divine connection to guide us out.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2024-02830

      Corresponding author(s): Julien, Sage

      1. General Statements

      We thank the Reviewers for a fair review of our work and helpful suggestions. We have significantly revised the manuscript in response to these suggestions. We provide a point-by-point response to the Reviewers below but wanted to highlight in our response a recurring concern related to the strong cell cycle arrest observed upon the acute FAM53C knock-down being different than the limited phenotypes in other contexts, including the knockout mice and DepMap data.

      First, we now show that we can recapitulate the strong G1 arrest resulting from the FAM53C knock-down using two independent siRNAs in RPE-1 cells, supporting the specificity of the effects.

      Second, the G1 arrest that results from the FAM53C knock-down is also observed in cells with inactive p53, suggesting it is not due to a non-specific stress response due to “toxic” siRNAs. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype.

      Third, we have performed experiments in other human cells, including cancer cell lines. As would be expected for cancer cells, the G1 arrest is less pronounced but is still significant, indicating that the G1 arrest is not unique to RPE-1 cells.

      Fourth, it is not unexpected that compensatory mechanisms would be activated upon loss of FAM53C during development or in cancer – which may explain the lack of phenotypes in vivo or upon long-term knockout. This has been true for many cell cycle regulators, either because of compensation by other family members that have overlapping functions, or by a larger scale rewiring of signaling pathways.

      2. Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary:

      Taylar Hammond and colleagues identified new regulators of the G1/S transition of the cell cycle. They did so by screening publicly available data from the Cancer Dependency Map, and identified FAM53C as a positive regulator of the G1/S transition. Using biochemical assays they then show that FAM53C interacts with the DYRK1A kinase to inhibit its function. DYRK1A in turn is known to induce degradation of cyclin D, leading the authors to propose a model in which DYRK1A-dependent cyclin D degradation is inhibited by FAM53C to permit S-phase entry. Finally the authors assess the effect of FAM53C deletion in a cortical organoid model, and in Fam53c knockout mice. Whereas proliferation of the organoids is indeed inhibited, mice show virtually no phenotype.

      Major comments:

      The authors show convincing evidence that FAM53C loss can reduce S-phase entry in cell cultures, and that it can bind to DYRK1A. However, FAM53C has multiple other binding partners and I am not entirely convinced that negative regulation of DYRK1A is the predominant mechanism to explain its effects on S-phase entry. Some of the claims that are made based on the biochemical assays, and on the physiological effects of FAM53C, are overstated. In addition, some choices made in methodology and data representation need further attention.

      1. The authors do note that p21 levels increase upon FAM53C knockdown. They show convincing evidence that this is not a p53-dependent response. But the claim that "p21 upregulation alone cannot explain the G1 arrest in FAM53C-deficient cells" (lines 138-139) is misleading. A p53-independent p21 response could still be highly relevant. The authors could test if FAM53C knockdown inhibits proliferation after p21 knockdown or p21 deletion in RPE1 cells.

      The Reviewer raises a great point. Our initial statement needed to be clarified and also needed more experimental support. We have performed experiments where we knocked down FAM53C and p21 individually, as well as in combination, in RPE-1 cells. These experiments show that p21 knock-down is not sufficient to negate the cell cycle arrest resulting from the FAM53C knock-down in RPE-1 cells (Figure 4B,C and Figure S4C,D).

      We have now extended these experiments to conditions where we inhibited DYRK1A, and we also compared these data to experiments in p53-null RPE-1 cells. Altogether, these experiments point to activation of p53 downstream of DYRK1A activation upon FAM53C knock-down, and indicate that p21 is not the only critical p53 target in the cell cycle arrest observed in FAM53C knock-down cells (Figure 4 and Figure S4).

      The authors do not convincingly show that FAM53C acts as a DYRK1A inhibitor in cells. Figures 4B+C and S4B+C show extremely faint P-CycD1 bands, and tiny differences in ratios. The P values are hovering around 0.05, so n=3 is clearly underpowered here. Total CycD1 levels also correlate with FAM53C levels, which seems to affect the ratios more than the tiny pCycD1 bands. Why is there still a pCycD1 band visible in 4B in the GFP + BTZ + DYRK1Ai condition? And if I look at the data points I honestly don't understand how the authors can conclude from S4C that FAM53C knockdown causes (DYRK1A-dependent) increases in pCycD1 (relative to total CycD1). In figure 5C, no blot scans are even shown, and again the differences look tiny. So the authors should either find a way to make these assays more robust, or alter their claims appropriately.

      We appreciate these comments from the Reviewer and have significantly revised the manuscript to address them.

      The analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knock-down, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We removed previous panel 4B from the revised manuscript. For panels 4E and S4B (now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      The representative Western blot images for 5C-D (now 5F-G) in the original submission are shown in Figure 5E; we apologize if this was not clear. The differences are small, which we acknowledge in the revised manuscript. Note that several factors can affect Cyclin D levels in cells, including the growth rate and the stage of the cell cycle. Our FACS analysis shows that normal organoids have ~63% of cells in G1 and ~13% in S phase; the overall lower proportion of S-phase cells in organoids may make the immunoblot difference appear smaller, with fewer cycling cells resulting in decreased Cyclin D phosphorylation.

      Nevertheless, the Reviewer brings up a good point, and comments from this Reviewer and the others made us re-think how to best interpret our results. As discussed above, we carefully re-read the Meyer paper and think that FAM53C’s role and DYRK1A activity in cells may be understood when considering levels of both CycD and p21 at the same time in a continuum. While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is likely that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      The experiments to test if DYRK1A inhibition could rescue the G1 arrest observed upon FAM53C knockdown are not entirely convincing either. It would be much more convincing if they also perform cell counting experiments as they have done in Figures 1F and 1G, to complement the flow cytometry assays. I suggest that the authors do these cell counting experiments in RPE1 +/- P53 cells as well as HCT116 cells. In addition, did the authors test if P21 is induced by DYRK1Ai in HCT116 cells?

      We repeated the experiments with the DYRK1A inhibitor and counted the cells. In p53-null RPE-1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide. Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells.

      The data in Figure 5C and 5D are identical, although they are supposed to represent either pCycD1 ratios or p21 levels. This is a problem because at least one of the two cannot be true. Please provide the proper data and show (representative) images of both data types.

      We apologize for these duplicated panels in the original submission. We have now replaced the wrong panel with the correct data (Fig. 5F,G).

      Line 246: "Fam53c knockout mice display developmental and behavioral defects." I don't agree with this claim. The mutant mice are born at almost the expected Mendelian ratios, and the body weight development is not consistently altered. But more importantly, no differences in adult survival or microscopic pathology were seen. The authors put strong emphasis on the IMPC behavioral analysis, but they should be more cautious. The IMPC mouse cohorts are tested for many other phenotypes related to behavior and neurological symptoms and apparently none of these other traits were changed in the IMPC Fam53c-/- cohort. Thus, the decreased exploration in a new environment could very well be a chance finding. The authors need to take away claims about developmental and behavioral defects from the abstract, results and discussion sections; the data are just too weak to justify this.

      We agree with the Reviewer that, although we observed significant p-values, this original statement may not be appropriate in the biological sense. We made sure in the revised manuscript to carefully present these data.

      Minor comments:

      Can the authors provide a rationale for each of the proteins they chose to generate the list of the 38 proteins in the DepMap analysis? I looked at the list and it seems to me that they do not all have described functions in the G1/S transition. The analysis may thus be biased.

      To address this point, we updated Table S1 (2nd tab) to provide a better rationale for the 38 factors chosen. Our focus was on the canonical RB pathway and we included RB binding proteins whose function had suggested they may also be playing a role in the G1/S transition. We do agree that there is some bias in this selection (e.g., there are more RB binding factors described) but we hope the Reviewer will agree with us that this list and the subsequent analysis identified expected factors, including FAM53C. Future studies using this approach and others will certainly identify new regulators of cell cycle progression.

      Figure 1B is confusing to me. Are these just some (arbitrarily) chosen examples? Consider leaving this heatmap out altogether, or explain in more detail.

      We agree with the Reviewer that this panel was not necessarily useful and possibly in the wrong place, and we removed it from the manuscript. We replaced it with a cartoon of top hits in the screen.

      The y-axes in Figures 2C, 2D, 2E, and 4D are misleading because they do not start at 0. Please let the axis start at 0, or make axis breaks.

      We re-graphed these panels.

      Line 229: " Consequences ... brain development." This subheader is misleading, because the in vitro cortical organoid system is a rather simplistic model for brain development, and far away from physiological brain development. Please alter the header.

      We changed the header to “Consequences of FAM53C inactivation in human cortical organoids in culture”.

      Figure S5F: the gating strategy is not clear to me. In particular, how do the authors know the difference between subG1 and G1 DAPI signals? Do they interpret the subG1 as apoptotic cells? If yes, why are there so many? Are the culturing or harvesting conditions of these organoids suboptimal? Perhaps the authors could consider doing IF stainings on EdU or BrdU on paraffin sections of organoids to obtain cleaner data?

      Thank you for your feedback. The subG1 population in the original Figure S5F represents cells that died during the dissociation step of the organoids for FACS analysis. To address this point, we performed live/dead staining to exclude dead cells and provide clearer data. We refined the gating strategy for better clarity in the new S5F panel.

      Figure S6A; the labeling seems incorrect. I would think that red is heterozygous here, and grey mutant.

      We fixed this mistake, thank you.

      Reviewer #1 (Significance (Required)):

      The finding that the poorly studied gene FAM53C controls the G1/S transition in cell lines is novel and interesting for the cell cycle field. However, the lack of phenotypes in Fam53c-/- mice makes this finding less interesting for a broader audience. Furthermore, the mechanisms are incompletely dissected. The importance of a p53-independent induction of p21 is not ruled out. And while the direct inhibitory interaction between FAM53C and DYRK1A is convincing (and also reported by others; PMID: 37802655), the authors do not (yet) convincingly show that DYRK1A inhibition can rescue a cell proliferation defect in FAM53C-deficient cells.

      Altogether, this study can be of interest to basic researchers in the cell cycle field.

      I am a cell biologist studying cell cycle fate decisions, and adaptation of cancer cells & stem cells to (drug-induced) stress. My technical expertise aligns well with the work presented throughout this paper, although I am not familiar with biolayer interferometry.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary

      In this study Hammond et al. investigated the role of Dual-specificity Tyrosine Phosphorylation regulated Kinase 1A (DYRK1) in G1/S transition. By exploiting the Dependency Map portal, they identified a previously unexplored protein FAM53C as a potential regulator of G1/S transition. Using RNAi, they confirmed that depletion of FAM53C suppressed proliferation of human RPE1 cells and that this phenotype was dependent on the presence of the RB protein. In addition, they noted increased level of CDKN1A transcript and p21 protein that could explain G1 arrest of FAM53C-depleted cells but surprisingly, they did not observe activation of other p53 target genes. Proteomic analysis identified DYRK1 as one of the main interactors of FAM53C and the interaction was confirmed in vitro. Further, they showed that purified FAM53C blocked the ability of DYRK1 to phosphorylate cyclin D in vitro although the activity of DYRK1 was likely not inhibited (judging from the modification of FAM53C itself). Instead, it seems more likely that FAM53C competes with cyclin D in this assay. Authors claim that the G1 arrest caused by depletion of FAM53C was rescued by inhibition of DYRK1 but this was true only in cells lacking functional p53. This is quite confusing as DYRK1 inhibition reduced the fraction of G1 cells in p53 wild type cells as well as in p53 knock-outs, suggesting that FAM53C may not be required for regulation of DYRK1 function. Instead of focusing on the impact of FAM53C on cell cycle progression, authors moved towards investigating its potential (and perhaps more complex) roles in differentiation of IPSCs into cortical organoids and in mice. They observed a lower level of proliferating cells in the organoids but if that reflects an increased activity of DYRK1 or if it is just an off-target effect of the genetic manipulation remains unclear. Even less clear is the phenotype in FAM53C knock-out mice. Authors did not observe any significant changes in survival nor in organ development but they noted some behavioral differences. Whether and how these are connected to the rate of cellular proliferation was not explored. In summary, the study identified a previously unknown role of FAM53C in proliferation but failed to explain the mechanism and its physiological relevance at the level of tissues and organism. Although some of the data might be of interest, in its current form the data is too preliminary to justify publication.

      Major points

      1. The whole study is based on one siRNA to Fam53C and its specificity was not validated. The level of the knock-down was shown only in the first figure and not in the other experiments. The observed phenotypes in the cell cycle progression may be affected by variable knock-down efficiency and/or potential off-target effects.

      We thank the Reviewer for raising this important point. First, we need to clarify that our experiments were performed with a pool of siRNAs (not one siRNA). Second, commercial antibodies against FAM53C are not of the best quality and it has been challenging to detect FAM53C using these antibodies in our hands – the results are often variable. In addition, to better address the Reviewer’s point and control for the phenotypes we have observed, we performed two additional series of experiments: first, we have confirmed G1 arrest in RPE-1 cells with individual siRNAs, providing more confidence for the specificity of this arrest (Fig. S1B); second, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (Fig. S1E,F and Fig. 4F).

      Experiments focusing on the cell cycle progression were done in a single cell line RPE1 that showed a strong sensitivity to FAM53C depletion. In contrast, phenotypes in IPSCs and in mice were only mild suggesting that there might be large differences across various cell types in the expression and function of FAM53C. Therefore, it is important to reproduce the observations in other cell types.

      As mentioned above, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (three cancer cell lines) (Fig. S1E,F and Fig. 4F).

      Authors state that FAM53C is a direct inhibitor of DYRK1A kinase activity (Line 203), however this model is not supported by the data in Fig 4A. FAM53C seems to be a good substrate of DYRK1 even at high concentrations when phosphorylation of cyclin D is reduced. It rather suggests that DYRK1 is not inhibited by FAM53C but perhaps FAM53C competes with cyclin D. Further, authors should address if the phosphorylation of cyclin D is responsible for the observed cell cycle phenotype. Is this Cyclin D-Thr286 phosphorylation, or are there other sites involved?

      We revised the text of the manuscript to include the possibility that FAM53C could act as a competitive substrate and/or an inhibitor.

      We removed most of the Cyclin D phosphorylation/stability data from the revised manuscript. As the Reviewers pointed out, some of these data were statistically significant but the biological effects were small. As discussed above in our response to Reviewer #1, the analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knock-down, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We note, however, that we used specific Thr286 phospho-antibodies, which have been used extensively in the field. Our data in Figure 1 with palbociclib place FAM53C upstream of Cyclin D/CDK4,6. We performed Cyclin D overexpression experiments but RPE-1 cells did not tolerate high expression of Cyclin D1 (T286A mutant) and we have not been able to conduct more ‘genetic’ studies.

      In many places, information on statistical tests is missing and SDs are not shown in the plots. For instance, which statistical test was used in Fig 4C? Impact of FAM53C on cyclin D phosphorylation does not seem to be significant. In the same experiment, does DYRK1 inhibitor prevent modification of cyclin D?

      As discussed above, we removed some of these data and re-focused the manuscript on p53-p21 as a second pathway activated by loss of FAM53C.

      Validation of SM13797 compound in terms of specificity to DYRK1 was not performed.

      This is an important point. We had cited an abstract from the company (Biosplice) but we agree that providing data is critical. We have now revised the manuscript with a new analysis of the compound’s specificity using kinase assays. These data are shown in Fig. S3F-H.

      A fraction of cells in G1 is a very easy readout but it does not measure progression through the G1 phase. Extension of the S phase or G2 delay would indirectly also result in reduction of the G1 fraction. Instead, authors could measure the dynamics of entry to S phase in cells released from a G1 block or from mitotic shake off.

      The Reviewer made a good point. As discussed in our response to Reviewer #1, with p53-null RPE-1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide. Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells. These data indicate that G1 entry by flow cytometry will not always translate into proliferation.

      Other points:

      Fig. 2C, 2D, 2E graphs should begin with 0

      We remade these graphs.

      Fig. 5D shows that the difference in p21 levels is not significant in FAM53C-KO cells but difference is mentioned in the text.

      We replaced the panel with the correct one; we apologize for this error.

      Fig. 6D comparison of datasets of extremely different sizes does not seem to be appropriate

      We agree and revised the text. We hope that the Reviewer will agree with us that it is worth showing these data, which are clearly preliminary but provide evidence of a possible role for FAM53C in the brain.

      Could there be alternative splicing in mice generating a partially functional protein without exon 4? Did authors confirm that the animal model does not express FAM53C?

      We performed RNA sequencing of mouse embryonic fibroblasts derived from control and mutant mice. We clearly identified fewer reads in exon 4 in the knockout cells, and no other obvious change in the transcript (data not shown). However, immunoblot with mouse cells for FAM53C never worked well in our hands. We made sure to add this caveat to the revised manuscript.

      Reviewer #2 (Significance (Required)):

      The main problem of this study is that the advanced experimental models in IPSCs and mice did not confirm the observations in the cell lines and thus the whole manuscript does not hold together. Although I acknowledge the effort the authors invested in these experiments, the data do not contribute to the main conclusion of the paper that FAM53C/DYRK1 regulates G1/S transition.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      This paper identifies FAM53C as a novel regulator of cell cycle progression, particularly at the G1/S transition, by inhibiting DYRK1A. Using data from the Cancer Dependency Map, the authors suggest that FAM53C acts upstream of the Cyclin D-CDK4/6-RB axis by inhibiting DYRK1A.

      Specifically, their experiments suggest that FAM53C Knockdown induces G1 arrest in cells, reducing proliferation without triggering apoptosis. DYRK1A Inhibition rescues G1 arrest in P53KO cells, suggesting FAM53C normally suppresses DYRK1A activity. Mass Spectrometry and biochemical assays confirm that FAM53C directly interacts with and inhibits DYRK1A. FAM53C Knockout in Human Cortical Organoids and Mice leads to cell cycle defects, growth impairments, and behavioral changes, reinforcing its biological importance.

      Strength of the paper:

      The study introduces a novel cell cycle control signalling module upstream of CDK4/6 in G1/S regulation which could have significant impact. The identification of FAM53C using a depmap correlation analysis is a nice example of the power of this dataset. The experiments are carried out mostly in a convincing manner and support the conclusions of the manuscript.

      Critique:

      1) The experiments rely heavily on siRNA transfections without the appropriate controls. There are so many cases of off-target effects of siRNA in the literature, and specifically for a strong phenotype on S-phase as described here, I would expect to see solid results by additional experiments. This is especially important since the ko mice do not show any significant developmental cell cycle phenotypes. Moreover, FAM53C does not show a strong fitness effect in the depmap dataset, suggesting that it is largely non-essential in most cancer cell lines. For this paper to reach publication in a high-standard journal, I would expect that the authors show a rescue of the S-phase phenotype using an siRNA-resistant cDNA, and show similar S-phase defects using an acute knock out approach with lentiviral gRNA/Cas9 delivery.

      We thank the Reviewer for this comment. Please refer to the initial response to the three Reviewers, where we discuss our use of single siRNAs and our results in multiple cell lines. Briefly, we can recapitulate the G1 arrest upon FAM53C knock-down using two independent siRNAs in RPE-1 cells. We also observe the same G1 arrest in p53 knockout cells, suggesting it is not due to a non-specific stress response. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype. Human cancer cell lines also arrest in G1 upon FAM53C knock-down, not just RPE-1 cells. Finally, we hope the Reviewer will agree with us that compensatory mechanisms are very common in the cell cycle – which may explain the lack of phenotypes in vivo or upon long-term knockout of FAM53C.

      2) The S-phase phenotype following FAM53C should be demonstrated in a larger variety of TP53WT and mutant cell lines. Given that this paper introduces a new G1/S control element, I think this is important for credibility. Ideally, this should be done with acute gRNA/Cas9 gene deletion using a lentiviral delivery system; but if the siRNA rescue experiments work and validate an on-target effect, siRNA would be an appropriate alternative.

      We now show data with three cancer cell lines (U2OS, A549, and HCT-116 – Fig. S1E,F and Fig. 4F), in addition to our results in RPE-1 cells and in human cortical organoids. We note that the knock-down experiments are complemented by overexpression data (Fig. 1G-I), by genetic data (our original DepMap screen), and our biochemical data (showing direct binding of FAM53C to DYRK1A).

      3) The western blot images shown in the MS appear heavily over-processed and saturated (See for example S4B, 4A, B, and E). Perhaps the authors should provide the original un-processed data of the entire gels?

      For several of our panels (e.g., 4E and S4B, now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      Data in 4A are also not a western blot but a radiograph.

      For immunoblots, we will provide all the source data with uncropped blots with the final submission.

      4) A critical experiment for the proposed mechanism is the rescue of the FAM53C S-phase reduction using DYRK1A inhibition shown in Figure 4. The legend here states that the data were extracted from BrdU incorporation assays, but in Figure S4D only the PI histograms are shown, and the S-phase population is not quantified. The authors should show the BrdU scatterplot and quantify the phenotype using the S-phase population in these plots. G1 measurements from PI histograms are not precise enough to allow for conclusions. Also, why are the intensities of the PI peaks so variable in these plots? Compare, for example, the HCT116 upper and lower panels where the siRNA appears to have caused an increase in ploidy.

      We apologize for the confusion and have fixed these errors. For most of the analyses, we used PI to measure G1 and S-phase entry. We added relevant flow cytometry plots to supplemental figures (Fig. S1G, H, I, as well as Fig. S4E and S4K, and Fig. S5F).

      5) There's an apparent contradiction in how RB deletion rescues the G1 arrest (Figure 2) while p21 seems to maintain the arrest even when DYRK1A is inhibited. Is p21 not induced when FAM53C is depleted in RB ko cells? This should be measured and discussed.

      This comment and comments from the two other Reviewers made us reconsider our model. We carefully re-read the Meyer paper and think that DYRK1A activity may be understood when considering levels of both CycD and p21 at the same time in a continuum (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is obvious that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      Reviewer #3 (Significance (Required)):

      In conclusion, I believe that this MS could potentially be important for the cell cycle field and also provide a new target pathway that could be relevant for cancer therapy. However, the paper has quite a few gaps and inconsistencies that need to be addressed with further experiments. My main worry is that the acute depletion phenotypes appear so strong, while the gene is non-essential in mice and shows only a minor fitness effect in the depmap screens. More convincing controls are necessary to rule out experimental artefacts that misguide the interpretation of the results.

      We appreciate this comment and hope that the Reviewer will agree it is still important to share our data with the field, even if the phenotypes in mice are modest.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      We would like to thank all the reviewers for their valuable comments and criticisms. We have thoroughly revised the manuscript and the resource to address all the points raised by the reviewers. Below, we provide a point-by-point response for the sake of clarity.

      Reviewer #1

      Evidence, reproducibility and clarity

      Summary: This manuscript, "MAVISp: A Modular Structure-Based Framework for Protein Variant Effects," presents a significant new resource for the scientific community, particularly in the interpretation and characterization of genomic variants. The authors have developed a comprehensive and modular computational framework that integrates various structural and biophysical analyses, alongside existing pathogenicity predictors, to provide crucial mechanistic insights into how variants affect protein structure and function. Importantly, MAVISp is open-source and designed to be extensible, facilitating reuse and adaptation by the broader community.

      Major comments:

      - While the manuscript is formally well-structured (with clear Introduction, Results, Conclusions, and Methods sections), I found it challenging to follow in some parts. In particular, the Introduction is relatively short and lacks a deeper discussion of the state-of-the-art in protein variant effect prediction. Several methods are cited but not sufficiently described, as if prior knowledge were assumed. OPTIONAL: Extend the Introduction to better contextualize existing approaches (e.g., AlphaMissense, EVE, ESM-based predictors) and clarify what MAVISp adds compared to each.

      We have expanded the introduction on the state-of-the-art of protein variant effects predictors, explaining how MAVISp departs from them.

      - The workflow is summarized in Figure 1(b), which is visually informative. However, the narrative description of the pipeline is somewhat fragmented. It would be helpful to describe in more detail the available modules in MAVISp, and which of them are used in the examples provided. Since different use cases highlight different aspects of the pipeline, it would be useful to emphasize what is done step-by-step in each.

      We have added a concise, narrative description of the data flow for MAVISp, as well as improved the description of modules in the main text. We will integrate the results section with a more comprehensive description of the available modules, and then clarify in the case studies which modules were applied to achieve specific results.

      OPTIONAL: Consider adding a table or a supplementary figure mapping each use case to the corresponding pipeline steps and modules used.

      We have added a supplementary table (Table S2) to guide the reader on the modules and workflows applied for each case study.

      We also added Table S1 to map the toolkit used by MAVISp to collect the data that are imported and aggregated in the webserver for further guidance.

      - The text contains numerous acronyms, some of which are not defined upon first use or are only mentioned in passing. This affects readability. OPTIONAL: Define acronyms upon first appearance, and consider moving less critical technical details (e.g., database names or data formats) to the Methods or Supplementary Information. This would greatly enhance readability.

      We revised the usage of acronyms, following the reviewer’s suggestion to define them at first appearance.

      • The code and trained models are publicly available, which is excellent. The modular design and use of widely adopted frameworks (PyTorch and PyTorch Geometric) are also strong points. However, the Methods section could benefit from additional detail regarding feature extraction and preprocessing steps, especially the structural features derived from AlphaFold2 models. OPTIONAL: Include a schematic or a table summarizing all feature types, their dimensionality, and how they are computed.

      We thank the reviewer for noticing and praising the availability of the tools of MAVISp. Our MAVISp framework utilizes methods and scores that incorporate machine learning features (such as EVE or RaSP), but does not employ machine learning itself. Specifically, we do not use PyTorch and do not utilize features in a machine learning sense. We do extract some information from the AlphaFold2 models that we use (such as the pLDDT score and their secondary structure content, as calculated by DSSP), and those are available in the MAVISp aggregated csv files for each protein entry and detailed in the Documentation section of the MAVISp website.

      • The section on transcription factors is relatively underdeveloped compared to other use cases and lacks sufficient depth or demonstration of its practical utility. OPTIONAL: Consider either expanding this section with additional validation or removing/postponing it to a future manuscript, as it currently seems preliminary.

      We have removed this section and included a mention in the conclusions as part of the future directions.

      Minor comments:

      - Most relevant recent works are cited, including EVE, ESM-1v, and AlphaFold-based predictors. However, recent methods like AlphaMissense (Cheng et al., 2023) could be discussed more thoroughly in the comparison.

      We have revised the introduction to accommodate the proper space for this comparison.

      • Figures are generally clear, though some (e.g., performance barplots) are quite dense. Consider enlarging font sizes and annotating key results directly on the plots.

      We have revised Figure 2 so that it presents only one case study, which makes it easier to read. We have also changed Figure 3, while we retained the other figures since they seemed less problematic.

      • Minor typographic errors are present. A careful proofreading is highly recommended. Below are some of the issues I identified:
        - Page 3, line 46: "MAVISp perform" -> "MAVISp performs"
        - Page 3, line 56: "automatically as embedded" -> "automatically embedded"
        - Page 3, line 57: "along with to enhance" -> unclear; please revise
        - Page 4, line 96: "web app interfaces with the database and present" -> "presents"
        - Page 6, line 210: "to investigate wheatear" -> "whether"
        - Page 6, lines 215-216: "We have in queue for processing with MAVISp proteins from datasets relevant to the benchmark of the PTM module." -> unclear sentence; please clarify
        - Page 15, line 446: "Both the approaches" -> "Both approaches"
        - Page 20, line 704: "advantage of multi-core system" -> "multi-core systems"

      We have proofread the entire article and addressed the points above.

      Significance

      General assessment: the strongest aspects of the study are the modularity, open-source implementation, and the integration of structural information through graph neural networks. MAVISp appears to be one of the few publicly available frameworks that can easily incorporate AlphaFold2-based features in a flexible way, lowering the barrier for developing custom predictors. Its reproducibility and transparency make it a valuable resource. However, while the technical foundation is solid and the effort substantial, the scientific narrative and presentation could be significantly improved. The manuscript is dense and hard to follow in places, with a heavy use of acronyms and insufficient explanation of key design choices. Improving the descriptive clarity, especially in the early sections, would greatly enhance the impact of this work.

      Advance

      To the best of my knowledge, this is one of the first modular platforms for protein variant effect prediction that integrates structural data from AlphaFold2 with bioinformatic annotations and even clinical data in an extensible fashion. While similar efforts exist (e.g., ESMFold, AlphaMissense), MAVISp distinguishes itself through openness and design for reusability. The novelty is primarily technical and practical rather than conceptual.

      Audience

      This study will be of strong interest to researchers in computational biology, structural bioinformatics, and genomics, particularly those developing variant effect predictors or analyzing the impact of mutations in clinical or functional genomics contexts. The audience is primarily specialized, but the open-source nature of the tool may extend its use to more applied or translational users, including those working in precision medicine or protein engineering.

      Reviewer expertise: my expertise is in computational structural biology, molecular modeling, and (rather weak) machine learning applications in bioinformatics. I am familiar with graph-based representations of proteins, AlphaFold2, and variant effects based on Molecular Dynamics simulations. I do not have any direct expertise in clinical variant annotation pipelines.

      Reviewer #2

      Evidence, reproducibility and clarity

      Summary: The authors present a pipeline and platform, MAVISp, for aggregating, displaying and analysing variant effects, with a focus on reclassifying variants of uncertain clinical significance and uncovering the molecular mechanisms underlying the mutations.

      Major comments:

      - On testing the platform, I was unable to look up a specific variant in ADCK1 (rs200211943, R115Q). I found that, despite the HGVSp column stating that the mapped RefSeq ID was NP_001136017, the variant was actually mapped to the canonical UniProt sequence (Q86TW2-1). NP_001136017 actually maps to Q86TW2-3, which is missing residues 74-148 compared to the -1 isoform. The UniProt canonical sequence has no exact RefSeq mapping, so the HGVSp column is incorrect in this instance. This mapping issue may also affect other proteins and result in incorrect HGVSp identifiers for variants.

      We would like to thank the reviewer for pointing out these inconsistencies. We have revised all the entries and corrected them. If needed, the history of the corrected cases can be found in the closed issues of the GitHub repository that we use for communication between biocurators and data managers (https://github.com/ELELAB/mavisp_data_collection). We have also revised the protocol we follow in this regard, as well as the MAVISp toolkit, to include better support for isoform matching in our pipelines for future entries and for the revision and monitoring of existing ones. In particular, we introduced a tool, uniprot2refseq, which helps the biocurator identify the correct match between RefSeq and UniProt in terms of sequence length and sequence identity. More details are included in the Methods section of the paper. The two relevant scripts for this step are available at: https://github.com/ELELAB/mavisp_accessory_tools/
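
      The isoform check described above can be illustrated with a minimal sketch. It is not the uniprot2refseq implementation: the function names, the toy sequences, and the toy identifiers are all hypothetical, and the similarity ratio from Python's difflib stands in for a proper alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_refseq_match(uniprot_seq, refseq_candidates):
    """Pick the candidate closest to the UniProt sequence.

    refseq_candidates maps a (hypothetical) RefSeq ID to its protein sequence.
    Returns (refseq_id, similarity, same_length), or None if there are no candidates.
    """
    scored = []
    for refseq_id, seq in refseq_candidates.items():
        similarity = sequence_similarity(uniprot_seq, seq)
        same_length = len(seq) == len(uniprot_seq)
        scored.append((refseq_id, similarity, same_length))
    if not scored:
        return None
    # Highest similarity wins; an equal-length candidate is preferred on ties.
    return max(scored, key=lambda item: (item[1], item[2]))


if __name__ == "__main__":
    canonical = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
    candidates = {
        "NP_TOY_1": canonical,                         # exact match
        "NP_TOY_3": canonical[:10] + canonical[20:],   # isoform missing a segment
    }
    print(best_refseq_match(canonical, candidates))
```

      A match with similarity below 1.0 or a differing length would be the signal for the biocurator to inspect the entry manually before an HGVSp identifier is assigned.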

      - The paper lacks a section on how to properly interpret the results of the MAVISp platform (the case-studies are helpful, but don't lay down any global rules for interpreting the results). For example: How should a variant with conflicts between the variant impact predictors be interpreted? Are specific indicators considered more 'reliable' than others?

      We have added a section in Results to clarify how to interpret results from MAVISp in the most common use cases.

      • In the Methods section, GEMME is stated as being rank-normalised with 0.5 as a threshold for damaging variants. On checking the data downloaded from the site, GEMME was not rank-normalised but rather min-max normalised. Furthermore, Supplementary text S4 conflicts with the methods section over how GEMME scores are classified, S4 states that a raw-value threshold of -3 is used.

      We thank the reviewer for spotting this inconsistency. This part of the main text was left over from a previous, preliminary version of the preprint, and we have now revised it. Supplementary Text S4 includes the correct reference for the value in light of the benchmarking therein.
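
      To make the distinction raised by the reviewer concrete, the sketch below contrasts min-max normalisation with rank normalisation on a handful of toy scores. The values are invented for illustration and are not real GEMME output; the function names are hypothetical.

```python
def min_max_normalise(scores):
    """Rescale scores linearly so that the minimum maps to 0 and the maximum to 1."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rank_normalise(scores):
    """Replace each score by its rank scaled to [0, 1]; ties share the average rank."""
    n = len(scores)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


if __name__ == "__main__":
    toy_scores = [-6.0, -3.0, -1.0, -0.5, 0.0]  # invented values
    print(min_max_normalise(toy_scores))
    print(rank_normalise(toy_scores))
```

      Because min-max scaling preserves the spacing between scores while rank scaling discards it, a fixed cut-off such as 0.5 generally selects a different set of variants under the two schemes, which is why the Methods text and the downloaded data need to agree on which one is used.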

      • Note. This is a major comment as one of the claims is that the associated web-tool is user-friendly. While functional, the web app is very awkward to use for analysis on any more than a few variants at once. The fixed window size of the protein table necessitates excessive scrolling to reach your protein-of-interest. This will also get worse as more proteins are added. Suggestion: add a search/filter bar. The same applies to the dataset window.

      We have changed the structure of the webserver in such a way that now the whole website opens as its own separate window, instead of being confined within the size permitted by the website at DTU. This solves the fixed window size issue. Hopefully, this will improve the user experience.

      We have refactored the web app by adding filtering functionality, both for the main protein table (that can now be filtered by UniProt AC, gene name or RefSeq ID) and the mutations table. Doing this required a general overhaul of the table infrastructure (we changed the underlying engine that renders the tables).

      • You are unable to copy anything out of the tables.
      • Hyperlinks in the tables only seem to work if you open them in a new tab or window.

      The table overhaul fixed both of these issues.

      • All entries in the reference column point to the MAVISp preprint even when data from other sources is displayed (e.g. MAVE studies).

      We clarified the meaning of the reference column in the Documentation on the MAVISp website, as we realized it had confused the reviewer. The reference column is meant to cite the papers where the computationally generated MAVISp data are used, not external sources. Since the most recent release also includes the experimental data module, we have refactored the MAVISp website by adding a “Datasets and metadata” page, which details metadata for key modules. These include references to data from external sources that we include in MAVISp on a case-by-case basis (for example, the results of a MAVE experiment). Additionally, we have verified that the papers using MAVISp data are up to date in https://elelab.gitbook.io/mavisp/overview/publications-that-used-mavisp-data and in the csv files of the relevant proteins.

      Below are the references currently included as publications using MAVISp data:

      | Protein(s) | Title | Journal | PMID | DOI |
      |---|---|---|---|---|
      | SMPD1 | ASM variants in the spotlight: A structure-based atlas for unraveling pathogenic mechanisms in lysosomal acid sphingomyelinase | Biochim Biophys Acta Mol Basis Dis | 38782304 | 10.1016/j.bbadis.2024.167260 |
      | TRAP1 | Point mutations of the mitochondrial chaperone TRAP1 affect its functions and pro-neoplastic activity | Cell Death & Disease | 40074754 | 10.1038/s41419-025-07467-6 |
      | BRCA2 | Saturation genome editing-based clinical classification of BRCA2 variants | Nature | 39779848 | 10.1038/s41586-024-08349-1 |
      | TP53, GRIN2A, CBFB, CALR, EGFR | TRAP1 S-nitrosylation as a model of population-shift mechanism to study the effects of nitric oxide on redox-sensitive oncoproteins | Cell Death & Disease | 37085483 | 10.1038/s41419-023-05780-6 |
      | KIF5A, CFAP410, PILRA, CYP2R1 | Computational analysis of five neurodegenerative diseases reveals shared and specific genetic loci | Computational and Structural Biotechnology Journal | 38022694 | 10.1016/j.csbj.2023.10.031 |
      | KRAS | Combining evolution and protein language models for an interpretable cancer driver mutation prediction with D2Deep | Brief Bioinform | 39708841 | 10.1093/bib/bbae664 |
      | OPTN | Decoding phospho-regulation and flanking regions in autophagy-associated short linear motifs | Communications Biology | 40835742 | 10.1038/s42003-025-08399-9 |
      | DLG4, GRB2, SMPD1 | Deciphering long-range effects of mutations: an integrated approach using elastic network models and protein structure networks | JMB | 40738203 | 10.1016/j.jmb.2025.169359 |

      Entering multiple mutants in the "mutations to be displayed" window is time-consuming for more than a handful of mutants. Suggestion: Add a box where multiple mutants can be pasted in at once from an external document.

      During the table overhaul, we have revised the user interface to add a text box that allows free copy-pasting of mutation lists. While we understand having a single input box would have been ideal, the former selection interface (which is also still available) doesn’t allow copy-paste. This is a known limitation in Streamlit.
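
      As an illustration of what handling a pasted mutation list can involve, the sketch below splits free text into tokens and keeps only well-formed single amino-acid substitutions. It is not the code behind the MAVISp web app: the function name, the accepted separators, and the decision to reject synonymous entries are assumptions made for this example.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# One-letter wild-type residue, position, one-letter mutant residue (e.g. R115Q).
MUTATION_RE = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def parse_mutation_list(text):
    """Split pasted text on commas, semicolons, or whitespace.

    Returns (valid, rejected): valid substitutions in upper case, and the
    tokens that could not be interpreted, in their original form.
    """
    valid, rejected = [], []
    for token in re.split(r"[,;\s]+", text.strip()):
        if not token:
            continue
        match = MUTATION_RE.match(token.upper())
        # Assumption: entries with identical wild-type and mutant residues are rejected.
        if match and match.group(1) != match.group(3):
            valid.append(token.upper())
        else:
            rejected.append(token)
    return valid, rejected


if __name__ == "__main__":
    pasted = "E102K, H86D\nT29I; foo R11R"
    print(parse_mutation_list(pasted))
```

      Returning the rejected tokens alongside the valid ones lets an interface tell the user which entries were ignored instead of dropping them silently.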

      Minor comments

      • Grammar. I appreciate that this manuscript may have been compiled by a non-native English speaker, but I would be remiss not to point out that there are numerous grammar errors throughout, usually sentence order issues or non-pluralisation. The meaning of the authors is mostly clear, but I recommend very thoroughly proof-reading the final version.

      We have proofread the final version of the manuscript.

      • There are numerous proteins that I know have high-quality MAVE datasets that are absent in the database e.g. BRCA1, HRAS and PPARG.

      Yes, we are aware of this. Properly importing datasets from multiplex assays is far from trivial, and they often need to be treated on a case-by-case basis. We are carefully compiling all the MAVE data locally before releasing them in the public version of the database, which is why they are missing. We are giving priority to the datasets that can be correlated with our predictions of changes in structural stability; we will then cover the remaining datasets in batches. That said, we have checked the datasets for BRCA1, HRAS, and PPARG. We have imported those for PPARG and BRCA1 from ProteinGym, referring to the studies published in 10.1038/ng.3700 and 10.1038/s41586-018-0461-z, respectively. For HRAS, after checking the available data and literature in detail, we identified a suitable dataset (10.7554/eLife.27810) but could not determine a sensible cut-off for discriminating between pathogenic and non-pathogenic variants, so we have not included it in the MAVISp dataset for now. We will contact the authors to clarify which thresholds to apply before importing the data.

      • Checking one of the existing MAVE datasets (KRAS), I found that the variants were annotated as damaging, neutral or given a positive score (these appear to stand-in for gain-of-function variants). For better correspondence with the other columns, those with positive scores could be labelled as 'ambiguous' or 'uncertain'.

      In the KRAS case study presented in MAVISp, we utilized the protein abundance dataset reported in (http://dx.doi.org/10.1038/s41586-023-06954-0) and made available in the ProteinGym repository (specifically referenced at https://github.com/OATML-Markslab/ProteinGym/blob/main/reference_files/DMS_substitutions.csv#L153). We adopted the precalculated thresholds as provided by the ProteinGym authors. We are therefore not sure whether the reviewer is referring to this dataset or to another one on KRAS.

      • Numerous thresholds are defined for stabilizing / destabilizing / neutral variants in both the STABILITY and the LOCAL_INTERACTION modules. How were these thresholds determined? I note that (PMC9795540) uses a ΔΔG threshold of 1/-1 for defining stabilizing and destabilizing variants, which is relatively standard (though they also say that 2-3 would likely be better for pinpointing pathogenic variants).

      We improved the description of our classification strategies for both modules in the Documentation page of our website. We also explained more clearly the possible sources of ‘uncertain’ annotations for the two modules in both the web app (Documentation page) and the main text. Briefly, in the STABILITY module, we consider FoldX and either Rosetta or RaSP to achieve a final classification. We first classify the prediction from each method independently, according to the following strategy:

      If DDG ≥ 3, the mutation is Destabilizing.
      If DDG ≤ −3, the mutation is Stabilizing.
      If −2 < DDG < 2, the mutation is Neutral.
      Otherwise, the mutation is Uncertain.

      We then compare the classifications obtained by the two methods: if they agree, that is the final classification; if they disagree, the final classification is Uncertain. The thresholds were selected based on previous studies, in which variants with changes in stability below 3 kcal/mol did not show a markedly different abundance at the cellular level [10.1371/journal.pgen.1006739, 10.7554/eLife.49138].
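
      The consensus logic for the STABILITY module can be sketched in a few lines. This is an illustration under stated assumptions, not the MAVISp code: the function names are hypothetical, and the sketch assumes a Neutral band of −2 < DDG < 2 with every remaining value treated as Uncertain.

```python
def classify_stability_ddg(ddg: float) -> str:
    """Classify a single predicted change in folding free energy (kcal/mol)."""
    if ddg >= 3:
        return "Destabilizing"
    if ddg <= -3:
        return "Stabilizing"
    if -2 < ddg < 2:  # assumed Neutral band
        return "Neutral"
    # Values in [2, 3) or (-3, -2] fall between the bands.
    return "Uncertain"


def classify_stability(foldx_ddg: float, second_ddg: float) -> str:
    """Consensus of FoldX and a second method (Rosetta or RaSP)."""
    first = classify_stability_ddg(foldx_ddg)
    second = classify_stability_ddg(second_ddg)
    return first if first == second else "Uncertain"


if __name__ == "__main__":
    for pair in [(3.5, 4.0), (3.5, 0.5), (2.5, 2.5), (-3.0, -3.2), (0.0, 1.9)]:
        print(pair, classify_stability(*pair))
```

      Note that a variant can end up Uncertain for two different reasons: the two methods disagree, or both place it in the gap between the Neutral band and the ±3 kcal/mol thresholds.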

      The LOCAL_INTERACTION module works similarly to the STABILITY module, in that Rosetta and FoldX are considered independently and an implicit classification is performed for each, according to the following rules (values in kcal/mol):

      If DDG > 1, the mutation is Destabilizing.
      If DDG < −1, the mutation is Stabilizing.
      If −1 ≤ DDG ≤ 1, the mutation is Neutral.

      Each mutation is therefore classified for both methods. If the methods agree (i.e., if they classify the mutation in the same way), their consensus is the final classification for the mutation; if they do not agree, the final classification will be Uncertain.

      If a mutation does not have an associated free energy value, the relative solvent accessible area is used to classify it: if SAS > 20%, the mutation is classified as Uncertain, otherwise it is not classified.
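
      The LOCAL_INTERACTION rules, including the solvent-accessibility fallback, can be sketched in the same way. Again, this is a hypothetical illustration rather than the MAVISp implementation. It assumes that values between −1 and 1 kcal/mol are Neutral, that DDG < −1 is Stabilizing, and that a missing value from either method triggers the fallback; the source only states that the fallback applies when a mutation has no associated free energy value.

```python
from typing import Optional


def classify_binding_ddg(ddg: float) -> str:
    """Classify a single predicted change in binding free energy (kcal/mol)."""
    if ddg > 1:
        return "Destabilizing"
    if ddg < -1:  # assumed symmetric threshold
        return "Stabilizing"
    return "Neutral"  # assumed: -1 <= ddg <= 1


def classify_local_interaction(
    foldx_ddg: Optional[float],
    rosetta_ddg: Optional[float],
    relative_sas: Optional[float],
) -> Optional[str]:
    """Consensus classification with a solvent-accessibility fallback.

    relative_sas is the relative solvent accessible area in percent.
    Returns None when the mutation is left unclassified.
    """
    if foldx_ddg is None or rosetta_ddg is None:
        # Assumption: one missing value is enough to fall back on accessibility.
        if relative_sas is not None and relative_sas > 20:
            return "Uncertain"
        return None
    first = classify_binding_ddg(foldx_ddg)
    second = classify_binding_ddg(rosetta_ddg)
    return first if first == second else "Uncertain"


if __name__ == "__main__":
    print(classify_local_interaction(1.5, 2.0, 50.0))
    print(classify_local_interaction(None, None, 35.0))
    print(classify_local_interaction(None, None, 10.0))
```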

      These thresholds were selected according to the best practices followed by the tool authors and, more generally, in the literature, as the reviewer also noted.

      • "Overall, with the examples in this section, we illustrate different applications of the MAVISp results, spanning from benchmarking purposes, using the experimental data to link predicted functional effects with structural mechanisms or using experimental data to validate the predictions from the MAVISp modules."

      The last of these points is not an application of MAVISp, but rather a way in which external data can help validate MAVISp results. Furthermore, none of the examples given demonstrate an application in benchmarking (what is being benchmarked?).

      We have revised the statements to avoid this confusion in the reader.

      • Transcription factors section. This section describes an intended future expansion to MAVISp, not a current feature, and presents no results. As such, it should be moved to the conclusions/future directions section.

      We have removed this section and included a mention in the conclusions as part of the future directions.

      • Figures. The dot-plots generated by the web app, and in Figures 4, 5 and 6 have 2 legends. After looking at a few, it is clear that the lower legend refers to the colour of the variant on the X-axis - most likely referencing the ClinVar effect category. This is not, however, made clear either on the figures or in the app.

      The reviewer’s interpretation of the second legend is correct: it does refer to the ClinVar classification. Nonetheless, we understand that the positioning of the legend makes it hard to see what it refers to. We have revised the captions of the figures in the main text. On the web app, we have changed the location of the figure legend for the ClinVar effect category and added a label to make it clear what the classification refers to.

      • "We identified ten variants reported in ClinVar as VUS (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, R11L, and E25Q, Fig.5a)" E25Q is benign in ClinVar and has had that status since first submitted.

      We have corrected this in the text and the statements related to it.

      Significance

      Platforms that aggregate predictors of variant effect are not a new concept; for example, dbNSFP is a database of SNV predictions from variant effect predictors and conservation predictors over the whole human proteome. Predictors such as CADD and PolyPhen-2 will often provide a summary of other predictions (their features) when using their platforms. MAVISp's unique angle on the problem is in the inclusion of diverse predictors from each of its different modules, giving a much wider perspective on variants and potentially allowing the user to identify the mechanistic cause of pathogenicity. The visualisation aspect of the web app is also a useful addition, although the user interface is somewhat awkward. Potentially the most valuable aspect of this study is the associated gitbook resource containing reports from biocurators for proteins that link relevant literature and analyse ClinVar variants. Unfortunately, such reports are currently available for only a small minority of the proteins in the database. For improvement, I think that the paper should focus more on the precise utility of the web app / gitbook reports and how to interpret the results rather than going into detail about the underlying pipeline.

      We appreciate the interest in the gitbook resource, which we also see as very valuable and one of the strengths of our work. We have now implemented a new strategy based on a Python script introduced in the mavisp toolkit to generate a template Markdown file of the report that can be further customized and imported into GitBook directly (https://github.com/ELELAB/mavisp_accessory_tools/). This should allow us to streamline the production of more reports. We are currently assigning proteins in batches to biocurators for reporting through the mavisp_data_collection GitHub repository to expand their coverage. We also revised the text and added a section on the interpretation of results from MAVISp, with a focus on the utility of the web app and reports.

      In terms of audience, the fast look-up and visualisation aspects of the web-platform are likely to be of interest to clinicians in the interpretation of variants of unknown clinical significance. The ability to download the fully processed dataset on a per-protein basis would be of more interest to researchers focusing on specific proteins or those taking a broader view over multiple proteins (although a facility to download the whole database would be more useful for this final group).

      While our website only displays the dataset per protein, the whole dataset, including all the MAVISp entries, is available at our OSF repository (https://osf.io/ufpzm/), which is cited in the paper and linked on the MAVISp website. We have further modified the MAVISp database to add a link to the repository in the modes page, so that it is more visible.

      My expertise. - I am a protein bioinformatician with a background in variant effect prediction and large-scale data analysis.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Evidence, reproducibility and clarity:

      Summary:

      The authors present MAVISp, a tool for viewing protein variants heavily based on protein structure information. The authors have done a very impressive amount of curation on various protein targets, and should be commended for their efforts. The tool includes a diverse array of experimental, clinical, and computational data sources that provides value to potential users interested in a given target.

      Major comments:

      Unfortunately I was not able to get the website to work correctly. When selecting a protein target in simple mode, I was greeted with a completely blank page in the app window. In ensemble mode, there was no transition away from the list of targets at all. I'm using Firefox 140.0.2 (64-bit) on Ubuntu 22.04. I would like to explore the data myself and provide feedback on the user experience and utility.

      We have tried to reproduce the issue mentioned by the reviewer, using the exact same Ubuntu and Firefox versions, but were unable to do so: the website worked fine for us in that environment. The issue experienced by the reviewer may have been due either to a temporary problem with the web server or to a problem with the specific browser environment they were working in. It would be useful to know the date on which this happened, to verify whether downtime on the DTU IT services side made the web server inaccessible.

      I have some serious concerns about the sustainability of the project and think that additional clarifications in the text could help. Currently is there a way to easily update a dataset to add, remove, or update a component (for example, if a new predictor is published, an error is found in a predictor dataset, or a predictor is updated)? If it requires a new round of manual curation for each protein to do this, I am worried that this will not scale and will leave the project with many out of date entries. The diversity of software tools (e.g., three different pipeline frameworks) also seems quite challenging to maintain.

      We appreciate the reviewer’s concerns about long-term sustainability. It is a fair point, and one that we consider within our steering group, which oversees and plans the activities and meets monthly. Adding entries to MAVISp is moving more and more towards automation as we grow, and we aim to minimize manual work where applicable. Still, expert intervention is genuinely needed in some of the steps, and we do not want to give it up. We intend to keep working on MAVISp to make the process of adding and updating entries as automated as possible, and to streamline the process when manual intervention is necessary. Biocurators have three core workflows to use for the default modules, which also automatically cover the source of annotations. We are currently working to streamline the procedures behind LOCAL_INTERACTION, which is the most challenging module. On the data managers' and maintainers' side, we have workflows and protocols that help us with automation, quality control, etc., and we keep working to improve them. Among these, we have workflows for updating old entries. As an example, the update of erroneously attributed RefSeq data (pointed out by reviewer 2) took us only one week overall (from assigning revisions to importing into the database), because we have a reduced version of the Snakemake workflow for automation that can act on only the affected modules. We have also streamlined the generation of the templates for the gitbook reports (see also the answer to reviewer 2).

      The update of old entries is planned and carried out regularly. We also deposit the old datasets on OSF for transparency, in case someone needs to navigate and explore the changes. Every year, between May and August, we have activities planned to update the old entries in response to protocol changes in the modules and updates in the core databases that we interact with (COSMIC, ClinVar, etc.). In case of major changes, the update activities continue in the Fall. Other revisions can happen outside these time windows if an entry is needed for a specific research project and requires updating.

      Furthermore, the community of people contributing to MAVISp as biocurators or developers is growing and we have scientists contributing from other groups in relation to their research interest. We envision that for this resource to scale up, our team cannot be the only one producing data and depositing it to the database. To facilitate this we launched a pilot for a training event online (see Event page on the website) and we will repeat it once per year. We also organize regular meetings with all the active curators and developers to plan the activities in a sustainable manner and address the challenges we encounter.

      As stated in the manuscript, with the current team, automation, and resources that we have gathered around this initiative, we can provide updates to the public database every three months, and we have been regularly satisfied with them. Additionally, we are capable of processing 20 to 40 proteins every month, depending also on the need to revise or expand analyses of existing proteins. We also depend on these data for our own research projects, and we are fully committed to this resource.

      Additionally, we are planning future activities in these directions to improve scale up and sustainability:

      • Streamlining manual steps so that they are as convenient and fast as possible for our curators, e.g., by providing custom pages on the MAVISp website
      • Streamlining and automating the generation of useful output, for instance the reports, by using a combination of simple automation and large language models
      • Implementing ways to share our software and scripts with third parties, for instance by providing ready-made (or nearly ready-made) containers or virtual machines
      • For a future version 2, if the database grows in a direction that is not compatible with Streamlit, the web data science framework we are currently using, we will rewrite the website using a framework that allows better flexibility and performance, for instance Django with a proper database backend.

      On the same theme, according to the GitHub repository, the program relies on Python 3.9, which reaches end of life in October 2025. It has been tested against Ubuntu 18.04, which left standard support in May 2023. The authors should update the software to more modern versions of Python to promote the long-term health and maintainability of the project.

      We thank the reviewer for this comment - we are aware of the upcoming EOL of Python 3.9. We tested MAVISp, both software package and web server, using Python 3.10 (which is the minimum supported version going forward) and Python 3.13 (which is the latest stable release at the time of writing) and updated the instructions in the README file on the MAVISp GitHub repository accordingly.

      We plan on keeping track of Python and library versions during our testing and updating them when necessary. In the future, we also plan to deploy Continuous Integration with automated testing for our repository, making this process easier and more standardized.

      I appreciate that the authors have made their code and data available. These artifacts should also be versioned and archived in a service like Zenodo, so that researchers who rely on or want to refer to specific versions can do so in their own future publications.

      Since 2024, we have been reporting all previous versions of the dataset on OSF, the repository linked to the MAVISp website, at https://osf.io/ufpzm/files/osfstorage (folder: previous_releases). We prefer to keep everything under OSF, as we also use it to deposit, for example, the MD trajectory data.

      Additionally, in this GitHub page that we use as a space to interact between biocurators, developers, and data managers within the MAVISp community, we also report all the changes in the NEWS space: https://github.com/ELELAB/mavisp_data_collection

      Finally, the individual tools are all available in our GitHub repository, where version control is in place (see Table S1, where we now mapped all the resources used in the framework)

      In the introduction of the paper, the authors conflate the clinical challenges of variant classification with evidence generation, and the two are quite muddled together. They should strongly consider splitting the first paragraph into two paragraphs - one about challenges in variant classification/clinical genetics/precision oncology and another about variant effect prediction and experimental methods. The authors should also note that there are many predictors other than AlphaMissense, and may want to cite the ClinGen recommendations (PMID: 36413997) in the intro instead.

      We revised the introduction in light of these suggestions. We have split the paragraph as recommended and added a longer second paragraph about VEPs and using structural data in the context of VEPs. We have also added the citation that the reviewer kindly recommended.

      Also in the introduction on lines 21-22 the authors assert that "a mechanistic understanding of variant effects is essential knowledge" for a variety of clinical outcomes. While this is nice, it is clearly not the case as we can classify variants according to the ACMG/AMP guidelines without any notion of specific mechanism (for example, by combining population frequency data, in silico predictor data, and functional assay data). The authors should revise the statement so that it's clear that mechanistic understanding is a worthy aspiration rather than a prerequisite.

      We have revised the statement in light of this comment from the reviewer.

      In the structural analysis section (page 5, lines 154-155 and elsewhere), the authors define cutoffs with convenient round numbers. Is there a citation for these values or were these arbitrarily chosen by the authors? I would have liked to see some justification that these assignments are reasonable. Also there seems to be an error in the text where values between -2 and -3 kcal/mol are not assigned to a bin (I assume they should also be uncertain). There are other similar seemingly-arbitrary cutoffs later in the section that should also be explained.

      We have revised the text making the two intervals explicit, for better clarity.
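      One way to make such intervals explicit, so that no ΔΔG value falls between bins, is sketched below. This is a minimal illustration only: the cutoffs of 2 and 3 kcal/mol, the sign convention (positive ΔΔG taken as destabilizing), the boundary handling, and the function name `classify_ddg` are assumptions, not values or code taken from the manuscript.

```python
def classify_ddg(ddg, neutral=2.0, strong=3.0):
    """Assign a ΔΔG value (kcal/mol) to exactly one label.

    The cutoffs are illustrative placeholders (assumed, not from the paper).
    Positive values are assumed to mean destabilization.
    """
    if ddg >= strong:
        return "destabilizing"
    if ddg <= -strong:
        return "stabilizing"
    if -neutral < ddg < neutral:
        return "neutral"
    # Everything left lies in [neutral, strong) or (-strong, -neutral],
    # so both intermediate intervals are covered and no value is unassigned.
    return "uncertain"


# Values on both sides of zero, including the interval boundaries.
examples = [-3.5, -2.5, -2.0, 0.0, 2.0, 2.5, 3.0]
labels = {value: classify_ddg(value) for value in examples}
for value, label in labels.items():
    print(f"{value:+.1f} kcal/mol -> {label}")
```

      Because the final branch catches every value not claimed by an earlier one, the interval between -2 and -3 kcal/mol is handled symmetrically with the interval between 2 and 3 kcal/mol.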

      On page 9, lines 294-298 the authors talk about using the PTEN data from ProteinGym, rather than the actual cutoffs from the paper. They get to the latter later on, but I'm not sure why this isn't first? The ProteinGym cutoffs are somewhat arbitrarily based on the median rather than expert evaluation of the dataset, and I'm not sure why it's even worth mentioning them when proper classifications are available. Regarding PTEN, it would be quite interesting to see a comparison of the VAMP-seq PTEN data and the Mighell phosphatase assay, which is cited on page 9 line 288 but is not actually a VAMP-seq dataset. I think this section could be interesting but it requires some additional attention.

      We have included the data from Mighell’s phosphatase assay as provided by MaveDB in the MAVISp database, within the experimental_data module for PTEN, and we have revised the case study to include them and to better explain the decision to support both the ProteinGym and MaveDB classifications in MAVISp (when available). See the revised Figure 3, Table 1 and corresponding text.

      The authors mention "pathogenicity predictors" and otherwise use pathogenicity incorrectly throughout the manuscript. Pathogenicity is a classification for a variant after it has been curated according to a framework like the ACMG/AMP guidelines (Richards 2015 and amendments). A single tool cannot predict or assign pathogenicity - the AlphaMissense paper was wrong to use this nomenclature and these authors should not compound this mistake. These predictors should be referred to as "variant effect predictors" or similar, and they are able to produce evidence towards pathogenicity or benignity but not make pathogenicity calls themselves. For example, in Figure 4e, the terms "pathogenic" and "benign" should only be used here if these are the classifications the authors have derived from ClinVar or a similar source of clinically classified variants.

      The reviewer is correct; we have revised the terminology used in the manuscript and now refer to VEPs (variant effect predictors).

      Minor comments:

      The target selection table on the website needs some kind of text filtering option. It's very tedious to have to find a protein by scrolling through the table rather than typing in the symbol. This will only get worse as more datasets are added.

      We have revised the website, adding a filtering option. In detail, we have refactored the web app by adding filtering functionality, both for the main protein table (that can now be filtered by UniProt AC, gene name, or RefSeq ID) and the mutations table. Doing this required a general overhaul of the table infrastructure (we changed the underlying engine that renders the tables).

      The data sources listed on the data usage section of the website are not concordant with what is in the paper. For example, MaveDB is not listed.

      We have revised and updated the data sources on the website, adding a metadata section with relevant information, including MaveDB references where applicable.

      Figure 2 is somewhat confusing, as it partially interleaves results from two different proteins. This would be nicer as two separate figures, one on each protein, or just of a single protein.

      As suggested by the reviewer, we have now revised the figure and corresponding legends and text, focusing only on one of the two proteins.

      Figure 3 panel b is distractingly large and I wonder if the authors could do a little bit more with this visualization.

      We have revised Figure 3 to address these issues and to integrate new data from the comparison with the phosphatase assay.

      Capitalization is inconsistent throughout the manuscript. For example, page 9 line 288 refers to VampSEQ instead of VAMP-seq (although this is correct elsewhere). MaveDB is referred to as MAVEdb or MAVEDB in various places. AlphaMissense is referred to as Alphamissense in the Figure 5 legend. The authors should make a careful pass through the manuscript to address these kinds of issues.

      We have carefully proofread the paper for these inconsistencies.

      MaveDB has a more recent paper (PMID: 39838450) that should be cited instead of/in addition to Esposito et al.

      We have added the reference that the reviewer recommended.

      On page 11, lines 338-339 the authors mention some interesting proteins including BCL2, which has base editor data available (PMID: 35288574). Are there plans to incorporate this type of functional assay data into MAVISp?

      The assay mentioned in the paper refers to an experimental setup designed to investigate mutations that may confer resistance to the drug venetoclax. We have taken the first steps to implement a MAVISp module aimed at evaluating the impact of mutations on drug binding using alchemical free energy perturbations (ensemble mode), but it is far from complete. We expect to import these data once the module is finalized, since they can be used to benchmark it, and BCL2 is one of the proteins that we are using to develop and test the new module.

      Reviewer #3 (Significance (Required)):

      Significance:

      General assessment:

      This is a nice resource and the authors have clearly put a lot of effort in. They should be celebrated for their achievements in curating the diverse datasets, and the GitBooks are a nice approach. However, I wasn't able to get the website to work and I have raised several issues with the paper itself that I think should be addressed.

      Advance:

      New ways to explore and integrate complex data like protein structures and variant effects are always interesting and welcome. I appreciate the effort towards manual curation of datasets. This work is very similar in theme to existing tools like Genomics 2 Proteins portal (PMID: 38260256) and ProtVar (PMID: 38769064). Unfortunately as I wasn't able to use the site I can't comment further on MAVISp's position in the landscape.

      We have expanded the conclusions section to add a comparison and cite previously published work, and linked to a review we published last year that frames MAVISp in the context of computational frameworks for the prediction of variant effects. In brief, the Genomics 2 Proteins portal (G2P) includes data from several sources, including some overlapping with MAVISp such as Phosphosite or MaveDB, as well as features calculated on the protein structure. ProtVar also aggregates mutations from different sources and includes both variant effect predictors and predictions of changes in stability upon mutation, as well as predictions of complex structures. These approaches only partially overlap with MAVISp. G2P is primarily focused on structural and other annotations of the effect of a mutation; it doesn’t include features about changes of stability, binding, or long-range effects, and doesn’t attempt to classify the impact of a mutation according to its measurements. It also doesn’t include information on protein dynamics. Similarly, ProtVar does not include information on binding free energies, long-range effects, or protein dynamics.

      Audience:

      MAVISp could appeal to a diverse group of researchers who are interested in the biology or biochemistry of proteins that are included, or are interested in protein variants in general either from a computational/machine learning perspective or from a genetics/genomics perspective.

      My expertise:

      I am an expert in high-throughput functional genomics experiments and am an experienced computational biologist with software engineering experience.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #3

      Evidence, reproducibility and clarity

      Summary:

      The authors present MAVISp, a tool for viewing protein variants heavily based on protein structure information. The authors have done a very impressive amount of curation on various protein targets, and should be commended for their efforts. The tool includes a diverse array of experimental, clinical, and computational data sources that provides value to potential users interested in a given target.

      Major comments:

      Unfortunately I was not able to get the website to work properly. When selecting a protein target in simple mode, I was greeted with a completely blank page in the app window, and in ensemble mode, there was no transition away from the list of targets at all. I'm using Firefox 140.0.2 (64-bit) on Ubuntu 22.04. I would have liked to be able to explore the data myself and provide feedback on the user experience and utility.

      I have some serious concerns about the sustainability of the project and think that additional clarifications in the text could help. Currently is there a way to easily update a dataset to add, remove, or update a component (for example, if a new predictor is published, an error is found in a predictor dataset, or a predictor is updated)? If it requires a new round of manual curation for each protein to do this, I am worried that this will not scale and will leave the project with many out of date entries. The diversity of software tools (e.g., three different pipeline frameworks) also seems quite challenging to maintain.

      On the same theme, according to the GitHub repository, the program relies on Python 3.9, which reaches end of life in October 2025. It has been tested against Ubuntu 18.04, which left standard support in May 2023. The authors should update the software to more modern versions of Python to promote the long-term health and maintainability of the project.

      I appreciate that the authors have made their code and data available. These artifacts should also be versioned and archived in a service like Zenodo, so that researchers who rely on or want to refer to specific versions can do so in their own future publications.

      In the introduction of the paper, the authors conflate the clinical challenges of variant classification with evidence generation and it's quite muddled together. They should strongly consider splitting the first paragraph into two paragraphs - one about challenges in variant classification/clinical genetics/precision oncology and another about variant effect prediction and experimental methods. The authors should also note that there are many predictors other than AlphaMissense, and may want to cite the ClinGen recommendations (PMID: 36413997) in the intro instead.

      Also in the introduction on lines 21-22 the authors assert that "a mechanistic understanding of variant effects is essential knowledge" for a variety of clinical outcomes. While this is nice, it is clearly not the case as we are able to classify variants according to the ACMG/AMP guidelines without any notion of specific mechanism (for example, by combining population frequency data, in silico predictor data, and functional assay data). The authors should revise the statement so that it's clear that mechanistic understanding is a worthy aspiration rather than a prerequisite.

      In the structural analysis section (page 5, lines 154-155 and elsewhere), the authors define cutoffs with convenient round numbers. Is there a citation for these values or were these arbitrarily chosen by the authors? I would have liked to see some justification that these assignments are reasonable. Also there seems to be an error in the text where values between -2 and -3 kcal/mol are not assigned to a bin (I assume they should also be uncertain). There are other similar seemingly-arbitrary cutoffs later in the section that should also be explained.

      On page 9, lines 294-298 the authors talk about using the PTEN data from ProteinGym, rather than the actual cutoffs from the paper. They get to the latter later on, but I'm not sure why this isn't first? The ProteinGym cutoffs are somewhat arbitrarily based on the median rather than expert evaluation of the dataset and I'm not sure why it's even worth mentioning them when proper classifications are available. Regarding PTEN, it would be quite interesting to see a comparison of the VAMP-seq PTEN data and the Mighell phosphatase assay, which is cited on page 9 line 288 but is not actually a VAMP-seq dataset. I think this section could be interesting but it requires some additional attention.

      The authors mention "pathogenicity predictors" and otherwise use pathogenicity incorrectly throughout the manuscript. Pathogenicity is a classification for a variant after it has been curated according to a framework like the ACMG/AMP guidelines (Richards 2015 and amendments). A single tool cannot predict or assign pathogenicity - the AlphaMissense paper was wrong to use this nomenclature and these authors should not compound this mistake. These predictors should be referred to as "variant effect predictors" or similar, and they are able to produce evidence towards pathogenicity or benignity but not make pathogenicity calls themselves. For example, in Figure 4e, the terms "pathogenic" and "benign" should only be used here if these are the classifications the authors have derived from ClinVar or a similar source of clinically classified variants.

      Minor comments:

      The target selection table on the website needs some kind of text filtering option. It's very tedious to have to find a protein by scrolling through the table rather than typing in the symbol. This will only get worse as more datasets are added.

      The data sources listed on the data usage section of the website are not concordant with what is in the paper. For example, MaveDB is not listed.

      I found Figure 2 to be a bit confusing in that it partially interleaves results from two different proteins. I think this would be nicer as two separate figures, one on each protein, or just of a single protein.

      Figure 3 panel b is distractingly large and I wonder if the authors could do a little bit more with this visualization.

      Capitalization is inconsistent throughout the manuscript. For example, page 9 line 288 refers to VampSEQ instead of VAMP-seq (although this is correct elsewhere). MaveDB is referred to as MAVEdb or MAVEDB in various places. AlphaMissense is referred to as Alphamissense in the Figure 5 legend. The authors should make a careful pass through the manuscript to address these kinds of issues.

      MaveDB has a more recent paper (PMID: 39838450) that should be cited instead of/in addition to Esposito et al.

      On page 11, lines 338-339 the authors mention some interesting proteins including BCL2, which has base editor data available (PMID: 35288574). Are there plans to incorporate this type of functional assay data into MAVISp?

      Significance

      General assessment:

      This is a nice resource and the authors have clearly put a lot of effort in. They should be celebrated for their achievements in curating the diverse datasets, and the GitBooks are a nice approach. However, I wasn't able to get the website to work and I have raised several issues with the paper itself that I think should be addressed.

      Advance:

      New ways to explore and integrate complex data like protein structures and variant effects are always interesting and welcome. I appreciate the effort towards manual curation of datasets. This work is very similar in theme to existing tools like Genomics 2 Proteins portal (PMID: 38260256) and ProtVar (PMID: 38769064). Unfortunately as I wasn't able to use the site I can't comment further on MAVISp's position in the landscape.

      Audience:

      MAVISp could appeal to a diverse group of researchers who are interested in the biology or biochemistry of proteins that are included, or are interested in protein variants in general either from a computational/machine learning perspective or from a genetics/genomics perspective.

      My expertise:

      I am an expert in high-throughput functional genomics experiments and am an experienced computational biologist with software engineering experience.

    3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #2

      Evidence, reproducibility and clarity

      Summary:

      The authors present a pipeline and platform, MAVISp, for aggregating, displaying and analysing variant effects, with a focus on reclassification of variants of uncertain clinical significance and uncovering the molecular mechanisms underlying the mutations.

      Major comments:

      • On testing the platform, I was unable to look up a specific variant in ADCK1 (rs200211943, R115Q). I found that despite stating that the mapped RefSeq ID was NP_001136017 in the HGVSp column, it was actually mapped to the canonical UniProt sequence (Q86TW2-1). NP_001136017 actually maps to Q86TW2-3, which is missing residues 74-148 compared to the -1 isoform. The UniProt canonical sequence has no exact RefSeq mapping, so the HGVSp column is incorrect in this instance. This mapping issue may also affect other proteins and result in incorrect HGVSp identifiers for variants.
      • The paper lacks a section on how to properly interpret the results of the MAVISp platform (the case-studies are useful, but don't lay down any global rules for interpreting the results). For example: How should a variant with conflicts between the variant impact predictors be interpreted? Are certain indicators considered more 'reliable' than others?
      • In the Methods section, GEMME is stated as being rank-normalised with 0.5 as a threshold for damaging variants. On checking the data downloaded from the site, GEMME was not rank-normalised but rather min-max normalised. Furthermore, Supplementary text S4 conflicts with the methods section over how GEMME scores are classified, S4 states that a raw-value threshold of -3 is used.
      • Note. This is a major comment as one of the claims is that the associated web-tool is user-friendly. While functional, the web app is very awkward to use for analysis on any more than a few variants at once.
        • The fixed window size of the protein table necessitates excessive scrolling to reach your protein-of-interest. This will also get worse as more proteins are added. Suggestion: add a search/filter bar.
        • The same applies to the dataset window.
        • You are unable to copy anything out of the tables.
        • Hyperlinks in the tables only seem to work if you open them in a new tab or window.
        • All entries in the reference column point to the MAVISp preprint even when data from other sources is displayed (e.g. MAVE studies).
        • Entering multiple mutants in the "mutations to be displayed" window is time-consuming for more than a handful of mutants. Suggestion: Add a box where multiple mutants can be pasted in at once from an external document.
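      The GEMME normalisation point raised above matters because a fixed 0.5 threshold behaves very differently under the two schemes. The sketch below uses made-up scores purely for illustration; the function names, the example values, and the tie handling (average rank divided by the number of scores) are assumptions and do not reproduce the MAVISp pipeline.

```python
def min_max_normalise(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rank_normalise(scores):
    """Replace each score by its average 1-based rank divided by n."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores so that they share one rank.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / n for r in ranks]


# Hypothetical raw scores with one strong outlier at the low end.
raw = [-8.0, -1.0, -0.5, -0.2, 0.0]
below_minmax = sum(v < 0.5 for v in min_max_normalise(raw))
below_rank = sum(v < 0.5 for v in rank_normalise(raw))
print(below_minmax, below_rank)  # 1 2
```

      With a single outlier, min-max scaling pushes almost every variant to one side of 0.5, whereas rank normalisation splits the set close to the middle, so the stated method and the released data should agree on which scheme was used.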

      Minor comments

      • Grammar. I appreciate that this manuscript may have been compiled by a non-native English speaker, but I would be remiss not to point out that there are numerous grammar errors throughout, usually sentence order issues or non-pluralisation. The meaning of the authors is mostly clear, but I recommend very thoroughly proof-reading the final version.
      • There are numerous proteins that I know have high-quality MAVE datasets that are absent in the database e.g. BRCA1, HRAS and PPARG.
      • Checking one of the existing MAVE datasets (KRAS), I found that the variants were annotated as damaging, neutral or given a positive score (these appear to stand in for gain-of-function variants). For better correspondence with the other columns, those with positive scores could be labelled as 'ambiguous' or 'uncertain'.
      • Numerous thresholds are defined for stabilizing / destabilizing / neutral variants in both the STABILITY and the LOCAL_INTERACTION modules. How were these thresholds determined? I note that (PMC9795540) uses a ΔΔG threshold of 1/-1 for defining stabilizing and destabilizing variants, which is relatively standard (though they also say that 2-3 would likely be better for pinpointing pathogenic variants).
      • "Overall, with the examples in this section, we illustrate different applications of the MAVISp results, spanning from benchmarking purposes, using the experimental data to link predicted functional effects with structural mechanisms or using experimental data to validate the predictions from the MAVISp modules."

      The last of these points is not an application of MAVISp, but rather a way in which external data can help validate MAVISp results. Furthermore, none of the examples given demonstrate an application in benchmarking (what is being benchmarked?).
      • Transcription factors section. This section describes an intended future expansion to MAVISp, not a current feature, and presents no results. As such, it should probably be moved to the conclusions/future directions section.
      • Figures. The dot-plots generated by the web app, and in Figures 4, 5 and 6 have 2 legends. After looking at a few, it is clear that the lower legend refers to the colour of the variant on the X-axis - most likely referencing the ClinVar effect category. This is not, however, made clear either on the figures or in the app.
      • "We identified ten variants reported in ClinVar as VUS (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, R11L, and E25Q, Fig.5a)"

      E25Q is benign in ClinVar and has had that status since first submitted.

      Significance

      Platforms that aggregate predictors of variant effect are not a new concept, for example dbNSFP is a database of SNV predictions from variant effect predictors and conservation predictors over the whole human proteome. Predictors such as CADD and PolyPhen-2 will often provide a summary of other predictions (their features) when using their platforms. MAVISp's unique angle on the problem is in the inclusion of diverse predictors from each of its different modules, giving a much wider perspective on variants and potentially allowing the user to identify the mechanistic cause of pathogenicity. The visualisation aspect of the web app is also a useful addition, although the user interface is somewhat awkward. Potentially the most valuable aspect of this study is the associated gitbook resource containing reports from biocurators for proteins that link relevant literature and analyse ClinVar variants. Unfortunately, these are only currently available for a small minority of the total proteins in the database.

      For improvement, I think that the paper should focus more on the precise utility of the web app / gitbook reports and how to interpret the results rather than going into detail about the underlying pipeline.

      In terms of audience, the fast look-up and visualisation aspects of the web-platform are likely to be of interest to clinicians in the interpretation of variants of unknown clinical significance. The ability to download the fully processed dataset on a per-protein basis would be of more interest to researchers focusing on specific proteins or those taking a broader view over multiple proteins (although a facility to download the whole database would be more useful for this final group).

      My expertise.

      • I am a protein bioinformatician with a background in variant effect prediction and large-scale data analysis.
    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This study explores chromatin organization around trans-splicing acceptor sites (TASs) in the trypanosomatid parasites Trypanosoma cruzi, T. brucei and Leishmania major. By systematically re-analyzing MNase-seq and MNase-ChIP-seq datasets, the authors conclude that TASs are protected by an MNase-sensitive complex that is, at least in part, histone-based, and that single-copy and multi-copy genes display differential chromatin accessibility. Altogether, the data suggest a common chromatin landscape at TASs and imply that chromatin may modulate transcript maturation, adding a new regulatory layer to an unusual gene-expression system.

      I value integrative studies of this kind and appreciate the careful, consistent data analysis the authors implemented to extract novel insights. That said, several aspects require clarification or revision before the conclusions can be robustly supported. My main concerns are listed below, organized by topic/result section.

      TAS prediction

      * Why were TAS predictions derived only from insect-stage RNA-seq data? Restricting TAS calls to one life stage risks biasing predictions toward transcripts that are highly expressed in that stage and may reduce annotation accuracy for lowly expressed or stage-specific genes. Please justify this choice and, if possible, evaluate TAS robustness using additional transcriptomes or explicitly state the limitation.

      TAS predictions were derived only from insect-stage RNA-seq data because a previous study showed no significant differences in 5’ UTR processing between T. cruzi life stages (https://doi.org/10.3389/fgene.2020.00166). We are not testing an additional transcriptome here because the robustness of the software was already demonstrated in the original article where UTRme was described (Radio S, 2018, doi:10.3389/fgene.2018.00671).

      Results - "There is a distinctive average nucleosome arrangement at the TASs in TriTryps":

      * You state that "In the case of L. major the samples are less digested." However, Supplementary Fig. S1 suggests that replicate 1 of L. major is less digested than the T. brucei samples, while replicate 2 of L. major looks similarly digested. Please clarify which replicates you reference and correct the statement if needed.

      The reviewer has a good point. We made our statement based on the value of the maximum peak of the sequenced DNA molecules, which in general is a good indicator of the extent of digestion achieved in the sample (Cole H, NAR, 2011).

      As the reviewer correctly points out, we should have also considered the length of the DNA molecules in each percentile. However, in this case both the T. brucei and L. major samples were gel purified before sequencing, and it is hard to know exactly which fragments were left behind in each case. Therefore, it is better not to draw strong conclusions in that regard.

      We have now commented on this in the main manuscript, and we have clarified in the figure legends which data set we used in each case.

      * It appears you plot one replicate in Fig. 1b and the other in Suppl. Fig. S2. Please indicate explicitly which replicate is in each plot. For T. brucei, the NDR upstream of the TAS is clearer in Suppl. Fig. S2 while the TAS protection is less prominent; based on your digestion argument, this should correspond to the more-digested replicate. Please confirm.

      The replicates used for the construction of each figure are explicitly indicated in Table S1. Although the table details the original publication, the project and the accession number for each data set, the reviewer is correct that it was still not completely clear which length distribution heatmap each sample was associated with. To avoid this confusion, we have now added the accession number for each data set to the figure legends and also clarified this in Table S1. Regarding the reviewer’s comment on the correspondence between the observed TAS protection and the extent of sample digestion, he/she is correct that for a more digested sample we would expect a clearer NDR. In this case, the difference in the extent of digestion between these two samples is minor, as the main peak in the length distribution histogram of sequenced DNA molecules is the same. These two samples, GSM5363006 (represented in Fig. 1b) and GSM5363007 (represented in Suppl. Fig. S2), belong to the same original paper (Maree et al 2017), and both were gel purified before sequencing. Therefore, any difference between them could result not only from a minor difference in the digestion level achieved in each experiment but also from bias in which fragments were included during gel purification. We would therefore not draw firm conclusions about TAS protection from this comparison. We have now included a brief comment on this in the discussion of the figure.

      * The protected region around the TAS appears centered on the TAS in T. brucei but upstream in L. major. This is an interesting difference. If it is technical (different digestion or TAS prediction offset), explain why; if likely biological, discuss possible mechanisms and implications.

      We appreciate the reviewer’s suggestion. We cannot be sure whether this is due to technical or biological reasons, but there is evidence that the L. major genome has a different dinucleotide content, which might have an impact on nucleosome assembly. We have now added a comment about this observation in the final discussion of the manuscript.

      Results - "An MNase sensitive complex occupies the TASs in T. brucei":

      * The definition of "MNase activity" and the ordering of samples into Low/Intermediate/High digestion are unclear. Did you infer digestion levels from fragment distributions rather than from controlled experimental timepoints? In Suppl. Fig. S3a it is not obvious how "Low digestion" was defined; that sample's fragment distribution appears intermediate. Please provide objective metrics (e.g., median fragment length, fraction 120-180 bp) used to classify digestion levels.

      As the reviewer suggests, the ideal experiment would be to perform a time course of MNase reaction with all the samples in parallel, or to work with a fixed time point adding increasing amounts of MNase. However, even when making controlled experimental timepoints, you need to check the length distribution histogram of sequenced DNA molecules to be sure which level of digestion you have achieved.

      In this particular case, we used publicly available data sets for this analysis. We made an arbitrary definition of low, intermediate and high levels of digestion, not as an absolute level of digestion, but as a comparative output among the tested samples. We based our definition on the comparison of the main peak in length distribution heatmaps because this parameter is the best metric to estimate the level of digestion of a given sample. It represents the percentage of the total DNA sequenced that contains the predominant length in the sample tested. Hence, we considered:

      low digestion: when the main peak is longer than the expected protection for a nucleosome (longer than 150 bp). We expect this sample to contain additional longer bands that correspond to less digested material.

      intermediate digestion: when the main peak matches the expected nucleosome core protection (~146-150 bp).

      high digestion: when the main peak is shorter than that (shorter than 146 bp). This case is normally accompanied by a bigger dispersion in fragment sizes.
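
      The three comparative classes above can be sketched in a few lines of Python. This is only an illustration of the stated rule (main peak longer than 150 bp, within ~146-150 bp, or shorter than 146 bp), not the code used for the manuscript; the function names, the tie-breaking rule and the toy fragment lengths are assumptions made for the example.

```python
from collections import Counter


def main_peak(fragment_lengths):
    """Return (modal fragment length, fraction of fragments with that length)."""
    counts = Counter(fragment_lengths)
    # Most frequent length; ties are broken toward the shorter length (assumption).
    peak_len = min(counts, key=lambda length: (-counts[length], length))
    return peak_len, counts[peak_len] / len(fragment_lengths)


def classify_digestion(peak_len, core_min=146, core_max=150):
    """Map the main peak to the comparative low/intermediate/high labels."""
    if peak_len > core_max:
        return "low"
    if peak_len < core_min:
        return "high"
    return "intermediate"


# Toy fragment lengths in bp, invented for illustration only.
toy_lengths = [165, 165, 165, 150, 147, 130]
peak, fraction = main_peak(toy_lengths)
label = classify_digestion(peak)
```

      In practice the fragment lengths would come from the paired-end alignments of each data set, and the labels remain relative to the samples being compared rather than absolute.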

      To do this analysis, we chose samples that show different MNase protection of the TAS when all the sequenced DNA molecules are plotted relative to this point, and we used this protection as a predictor of the extent of sample digestion (Figure 2). To corroborate our hypothesis that the degree of TAS protection was indeed related to the extent of MNase digestion of a given sample, we looked at the length distribution histogram of the sequenced DNA molecules in each case. This is the best measurement of the extent of digestion achieved, especially when sequencing the whole sample without any gel purification and representing all the reads in the analysis, as we did. The only caveat concerns the sample called “intermediate digestion 1”, which belongs to the original work of Maree 2017, since only this data set was gel purified.

      The sample used in Figure 1 (from Maree 2017) is also from the same lab and is an MNase-seq data set. Strictly speaking, there is no methodological difference between MNase-seq and the input of a native MNase-ChIP-seq, since the input does not undergo the IP.

      * Several fragment distributions show a sharp cutoff at ~100-125 bp. Was this due to gel purification or bioinformatic filtering? State this clearly in Methods. If gel purification occurred, that can explain why some datasets preserve the MNase-sensitive region.

      The sharp cutoff is due to neither gel purification nor bioinformatic filtering; it simply reflects the length of the paired-end reads used in each case. In earlier works it was most common to sequence only 50 bp; as the technology improved, read length went up to 75, 100 or 125 bp. We have now clarified in Table S1 the length of the paired-end reads used in each case when possible.

      * Please reconcile cases where samples labeled as more-digested contain a larger proportion of >200 bp fragments than supposedly less-digested samples; this ordering affects the inference that digestion level determines the loss/preservation of TAS protection. Based on the distributions I see, "Intermediate digestion 1" appears most consistent with an expected MNase curve - please confirm and correct the manuscript accordingly.

      As explained above, it is a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme, which preferentially cuts AT-rich sequences.

      The rationale is as follows: when chromatin is digested with MNase to map nucleosomes genome-wide, the ideal situation would be to recover all the material in the mononucleosome band. Because MNase digests protected DNA less efficiently but, if the reaction proceeds further, always ends up destroying part of it, the result is always far from perfect. The best situation we can get is a sample in which ~80% of the material is contained in the mononucleosome band. The main point is that, even in the best scenario, there are always some additional longer bands, such as those for di- or tri-nucleosomes. With further digestion, less than 80% of the material remains in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start getting digested as well, producing a bigger dispersion in fragment sizes. How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are the ones that were originally most highly protected by proteins or higher-order structure, or that have a low AT content, making their linker DNA extremely resistant to initial cleavage. Once the majority of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion in the final material. The bottom line is that it is not good practice to work with under- or over-digested samples. Our main point is that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.
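
      Two simple summaries make this argument concrete: the fraction of fragments in the mononucleosome band and the spread of the length distribution. The sketch below is illustrative only; the 120-180 bp band is borrowed from the nucleosome-size range quoted elsewhere in this response, the interquartile range is just one possible measure of dispersion, and the toy samples are invented.

```python
import statistics


def mono_fraction(fragment_lengths, low=120, high=180):
    """Fraction of fragments whose length falls in the mononucleosome band."""
    in_band = sum(1 for n in fragment_lengths if low <= n <= high)
    return in_band / len(fragment_lengths)


def length_spread(fragment_lengths):
    """Interquartile range of fragment lengths, used here as a dispersion measure."""
    q1, _median, q3 = statistics.quantiles(fragment_lengths, n=4)
    return q3 - q1


# Toy samples in bp, invented for illustration only.
well_digested = [147] * 8 + [300, 450]  # most material in the mono band
over_digested = [90, 110, 125, 135, 140, 145, 200, 260, 320, 400]

well_fraction = mono_fraction(well_digested)
over_fraction = mono_fraction(over_digested)
spread_is_larger = length_spread(over_digested) > length_spread(well_digested)
```

      Comparing these two numbers across data sets is one way to check that samples have comparable levels of digestion before their TAS profiles are compared.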

      Results - "The MNase sensitive complexes protecting the TASs in T. brucei and T. cruzi are at least partly composed of histones": * The evidence that histones are part of the MNase-sensitive complex relies on H3 MNase-ChIP signal in subnucleosomal fragment bins. This seems to conflict with the observation (Fig. 1) that fragments protecting TASs are often nucleosome-sized. Please reconcile these points: are H3 signals confined to subnucleosomal fragments flanking the TAS while the TAS itself is depleted of H3? Provide plots that compare MNase-seq and H3 ChIP signals stratified by consistent fragment-size bins to clarify this.

      What we have learned from other, deeply studied eukaryotic organisms, such as yeast, is that NDRs are normally generated at regulatory points in the genome. In this sense, yeast tRNA genes have a complex with a bootprint smaller than a nucleosome formed by TFIIIC-TFIIIB (Nagarajavel, doi: 10.1093/nar/gkt611). On the other hand, many promoter regions have an MNase-sensitive complex with a nucleosome-size footprint that does not contain histones (Chereji et al. 2017, doi:10.1016/j.molcel.2016.12.009). The reviewer is right that in Figures 1 and S2 the footprint of whatever occupies the TAS region, especially in T. brucei, is nucleosome-size. However, this shows only the size of the complex and does not prove the nature of its components. Those are only MNase-seq data sets; since they do not include precipitation with specific antibodies, we cannot confirm that the protecting complex is made up of histones. In parallel, a complementary study by Wedel 2017, from Siegel’s lab, shows that with a properly digested sample further immunoprecipitated with an anti-H3 antibody, the TAS is not protected by nucleosomes, at least not when analyzing nucleosome-size DNA molecules. Besides, Briggs et al. 2018 (doi: 10.1093/nar/gky928) showed that, at least at intergenic regions, H3 occupancy goes down while R-loop accumulation increases. We have now added a supplemental figure associated with Figure 3 (new Supplemental Figure 5) replotting R-loops and MNase-ChIP-seq for H3 relative to our predicted TAS, showing this anti-correlation and how it partly correlates with MNase protection as well. As a control, we show that the Rpb9 trend resembles that of H3, as Siegel’s lab showed in Wedel 2018.

      * Please indicate which datasets are used for each panel in Suppl. Fig. S4 (e.g., Wedel et al., Maree et al.), and avoid calling data from different labs "replicates" unless they are true replicates.

      In most of our analyses we used true replicate experiments. Such is the case for the MNase-seq data used in Figure 1, with the corresponding replicate experiments used in Figure S2, and for the T. cruzi MNase-ChIP-seq data used in Figures 3b and 4a, with the respective replicates used in Figures S4 and S5 (now S6 in the revised manuscript). The only case in which we used experiments from two different laboratories is the MNase-ChIP-seq for H3 from T. brucei. Unfortunately, there are only two public data sets, each from a different laboratory. The samples used in Fig. 3 come from Siegel’s lab, whereas the H3 IP represented in S4 and S5 (S6 in the updated version) comes from another lab (Patterton’s). To be more rigorous, we now call them data 1 and data 2 when comparing this particular case.

      The reviewer is right that in this particular case one is native chromatin (Patterton’s) while the other is crosslinked (Siegel’s). We have now clarified in the main text that, unfortunately, we do not have a replicate, but the result remains the same under both conditions. This is compatible with my own experience, in which crosslinking does not affect global nucleosome patterns (compare nucleosome organization from crosslinked chromatin MNase-seq inputs in Chereji, Mol Cell, 2017, doi: 10.1016/j.molcel.2016.12.009, with native MNase-seq from Ocampo, NAR, 2016, doi: 10.1093/nar/gkw068).

      * Several datasets show a sharp lower bound on fragment size in the subnucleosomal range (e.g., ~80-100 bp). Is this a filtering artifact or a gel-size selection? Clarify in Methods and, if this is an artifact, consider replotting after removing the cutoff.

      We have only filtered adapter dimers or overrepresented sequences when needed. In Figures 2 and S3 we represented all the sequenced reads. In other figures, when we sort fragment sizes in silico (such as nucleosome, dinucleosome or subnucleosome size), we make a note in the figure legends. What the reviewer points out is related to the length of the sequenced DNA fragments in each experiment. As we explained above, the older data sets were generated with 50 bp paired-end reads, while the newer ones use 75, 100 or 125 bp. This information is now clarified in Table S1.

      Results - "The TASs of single and multi-copy genes are differentially protected by nucleosomes":

      * Please include T. brucei RNA-seq data in Suppl. Fig. S5b as you did for T. cruzi.

      We have shown chromatin organization for T. brucei in S5b to show that there is a similar trend. Unfortunately, we did not obtain a list of multi-copy genes for T. brucei as robust as the one we obtained for T. cruzi; therefore, we do not want to overinterpret by showing the RNA-seq for these subsets of genes. The limitation is related to the fact that UTRme restricts the search and is extremely strict when calling sites in repetitive regions.

      * Discuss how low or absent expression of multigene families affects TAS annotation (which relies on RNA-seq) and whether annotation inaccuracies could bias the observed chromatin differences.

      Mapping and annotating sites that belong to repetitive regions is highly complex. UTRme is specifically designed to avoid overcalling those sites. In other words, there is a chance that we are underestimating the number of predicted TASs at multi-copy genes. Regarding the chromatin analysis, we cannot rule out an impact, but the observation favors our conclusion: even though some TASs at multi-copy genes may remain elusive, we observe more nucleosome density at those places.

      * The statement that multi-copy genes show an "oscillation" between AT and GC dinucleotides is not clearly supported: the multi-copy average appears noisier and is based on fewer loci. Please tone down this claim or provide statistical support that the pattern is periodic rather than noisy.

      We have now fixed this in the preliminary revised version.

      * How were multi-copy genes defined in T. brucei? Include the classification method in Methods.

      This classification was done in the same way as explained for T. cruzi.

      Genomes and annotations: * If transcriptomic data for the Y strain was used for T. cruzi, please explain why a Y strain genome was not used (e.g., Wang et al. 2021 GCA_015033655.1), or justify the choice. For T. brucei, consider the more recent Lister 427 assembly (Tb427_2018) from TriTrypDB. Use strain-matched genomes and transcriptomes when possible, or discuss limitations.

      The most appropriate way to analyze high-throughput data is to align it to the genome of the strain in which the experiments were conducted. This was clearly illustrated in a previous publication from our group, in which we explained how data from the hybrid CL Brener strain should be analyzed. A common practice in the past was to use only the Esmeraldo-like genome for simplicity, but this resulted in output artifacts. Therefore, we aligned the data to the CL Brener genome and then focused the main analysis on the Esmeraldo haplotype (Beati Plos ONE, 2023). Ideally, we would have had transcriptomic data for the same strain (CL Brener or Esmeraldo). Since this was not the case at that moment, we used data from the Y strain, which belongs to the same DTU as Esmeraldo.

      In the case of T. brucei, when we started our analysis and the software code for UTRme was written, only the previous version of the genome was available. When the 2018 version came out, we checked chromatin parameters and observed that it did not change the main observations. Therefore, we continued working with our previous setup.

      Reproducibility and broader integration: * Please share the full analysis pipeline (ideally on GitHub/Zenodo) so the results are reproducible from raw reads to plots.

      We are preparing a full pipeline on GitHub. We will make it available before the full revision of the manuscript.

      * As an optional but helpful expansion, consider including additional datasets (other life stages, BSF MNase-seq, ATAC-seq, DRIP-seq) where available to strengthen comparative claims.

      We are now including a new supplemental figure with DRIP-seq and Rpb9 ChIP-seq (revised S5). Additionally, we added a new panel c to Figure 4, representing FAIRE-seq data for T. cruzi for single and multi-copy genes.

      We are working on the ATAC-seq analysis and BSF MNase-seq.

      Optional analyses that would strengthen the study: * Stratify single-copy genes by expression (high / medium / low) and examine average nucleosome occupancy at TASs for each group; a correlation between expression and NDR depth would strengthen the functional link to maturation.

      We have now included a panel in supplemental figure 5 (now revised S6) showing the concordance of chromatin organization relative to the TAS for genes stratified by RNA-seq levels.
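
      For readers who want to reproduce this kind of stratification, a minimal sketch follows. It assumes rank-based terciles and a simple per-position mean of occupancy profiles; the actual thresholds and data structures used for the revised figure are not stated in this response, so every name and value below is hypothetical.

```python
def stratify_by_expression(expression):
    """Split genes into low/medium/high terciles by expression rank.

    `expression` maps gene id -> expression value (for example TPM).
    """
    ranked = sorted(expression, key=expression.get)
    n = len(ranked)
    labels = {}
    for rank, gene in enumerate(ranked):
        if rank < n / 3:
            labels[gene] = "low"
        elif rank < 2 * n / 3:
            labels[gene] = "medium"
        else:
            labels[gene] = "high"
    return labels


def mean_profile(profiles):
    """Average equal-length occupancy profiles position by position."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


# Toy expression values and TAS-centered occupancy profiles (invented).
toy_expression = {"g1": 1.0, "g2": 5.0, "g3": 20.0, "g4": 40.0, "g5": 90.0, "g6": 300.0}
toy_profiles = {"g5": [1.0, 0.2, 1.0], "g6": [1.0, 0.4, 1.0]}

groups = stratify_by_expression(toy_expression)
high_genes = [g for g, label in groups.items() if label == "high"]
high_mean = mean_profile([toy_profiles[g] for g in high_genes])
```

      The depth of the trough at the TAS position in each group's mean profile can then be compared across the low, medium and high groups.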

      Minor / editorial comments: * In the Introduction, the sentence "transcription is initiated from dispersed promoters and in general they coincide with divergent strand switch regions" should be qualified: such initiation sites also include single transcription start regions.

      We have clarified this in the preliminary revised version.

      * Define the dotted line in length distribution plots (if it is not the median, please clarify) and consider placing it at 147 bp across plots to ease comparison.

      The dotted line simply indicates where the maximum peak is located. This is now clarified in the figure legends.

      * In Suppl. Fig. 4b "Replicate2" the x-axis ticks are misaligned with labels - please fix.

      We have now fixed the figure. Thanks for noticing this mistake.

      * Typo in the Introduction: "remodellingremodeling" → "remodeling"

      Thanks for noticing this mistake; it is fixed in the current version of the manuscript.

      **Referee cross-commenting** Comment 1: I think Reviewer #2 and Reviewer #3 missed that the authors of this manuscript do cite and consider the results from Wedel et al. 2017. They even re-analysed their data (e.g. Figure 3a). I second Reviewer #2's comment indicating that the inclusion of a schematic figure to help readers visualize and better understand the findings would be an important addition.

      Comment 2: I agree with Reviewer #3 that the use of different MNase digestion procedures in the different datasets has to be considered. On the other hand, I don't think there is a problem with figure 1 showing an MNase-protected TAS for T. brucei as it is based on MNase-seq data and reproduces the reported results (Maree et al. 2017). What the Siegel lab did in Wedel et al. 2017 was MNase-ChIP-seq of H3 showing nucleosome depletion at TAS, but both results are not necessarily contradictory: There could still be something else (which does not contain H3) sitting on the TAS protecting it from MNase digestion.

      Reviewer #1 (Significance (Required)):

      This study provides a systematic comparative analysis of chromatin landscapes at trans-splicing acceptor sites (TASs) in trypanosomatids, an area that has been relatively underexplored. By re-analyzing and harmonizing existing MNase-seq and MNase-ChIP-seq datasets, the authors highlight conserved and divergent features of nucleosome occupancy around TASs and propose that chromatin contributes to the fidelity of transcript maturation. The significance lies in three aspects: 1. Conceptual advance: It broadens our understanding of gene regulation in organisms where transcription initiation is unusual and largely constitutive, suggesting that chromatin can still modulate post-transcriptional processes such as trans-splicing. 2. Integrative perspective: Bringing together data from T. cruzi, T. brucei and L. major provides a comparative framework that may inspire further mechanistic studies across kinetoplastids. 3. Hypothesis generation: The findings open testable avenues about the role of chromatin in coordinating transcript maturation, the contribution of DNA sequence composition, and potential interactions with R-loops or RNA-binding proteins. Researchers in parasitology, chromatin biology, and RNA processing will find it a useful resource and a stimulus for targeted experimental follow-up.

      My expertise is in gene regulation in eukaryotic parasites, with a focus on bioinformatic analysis of high-throughput sequencing data

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Siri et al. perform a comparative analysis using publicly available MNase-seq data from three trypanosomatids (T. brucei, T. cruzi, and Leishmania), showing that a similar chromatin profile is observed at TAS (trans-splicing acceptor site) regions. The original studies had already demonstrated that the nucleosome profile at TAS differs from the rest of the genome; however, this work fills an important gap in the literature by providing the most reliable cross-species comparison of nucleosome profiles among the tritryps. To achieve this, the authors applied the same computational analysis pipeline and carefully evaluated MNase digestion levels, which are known to influence nucleosome profiling outcomes.

      In my view, the main conclusion is that the profiles are indeed similar, even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. The manuscript could be improved with some clarifications and adjustments:

      1. The authors state from the beginning that available MNase data indicate altered nucleosome occupancy around the TAS. However, they could also emphasize that the conclusions across the different trypanosomatids are inconsistent and even contradictory: NDR in T. cruzi versus protection (in different locations) in T. brucei and Leishmania.

      We start our manuscript by referring to the first MNase-seq data sets publicly available for each TriTryp, and we point out that one of the main observations in each of them is a change in nucleosome density or occupancy at intergenic regions. For T. cruzi, in a previous publication from our group, we established that this intergenic drop in nucleosome density occurs near the trans-splicing acceptor site. In this work, we extend our study to the other members of the TriTryps: T. brucei and L. major.

      In T. brucei, the papers from Patterton’s lab and Siegel’s lab came out almost simultaneously in 2017. Hence, they do not comment on each other’s work. The first one claims the presence of a well-positioned nucleosome at the TAS by using MNase-seq, while the second one shows an NDR at the TAS by using MNase-ChIP-seq. However, we do not think they are contradictory or inconsistent. We brought them together throughout the manuscript because we think these works provide complementary information.

      On one hand, we infer that the data from Patterton’s lab are slightly less digested than the sample from Siegel’s lab. Therefore, we discuss that this moderate digestion must be the reason why they managed to detect an MNase-protecting complex sitting at the TAS (Figure 1). On the other hand, Siegel’s lab includes an additional step by performing MNase-ChIP-seq, showing that histones are not detected at the TAS when analyzing nucleosome-size fragments. Here, we go further in this analysis in Figure 3, showing that we are able to detect histone H3 only when looking at subnucleosome-size fragments. This is also true for T. cruzi.

      By integrating every analysis in this work and the previous ones, we propose that TASs are protected by an MNase-sensitive complex (shown in Figure 2). This complex is most likely only partly formed by histones, since we can detect histone H3 only when analyzing subnucleosome-size DNA molecules (Figure 3). To be absolutely sure that the complex is not entirely made up of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs, 2018, doi: 10.1093/nar/gky928) and that R-loops have plenty of interacting proteins (Girasol, 2023, doi: 10.1093/nar/gkad836). Therefore, this MNase-sensitive complex most likely has a hybrid nature, made up of H3 and some other regulatory molecules, possibly involved in trans-splicing. We have now added a new Figure S5 showing R-loop co-localization with the NDR.

      Regarding the comparison between different organisms, after explaining the sensitivity to MNase of the TAS-protecting complex, we discuss that, when comparing equally digested samples, T. cruzi and T. brucei display a similar chromatin landscape with a mild NDR at the TAS (see T. cruzi represented in Figure 1 compared to T. brucei represented in Intermediate digestion 2 in Figure 2, intermediate digestion in the revised manuscript). Unfortunately, we cannot make a good comparison with L. major, since we do not have a sample with a similar level of digestion.

      Another point that requires clarification concerns what the authors mean in the introduction and discussion when they write that trypanosomes have "...poorly organized chromatin with nucleosomes that are not strikingly positioned or phased." On the other hand, they also cite evidence of organization: "...well-positioned nucleosome at the spliced-out region.. in Leishmania (ref 34)"; "...a well-positioned nucleosome at the TASs for internal genes (ref37)"; "...a nucleosome depletion was observed upstream of every gene (ref 35)." Aren't these examples of organized chromatin with at least a few phased nucleosomes? In addition, in ref 37, figure 4 shows at least two (possibly three to four) nucleosomes that appear phased. In my opinion, the authors should first define more precisely what they mean by "poorly organized chromatin" and clarify that this interpretation does not contradict the findings highlighted in the cited literature.

      For a better understanding of nucleosome positioning and phasing, I recommend the review by Clark 2010 (doi:10.1080/073911010010524945), Figure 4. Briefly, in a cell population there are different alternative positions that a given nucleosome can adopt. However, some are more favorable. When talking about favorable positions, we refer to the coordinates in the genome that are most likely covered by a nucleosome and are predominant in the cell population. Additionally, nucleosomes may be phased or not. This refers not only to the position in the genome, but also to the distance relative to a given point. In yeast, or in highly transcribed genes of more complex eukaryotes, nucleosomes are regularly spaced and phased relative to the transcription start site (TSS) or to the +1 nucleosome (Ocampo, NAR, 2016, doi:10.1093/nar/gkw068). In trypanosomes, nucleosomes show some regular distribution on browser inspection but, given that they are not properly phased with respect to any point, it is almost impossible to estimate spacing from paired-end data. This is also consistent with a chromatin that is transcribed in an almost constitutive manner.

      As the reviewer mentions, we do cite evidence of organization. We think the original observations are correct, but we do not fully agree with some of the original statements. In this manuscript, our aim is to take the best of what we learned from the original works and to make a constructive contribution to the original discussions. In this regard, there are some conserved patterns in the chromatin landscape of trypanosomes, but their nucleosomes are far from being well-positioned or phased. For a better understanding, compare the variations observed in the y axis when representing average nucleosome occupancy in yeast with those observed in trypanosomes: the troughs and peaks are much more prominent in yeast than in any TriTryp member.

      Following the reviewer’s suggestion, we have now clarified this in the main text.

      The paper would also benefit from the inclusion of a schematic figure to help readers visualize and better understand the findings. What is the biological impact of having nucleosomes, di-nucleosomes, or sub-nucleosomes at TAS? This is not obvious to readers outside the chromatin field. For example, the following statement is not intuitive: "We observed that, when analyzing nucleosome-size (120-180 bp) DNA molecules or longer fragments (180-300 bp), the TASs of either T. cruzi or T. brucei are mostly nucleosome-depleted. However, when representing fragments smaller than a nucleosome-size (50-120 bp) some histone protection is unmasked (Fig. 3 and Fig. S4). This observation suggests that the MNase sensitive complex sitting at the TASs is at least partly composed of histones." Please clarify.

      We appreciate the reviewer’s suggestion to make a schematic figure. We are working on this, and it will be added to the manuscript upon final revision.

      Regarding the biological impact of having mono-, di- or subnucleosome fragments, it is important to determine the fragment size of the protected DNA in order to infer the nature of the protecting complex. In the case of tRNA genes in yeast, footprints smaller than a nucleosome were found at pol III promoters and turned out to be TFIIIB-TFIIIC (Nagarajavel, doi: 10.1093/nar/gkt611). Therefore, detecting something smaller than a nucleosome might suggest the binding of trans-acting factors other than histones, or of histones in a mixed complex. Such mixed complexes have also been observed, as in the case of the centromeric nucleosome, which has a very peculiar composition (Ocampo and Clark, Cell Reports, 2015). On the other hand, if we instead detect bigger fragments, it could indicate the presence of bigger protecting molecules, or that those regions are part of a higher-order chromatin organization still inaccessible to MNase linker digestion.

      Here we show in 2D plots that the complex or components protecting the TAS are nucleosome-size, but we cannot be certain that they are entirely made up of histones, since we are able to detect histone H3 only when looking at subnucleosome-size fragments. We have now added part of this explanation to the discussion.

      By integrating every analysis in this work and the previous ones, we propose that the TAS is protected by an MNase-sensitive complex (Figure 2). This complex is most likely only partly formed by histones, since we can detect histone H3 only when analyzing subnucleosome-size DNA molecules (Figure 3). As explained above, to be absolutely sure that the complex is not entirely made up of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs 2018) and that R-loops have plenty of interacting proteins (Girasol, 2023). Therefore, this MNase-sensitive complex most likely has a hybrid nature, made up of H3 and some other regulatory molecules. We have now added a new Figure S5 showing R-loop co-localization.
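
      The size-stratified view described here can be sketched as follows. The bins (50-120, 120-180 and 180-300 bp) are the ones quoted from the manuscript in the reviewer's comment; treating them as half-open intervals, ignoring strand, and using a single TAS with toy fragment coordinates are assumptions made only for this illustration.

```python
# Fragment-size bins in bp, treated as half-open [low, high) intervals (assumption).
SIZE_BINS = {
    "subnucleosome": (50, 120),
    "nucleosome": (120, 180),
    "longer": (180, 300),
}


def assign_bin(length):
    """Return the name of the size bin containing `length`, or None."""
    for name, (low, high) in SIZE_BINS.items():
        if low <= length < high:
            return name
    return None


def coverage_by_bin(fragments, tas, flank=100):
    """Per-bin coverage in a window of +/- `flank` bp around one TAS.

    `fragments` holds (start, end) half-open coordinates on the same
    chromosome as `tas`; index `flank` of each profile is the TAS itself.
    """
    profiles = {name: [0] * (2 * flank + 1) for name in SIZE_BINS}
    for start, end in fragments:
        name = assign_bin(end - start)
        if name is None:
            continue
        first = max(start, tas - flank)
        last = min(end, tas + flank + 1)
        for pos in range(first, last):
            profiles[name][pos - tas + flank] += 1
    return profiles


# Toy fragments around a hypothetical TAS at position 1000.
toy_fragments = [(950, 1050), (930, 1077), (1100, 1300)]
profiles = coverage_by_bin(toy_fragments, tas=1000, flank=100)
```

      Applying the same bins to both the MNase-seq and the H3 MNase-ChIP-seq fragments, and averaging over all TASs, gives profiles that can be compared bin by bin.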

      Some references are missing or incorrect:

      We will make a thorough revision.

      "In trypanosomes, there are no canonical promoter regions." - please check Cordon-Obras et al. (Navarro's group).

      Thank you for the appropriate suggestion. We have now added this reference.

      Please, cite the study by Wedel et al. (Siegel's group), which also performed MNase-seq analysis in T. brucei.

      We understand that Reviewer #2 missed that we cited this reference and that we did use the raw data from the manuscript of Wedel et al. 2017 from Siegel’s group. We used the MNase-ChIP-seq data set of histone H3 in our analysis for Figures 3, S4b and S5b (S6c in the revised version), as also detailed in Table S1. To be even more explicit, we have now included the accession number of each data set in the figure legends.

      Figure-specific comments: Fig. S3: Why does the number of larger fragments increase with greater MNase digestion? Shouldn't the opposite be expected?

      This is a good observation. As we also explained to Reviewer #1:

      It is a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme.

      The rationale is as follows: when chromatin is digested with MNase to map nucleosomes genome-wide, the ideal situation would be to recover all the material in the mononucleosome band. Because MNase digests protected DNA less efficiently but, if the reaction proceeds further, always ends up destroying part of it, the result is always far from perfect. The best situation we can get is a sample in which ~80% of the material is contained in the mononucleosome band. The main point is that, even in the best scenario, there are always some additional longer bands, such as those for di- or tri-nucleosomes. With further digestion, less than 80% of the material remains in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start getting digested as well, producing a bigger dispersion in fragment sizes. How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are the ones that were originally most highly protected by proteins or higher-order structure, making their linker DNA extremely resistant to initial cleavage. Once the majority of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion in the final material. The bottom line is that it is not good practice to work with under- or over-digested samples. Our main point is that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.

      Fig. S5B: Why not use MNase conditions under which T. cruzi and T. brucei display comparable profiles at TAS? This would facilitate interpretation.

      The reviewer makes a reasonable observation. We used MNase-ChIP-seq instead of MNase-seq alone to test occupancy at the TAS in these subsets of genes because we wanted to be more certain that the signal reflects the presence of histones rather than something else. By using IP for histone H3, we can see that this protein is present at multi-copy genes when looking at nucleosome-size fragments. Additionally, as shown in figure S4b, the length distribution histograms are also similar for the compared IPs.

      Minor points:

      There are several typos throughout the manuscript.

      Thanks for the observation. We will check carefully.

      Methods: "Dinucelotide frecuency calculation."

      We will add the code to GitHub.

      Reviewer #2 (Significance (Required)):

      In my view, the main conclusion is that the profiles are indeed similar, even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. Audience: basic science and specialized readers.

      Expertise: epigenetics and gene expression in trypanosomatids.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The authors analysed publicly accessible MNase-seq data in TriTryps parasites, focusing on the chromatin structure around trans-splicing acceptor sites (TASs), which are vital for processing gene transcripts. They describe a mild nucleosome depletion at the TAS of T. cruzi and L. major, whereas a histone-containing complex protects the TASs of T. brucei. In the subsequent analysis of T. brucei, they suggest that an MNase-sensitive complex is localised at the TASs. For single-copy versus multi-copy genes, the authors show different di-nucleotide patterns and chromatin structures. Accordingly, they propose this difference could be a novel mechanism to ensure the accuracy of trans-splicing in these parasites.

      Before providing an in-depth review of the manuscript, I note that some missing information would have helped in assessing the study more thoroughly; however, in the light of the available information, I provide the following comments for consideration.

      The numbering of the figures, including the figure legends, is missing in the PDF file. This is essential for assessing the provided information.

      We apologize for not including the figure numbers in the main text, although the figures are located in the right place when called in the text. The numbers were inadvertently omitted when the figure legends were moved to the bottom of the main text. This is now fixed in the updated version of the manuscript.

      The publicly available MNase-seq data are manifold, with multiple datasets available for T. cruzi, for example. It is unclear from the manuscript which dataset was used for which figure. This must be clarified.

      This was detailed in Table S1. We have now replaced the table with an improved version, and we have also included the accession number of each data set used in the figure legends.

      Why do the authors start in figure 1 with the description of an MNase-protected TAS for T. brucei, given that it has been clearly shown by the Siegel lab that there is a nucleosome depletion similar to other parasites?

      We did not want to ignore the paper from Patterton’s lab because it was the first to map nucleosomes genome-wide in T. brucei, and its main finding claimed the existence of a well-positioned nucleosome at intergenic regions, which we thought was a point worth discussing. Patterton’s work uses MNase-seq from gel-purified samples and provides replicated experiments sequenced at very good depth, whereas Siegel’s lab uses MNase-ChIP-seq of histone H3 but performs only one experiment, and its input was not sequenced. So each work has its own caveats and provides different information, and together they contribute to a more comprehensive study. We think that bringing both data sets into the discussion, as we have done in Figures 1 and 3, helps us and the community working in the field to enrich the discussion.

      If the authors re-analyse the data, they should compare their pipeline to those used in the other studies, highlighting differences and potential improvements.

      We are working on this point. We will provide a more detailed description in the final revision.

      Since many figures resemble those in already published studies, there seems little reason to repeat and compare without a detailed comparison of the pipelines and their differences.

      Following the reviewer's advice, we are now working on highlighting the main differences that justify analyzing the data the way we did; these will be added to the Methods section of the final revision.

      At first glance, some of our figures might look similar to those in the original manuscripts. However, a careful and detailed reading of our manuscript shows that we have added several analyses that reveal information that was not disclosed before.

      First, we perform a systematic comparison, analyzing every data set the same way from beginning to end; the main difference from previous studies is the thorough and precise prediction of TASs for the three organisms. Second, we represent the average chromatin organization relative to those predicted TASs for TriTryps and discuss their global patterns. Third, by representing the average chromatin organization as heatmaps, we show for the very first time that the average nucleosome landscapes are not just an average: they keep a similar organization across most of the genome. This was not done in any of the previous manuscripts except for our own (Beati, PLOS One 2023). Additionally, we introduce the discussion of how the extent of the MNase reaction can affect the output of these experiments, and we show 2D plots and length distribution heatmaps to discuss this point (a point completely ignored in all the chromatin literature for trypanosomes). Furthermore, we carry out a far-reaching analysis by considering the contributions of each published work, even when they used different techniques. Finally, we discuss our findings in the context of a topic of current interest in the field, namely TriTryps genome compartmentalization.

      Several previous MNase-seq analysis studies addressing chromatin accessibility emphasized the importance of using varying degrees of chromatin digestion, from low to high digestion (30496478, 38959309, 27151365).

      The reviewer is correct, and this point is exactly what we intended to illustrate in Figure 2. We appreciate the suggested references, which we now cite in the final discussion. Just to clarify: using varying degrees of chromatin digestion is useful for drawing conclusions about a given organism, but when comparing samples, strains, histone marks, etc., it is extremely important to select samples with similar levels of digestion.

      No information on the extent of DNA hydrolysis is provided in the original MNase-seq studies. This key information cannot be inferred from the length distribution of the sequenced reads.

      The reviewer is correct that “No information on the extent of DNA hydrolysis is provided in the original MNase-seq studies”, and this is another reason why it is important for our analysis to be published and discussed by the scientific community working on trypanosomes. We disagree with the reviewer's second statement, since the level of digestion of a sequenced sample is actually assessed by plotting the length distribution of the total DNA sequenced. It is true that before sequencing you can, and should, check the level of digestion of the purified samples in an agarose gel and/or in a bioanalyzer. It can also be checked after library preparation but before sequencing, where the fragments are expected to be larger because of the added library adapters. But the final test of success when working with MNase-digested samples is to analyze the length of the DNA molecules by plotting histograms of the length distribution of the sequenced DNA molecules. Remarkably, different samples may sometimes look very similar when run on a gel yet yield different length distribution histograms. This is because the nucleosome core can be intact while the associated linker DNA has been trimmed to a different extent, or the core itself may even have been chewed internally (see Cole Hope 2011, section 5.2, doi: 10.1016/B978-0-12-391938-0.00006-9, for a detailed explanation).
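
      As an illustration of the check described above, the following minimal Python sketch bins fragment lengths into a histogram and reports the fraction of fragments that fall in a mononucleosome-sized window. It is not the pipeline used in the study: the function names, the 10 bp bin size, the 120-180 bp window, and the toy lengths are all assumptions chosen for illustration, and in practice the lengths would come from the insert sizes of the aligned paired-end reads.

```python
from collections import Counter


def length_histogram(fragment_lengths, bin_size=10):
    """Count fragments per length bin; keys are the lower edge of each bin."""
    counts = Counter((length // bin_size) * bin_size for length in fragment_lengths)
    return dict(sorted(counts.items()))


def fraction_in_range(fragment_lengths, low=120, high=180):
    """Fraction of fragments whose length lies in [low, high] (inclusive)."""
    if not fragment_lengths:
        return 0.0
    inside = sum(1 for length in fragment_lengths if low <= length <= high)
    return inside / len(fragment_lengths)


# Toy fragment lengths in bp (hypothetical values, not data from any study).
lengths = [95, 130, 145, 147, 150, 152, 160, 170, 300, 320]

hist = length_histogram(lengths)          # assumed 10 bp bins
mono_fraction = fraction_in_range(lengths)  # assumed 120-180 bp window

for bin_start, count in hist.items():
    print(f"{bin_start}-{bin_start + 9} bp: {count}")
print(f"fraction in mononucleosome window: {mono_fraction:.2f}")
```

      Comparing such histograms, rather than gel images alone, is one way to decide whether two samples have reached a comparable level of digestion before their profiles are compared.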

      As the input material are selected, in part gel-purified mono-nucleosomal DNA bands. Furthermore the datasets are not directly comparable, as some use native MNase, while others employ MNase after crosslinking; some involve short digestion times at 37 °C, while others involve longer digestion at lower temperatures. Combining these datasets to support the idea of an MNase-sensitive complex at the TAS of T. brucei therefore may not be appropriate, and additional experiments using consistent methodologies would strengthen the study's conclusions.

      In my opinion, describing an MNase-sensitive complex based solely on these data is not feasible. It requires specifically designed experiments using a consistent method and well-defined MNase digestion kinetics.

      As the reviewer suggests, the ideal experiment would be to perform a time course of the MNase reaction with all the samples in parallel, or to work with a fixed time point while adding increasing amounts of MNase. However, the information obtained from a detailed analysis of the length distribution histogram of the sequenced DNA molecules is the best test of the real outcome. In fact, the samples with different digestion levels were probably not generated on purpose.

      The only data sets that were gel purified are those from Mareé 2017 (Patterton’s lab), used in Figures 1, S1 and S2, and those from L. major shown in Fig 1. Gel purification was common practice in those years; we later learned that it is not necessary, since fragment sizes can be sorted in silico when needed.

      As we explained to reviewer #1, to avoid this conflict we decided to remove these data from Figures 2 and S3. In summary, the three remaining samples come from the same lab and belong to the same publication (Mareé 2022). These samples are the inputs of native MNase-ChIP-seq experiments, obtained in the same way, and are fully comparable with each other.

      Reviewer #3 (Significance (Required)):

      Due to the lack of controlled MNase digestion, use of heterogeneous datasets, and absence of benchmarking against previous studies, the conclusions regarding MNase-sensitive complexes and their functional significance remain speculative. With standardized MNase digestion and clearly annotated datasets, this study could provide a valuable contribution to understanding chromatin regulation in TriTryps parasites.

      As we explained in the previous point, our conclusions are valid because no figure compares samples that come from different treatments. The only exception could be Figure 3, which deals with MNase-ChIP-seq. We have now added a clear and explicit comment, in that section and in the discussion, that despite subtle differences in experimental procedures we arrive at the same results. This is the case for the T. cruzi IP, run from crosslinked chromatin, compared to the T. brucei IP, run from native chromatin.

      Over the years, the chromatin field has observed that nucleosomes are so tightly bound to DNA that crosslinking is not necessary. However, it is still common practice, especially when performing IPs. In our own hands, we did not observe any difference at the global level, either in T. cruzi or in my previous work with yeast.

      ...

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this manuscript, the authors describe a good-quality ancient maize genome from 15th-century Bolivia and try to link the genome characteristics to Inca influence. Overall, the manuscript is below the standard in the field. In particular, the geographic origin of the sample and its archaeological context is not well evidenced. While dating of the sample and the authentication of ancient DNA have been evidenced robustly, the downstream genetic analyses do not support the conclusion that genomic changes can be attributed to Inca influence. Furthermore, sections of the manuscript are written incoherently and with logical mistakes. In its current form, this paper is not robust and possibly of very narrow interest. 

      Strengths: 

      Technical data related to the maize sample are robust. Radiocarbon dating strongly evidenced the sample age, estimated to be around 1474 AD. Authentication of ancient DNA has been done robustly. Spontaneous C-to-T substitutions, which are present in all ancient DNA, are visible in the reported sample with the expected pattern. Despite a low fraction of C-to-T at the 1st base, this number could be consistent with the cool and dry climate in which the sample was preserved. The distribution of DNA fragment sizes is consistent with expectations for a sample of this age. 

      Weaknesses: 

      Thank you for all your thoughtful comments. See below for comments on each.

      (1) Archaeological context for the maize sample is weakly supported by speculation about the origin and has unreasonable claims weighing on it. Perhaps those findings would be more convincing if the authors were to present evidence that supports their conclusions: i) a map of all known tombs near La Paz, ii) evidence supporting the stone tomb origins of this assemblage, and iii) evidence supporting non-Inca provenance of the tomb. 

      We believe we are clear about what information we have about context. First, the intake records from the MSU Museum from 1890 are not as detailed as we would like, but we cannot enhance them. The mummified girl and her accoutrements, including the maize, came from a stone tower, or chullpa, south of La Paz, in what is now Bolivia. We do not know which stone chullpa, so a map would be of limited use. The mortuary group is identified as Inca, but as we note, the accoutrements do not appear to be of high status, so it is possible that she was not an elite. Mud tombs are normally attributed to the local population, and stone towers to Inca or elites. We have clarified at multiple places in the text that the maize is from the period of Inca incursion in this part of Bolivia, and we have modified the text to reflect greater uncertainty about Inca or local origin, while noting that selection for environmentally favorable characteristics had taken place. Regardless, there are three 15th-century CE AMS ages, obtained on the maize, a Cucurbita rind, and a camelid fiber. The maize is almost certainly mid to late 15th century CE.

      (2) Dismissal of the admixture in the reported samples is not evidenced correctly. Population f3 statistic with an outgroup is indeed one of the most robust metrics for sample relatedness; however, it should not be used as a test of admixture. For an admixture test, the population f3 statistic should be used in the form: i) target population, ii) one possible parental population, iii) another possible parental population. This is typically done iteratively with all combinations of possible parental populations. Even in such a form, the population f3 statistic is not very sensitive to admixture in cases of strong genetic drift, and instead population f4 statistic (with an outgroup) is a recommended test for admixture. 

      We have removed “Our admixture f3-statistics test results suggest aBM is not admixed” from our revised manuscript. Because our goal here is to identify which group(s) has (have) the highest relatedness with aBM, the population f3 statistic with an outgroup is the most robust metric for this test and for supporting our conclusion.
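
      To make the distinction between the two uses of the f3 statistic concrete, the sketch below computes a naive f3 from per-SNP allele frequencies. It is a simplified illustration, not the analysis used in the manuscript: the function name and the toy frequencies are hypothetical, and the estimator omits the finite-sample bias correction and the block-jackknife standard errors that a real analysis would include.

```python
def naive_f3(target, source1, source2):
    """Naive f3(target; source1, source2): mean over SNPs of (c - a) * (c - b).

    Inputs are equal-length lists of allele frequencies, one value per SNP.
    No sample-size correction is applied (simplifying assumption).
    """
    if not target or not (len(target) == len(source1) == len(source2)):
        raise ValueError("frequency lists must be non-empty and of equal length")
    products = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)


# Toy allele frequencies for three SNPs (hypothetical values).
pop_a = [0.0, 0.2, 1.0]
pop_b = [1.0, 0.8, 0.0]
mixed = [0.5, 0.5, 0.5]      # lies between pop_a and pop_b at every SNP
outgroup = [0.0, 0.0, 0.0]

# Admixture form: the target goes first; a negative value suggests admixture.
admixture_f3 = naive_f3(mixed, pop_a, pop_b)

# Outgroup form: the outgroup goes first; larger values mean more shared drift
# between the two remaining populations.
outgroup_f3 = naive_f3(outgroup, pop_a, pop_b)

print(f"f3(mixed; A, B) = {admixture_f3:.4f}")
print(f"f3(outgroup; A, B) = {outgroup_f3:.4f}")
```

      The same formula serves both purposes; what changes is which population occupies the first position, which is why an outgroup f3 measures relatedness but does not by itself test for admixture.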

      (3) The geographic placement of the sample based on genetic data is not robust. To make use of the method correctly, it would be necessary to validate that genetic samples in this region follow the assumption of the 'isolation-by-distance' with dense sampling, which has not been done. Additionally, the authors posit that "This suggests that aBM might not only be genetically related to the archaeological maize from ancient Peru, but also in the possible geographic location." The method used to infer the location is based on pure genetic estimation. The above conclusion is not supported by this method, and it directly contradicts the authors' suggestion that the sample comes from Bolivia.  

      We understand that the isolation-by-distance assumption should be validated with dense sampling. We did not do so because 1) the ancient maize samples range in age from ~5000 BP to ~100 BP and were found in very different countries at different times; and 2) isolation by distance is a population genetic concept, often used to test whether populations that are geographically farther apart are also more genetically different. With only 17 ancient samples in total, our sample size is not sufficient for such a population-level test.

      Regarding “It directly contradicts the authors' suggestion that the sample comes from Bolivia”: as we state in our manuscript, “Given the provenience of the aBM and its age, it is possible the samples were local or alternatively were introduced into western highland Bolivia from the Inca core area – modern Peru.” The sample records show that the aBM sample was found in Bolivia, but we do not know where aBM originally came from before that. To address this question, we used locator.py to predict the geographic location from which aBM may have originally come, and our results show that the predicted location lies within modern Peru and is also very close to the archaeological Peruvian maize.

      Therefore, our conclusion that "This suggests that aBM might not only be genetically related to the archaeological maize from ancient Peru, but also in the possible geographic location” does not contradict the fact that the sample was found in Bolivia.

      (4) The conclusion that Ancient Andean maize is genetically similar to European varieties and hence shares a similar evolutionary history is not well supported. The PCA plot in Figure 4 merely represents sample similarity based on two components (jointly responsible for about 20% of the variation explained), and European samples could be very distant based on other components. Indeed, the direct test using the outgroup f3 statistic does not support that European varieties are particularly closely related to ancient Andean maize. Perhaps these are more closely related to Brazil? We do not know, as this has not been measured. 

      Our conclusion is “We also found that a few types of maize from Europe have a much closer distance to the archaeological maize cluster compared to other modern maize, which indicates maize from Europe might expectedly share certain traits or evolutionary characteristics with ancient maize. It is also consistent with the historical fact that maize spread to Europe after Christopher Columbus's late 15th century voyages to the Americas. But as shown, maize also has diversity inside the European maize cluster. It is possible that European farmers and merchants may have favored different phenotypic traits, and the subsequent spread of specific varieties followed the new global geopolitical maps of the Colonial era”.

      We understand your concern that the two components explain only about 20% of the variation. However, in Figure 2b of Grzybowski, M.W. et al. (2023), the first principal component (PC1) of variation for genetic marker data, which “roughly corresponded to the division between domesticated maize and maize wild relatives”, is only 1.3%. This shows that low values are quite common in maize, especially when the datasets include landraces, hybrids, and wild relatives. Our maize dataset combines archaeological maize ranging from ~5,000 BP to ~100 BP with modern maize, which makes its genetic structure more complicated. Therefore, we think our two components are the best explanation currently possible. We have also included a PCA plot based on components 1 and 3 in Fig4_PCA13.pdf. It does not show that the European samples are very distant.

      For “Perhaps these are more closely related to Brazil?”, thank you for this very good question, but we apologize that we cannot answer this question from our current study because our study focuses on identifying the location where aBM originally came from, establishing and explaining patterns of genetic variability of maize, with a specific focus on maize strains that are related to our current aBM. Thus, we will not explore the story between maize from Brazil and European maize in our current study.

      (5) The conclusion that long branches in the phylogenetic tree are due to selection under local adaptation has no evidence. Long branches could be the result of missing data, nucleotide misincorporations, genetic drift, or simply due to the inability of phylogenetic trees to model complex population-level relationships such as admixture or incomplete lineage sorting. Additionally, captions to Figure S3, do not explain colour-coding.  

      We have removed “aBM tends to have long branches compare to tropicalis maize, which can be explained by adaption for specific local environment by time.” in our revised manuscript.

      We have added the color-coding information under Fig. S3 in our revised manuscript.

      (6) The conclusion that selection detected in aBM sample is due to Inca influence has no support. Firstly, selection signature can be due to environmental or other factors. To disentangle those, the authors would need to generate the data for a large number of samples from similar cultural contexts and from a wide-ranging environmental context, followed by a formal statistical test. Secondly, allele frequency increase can be attributed to selection or demographic processes, and alone is not sufficient evidence for selection. The presented XP-EHH method seems more suitable. Overall, methods used in this paper raise some concerns: i) how accurate are allele-frequency tests of selection when only single individual is used as a proxy for a whole population, ii) the significance threshold has been arbitrary fixed to an absolute number based on other studies, but the standard is to use, for example, top fifth percentile. Finally, linking selection to particular GO terms is not strong evidence, as correlation does not imply causation, and links are unclear anyway. 

      In sum, this manuscript presents new data that seems to be of high quality, but the analyses are frequently inappropriate and/or over-interpreted. 

      Regarding your suggestion that “from similar cultural contexts and from a wide-ranging environmental context, followed by a formal statistical test”, we apologize that this cannot be done in our current study because we could not find other archaeological maize samples/datasets that are from similar cultural contexts.

      For “Secondly, allele frequency increase can be attributed to selection or demographic processes, and alone is not sufficient evidence for selection.” Yes, we agree, and that’s why we said it “inferred” the conclusion instead of “indicated”. Furthermore, we revised the whole manuscript following all reviewers’ comments and reorganized and reduced the part on selection on aBM.

      For “The presented XP-EHH method seems more suitable”, we do not think XP-EHH is the best method that could be used here because we only have one aBM sample, but XP-EHH is more suitable for a population analysis.

      For “Finally, linking selection to particular GO terms is not strong evidence, as correlation does not imply causation, and links are unclear anyway.”, as we described in our manuscript, our results “inferred” instead of “indicated” the conclusion.

      Reviewer #2 (Public review): 

      Summary: 

      The manuscript presents valuable new datasets from two ancient maize seeds that contribute to our growing understanding of the maize evolution and biodiversity landscape in pre-colonial South America. Some of the analyses are robust, but the selection elements are not supported. 

      Strengths: 

      The data collection is robust, and the data appear to be of sufficiently high quality to carry out some interesting analytical procedures. The central finding that aBM maize is closely related to maize from the core Inca region is well supported, although the directionality of dispersal is not supported. 

      Weaknesses: 

      Thank you for your comments and suggestions. See below for responses and explanations.

      The selection results are not justified, see examples in the detailed comments below. 

      (1) The manuscript mentions cultural and natural selection (line 76), but then only gives a couple of examples of selecting for culinary/use traits. There are many examples of selection to tolerate diverse environments that could be relevant for this discussion, if desired. 

      We have added relevant examples, with supporting references, to our revised manuscript.

      (2) I would be extremely cautious about interpreting the observations of a Spanish colonizer (lines 95-99) without very significant caveats. Indigenous agriculture and food ways would have been far more nuanced than what could be captured in this context, and the genocidal activities of the Europeans would have impacted food production activities to a degree, and any contemporaneous accounts need to be understood through that lens.  

      We agree with the first part of this comment and have softened our use of this particular textual material such that it is far less central to interpretation. While of interest, we cannot evaluate the impact of colonial European activities or observational bias for purposes of this analysis.

      (3) The f3 stats presented in Figure 2 are not set up to test any specific admixture scenarios, so it is unsupported to conclude that the aBM maize is not admixed on this basis (lines 201-202). The original f3 publication (Patterson et al, 2012) describes some scenarios where f3 characteristics associate with admixture, but in general, there are many caveats to this approach, and it's not the ideal tool for admixture testing, compared with e.g., f4 and D (abba-baba) statistics.  

      You make an important point that f3 statistics are not the ideal tool for admixture testing. Because our goal here is to identify which group(s) has (have) the highest relatedness with aBM, the population f3 statistic with an outgroup is the most robust metric for this test and for supporting our conclusion. We have removed “Our admixture f3-statistics test results suggest aBM is not admixed” from our revised manuscript.

      (4) I'm a little bit skeptical that the Locator method adds value here, given the small training sample size and the wide geographic spread and genetic diversity of the ancient samples that include Central America. The paper describing that method (Battey et al 2020 eLife) uses much larger datasets, and while the authors do not specifically advise on sample sizes, they caution about small sample size issues. We have already seen that the ancient Peruvian maize has the most shared drift with aBM maize on the basis of the f3 stats, and the Locator analysis seems to just be reiterating that. I would advise against putting any additional weight on the Locator results as far as geographic origins, and personally I would skip this analysis in this case.  

      As we described in our manuscript, we have 17 archaeological samples in total. More detailed information can be found in the “geographical location prediction” section.

      We cannot add more ancient samples because these are all we could find in previous publications. We would still like to keep this analysis: f3 statistics indicate genome similarity, whereas the purpose of the locator.py analysis is to predict the geographic origin of a genetic sample by comparing it to a set of samples of known geographic origin.

      (5) The overlap in PCA should not be used to confirm that aBM is authentically ancient, because with proper data handling, PCA placement should be agnostic to modern/ancient status (see lines 224-226). It is somewhat unexpected that the ancient Tehuacan maize (with a major teosinte genomic component) falls near the ancient South American maize, but this could be an artifact of sampling throughout the PCA and the lack of teosinte samples that might attract that individual.  

      We have removed “which supports the authenticity of aBM as archaeological maize” from our revised manuscript. The PCA was applied only to maize samples, so no teosinte samples were included in the analysis.

      (6) What has been established (lines 250-251) is genetic similarity to the Inca core area, not necessarily the directionality. Might aBM have been part of a cultural region supplying maize to the Inca core region, for example? Without a specific test of dispersal directionality, which I don't think is possible with the data at hand, this is somewhat speculative. 

      We added this and re-wrote this part in our revised manuscript.

      (7) Singleton SNPs are not a typical criterion for identifying selection; this method needs some citations supporting the exact approach and validation against neutral expectations (line 278). Without Datasets S2 and S3, which are not included with this submission, it is difficult to assess this result further. However, it is very unexpected that ~18,000 out of ~49,000 SNPs would be unique to the aBM lineage. This most likely reflects some data artifact (unaccounted damage, paralogs not treated for high coverage, which are extremely prevalent in maize, etc). I'm confused about unique SNPs in this context. How can they be unique to the aBM lineage if the SNPs used overlap the Grzybowski set? The GO results do not include any details of the exact method used or a statistical assessment of the results. It is not clear if the GO terms noted are statistically enriched.  

      We have added references 53 and 54 in our revised manuscript, and we also uploaded the Datasets S2 and S3.

      Regarding “I'm confused about unique SNPs in this context. How can they be unique to the aBM lineage if the SNPs used overlap the Grzybowski set?”: as we describe in our Materials and Methods, “To achieve potential unique selection on aBM, we calculated the allele frequency for each SNPs between aBM and other archaeological maize, resulting in allele frequency data for 49,896 SNPs. Of these, 18,668 SNPs were unique to aBM.” Thus, the SNPs unique to aBM come from the comparison between aBM and the other archaeological maize, and we did not use any modern maize data from the Grzybowski set.

      Regarding “The GO results do not include any details of the exact method used or a statistical assessment of the results. It is not clear if the GO terms noted are statistically enriched.”: we did not perform GO term enrichment, so there is no statistical assessment of these results. Instead, we retrieved the GO term information for each gene by checking its biological process in MaizeGDB and then summarized the results in Dataset S4.

      (8) The use of XP-EHH with pseudo haplotype variant calls is not viable (line 293). It is not clear what exact implementation of XP-EHH was used, but this method generally relies on phased or sometimes unphased diploid genotype calls to observe shared haplotypes, and some minimum population size to derive statistical power. No implementation of XP-EHH to my knowledge is appropriate for application to this kind of dataset. 

      We used the same XP-EHH statistic as in the publication “Sabeti, P.C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913-918 (2007).” Specifically, in our analysis, the SNP information of modern maize was compared with that of ancient maize. The code is available at https://doi.org/10.5061/dryad.w6m905qtd.

      XP-EHH is a statistical method used in population genetics to detect recent positive selection in one population compared to another, and it has often been applied to large modern maize populations in previous research. In our study, we wanted to detect recent positive selection in modern maize compared to ancient maize; thus, we applied XP-EHH here. Although the population size of ancient maize is small, it is the best method that we can apply to our dataset to detect recent selection in modern maize.

      Reviewer #3 (Public review): 

      Summary: 

      The authors seek to place archaeological maize samples (2 kernels) from Bolivia into genetic and geographical context and to assess signatures of selection. The kernels were dated to the end of the Incan empire, just prior to European colonization. Genetic data and analyses were used to characterize the distance from other ancient and modern maize samples and to predict the origin of the sample, which was discovered in a tomb near La Paz, Bolivia. Given the conquest of this region by the Incan empire, it is possible that the sample could be genetically similar to populations of maize in Peru, the center of the Incan empire. Signatures of selection in the sample could help reveal various environmental variables and cultural preferences that shaped maize genetic diversity in this region at that time. 

      Strengths: 

      The authors have generated substantial genetic data from these archaeological samples and have assembled a data set of published archaeological and modern maize samples that should help to place these samples in context. The samples are dated to an interesting time in the history of South America during a period of expansion of the Incan empire and just prior to European colonization. Much could be learned from even this small set of samples. 

      Weaknesses: 

      Many thanks for your comments and suggestions.  We have addressed these below and provided further explanation.

      (1) Sample preparation and sequencing: 

      Details of the quality of the samples, including the percentage of endogenous DNA are missing from the methods. The low percentage of mapped reads suggests endogenous DNA was low, and this would be useful to characterize more fully. Morphological assessment of the samples and comparison to morphological data from other maize varieties is also missing. It appears that the two kernels were ground separately and that DNA was isolated separately, but data were ultimately pooled across these genetically distinct individuals for analysis. Pooling would violate assumptions of downstream analysis, which included genetic comparison to single archaeological and modern individuals. 

      We did not carry out a morphological assessment of the samples or a comparison with morphological data from other maize varieties, because we have only 2 aBM kernels and no other archaeological samples that could be used for comparison.

      Regarding “It appears that the two kernels were ground separately and that DNA was isolated separately, but data were ultimately pooled across these genetically distinct individuals for analysis”: as stated in our Materials and Methods section, “Whole kernels were crushed in a mortar and pestle”; these two kernels were ground together before sequencing.

      While morphological assessment of the sample would be interesting, most morphological data reported for maize are from microremains (starch, phytoliths, pollen) and this is beyond the scope of our study. Most studies of macrobotanical remains do not appear to focus solely on individual kernels, but instead on (or in combination with) cob and ear shape, which were not available in the assemblage.

      (2) Genetic comparison to other samples: 

      The authors did not meaningfully address the varying ages of the other archaeological samples and modern maize when comparing the genetic distance of their samples. The archaeological samples were as old as >5000 BP to as young as 70 BP and therefore have experienced varying extents of genetic drift from ancestral allele frequencies. For this reason, age should explicitly be included in their analysis of genetic relatedness. 

      We have revised the related part of our manuscript.

      (3) Assessment of selection in their ancient Bolivian sample: 

      This analysis relied on the identification of alleles that were unique to the ancient sample and inferred selection based on a large number of unique SNPs in two genes related to internode length. This could be a technical artifact due to poor alignment of sequence data, evidence supporting pseudogenization, or within an expected range of genetic differentiation based on population structure and the age of the samples. More rigor is needed to indicate that these genetic patterns are consistent with selection. This analysis may also be affected by the pooling of the Bolivian archaeological samples.  

      We do not think this is due to poor alignment of the sequence data, since we used BWA v0.7.17 with disabled seed (-l 1024) and 0 mismatch alignment. Therefore, no SNPs could have come from poor alignment. Please see our detailed methods description here: “For the archaeological maize samples, adapters were removed and paired reads were merged using AdapterRemoval60 with parameters --minquality 20 --minlength 30. All 5′ thymine and 3′ adenine residues within 5nt of the two ends were hard-masked, where deamination was most concentrated. Reads were then mapped to soft-masked B73 v5 reference genome using BWA v0.7.17 with disabled seed (-l 1024 -o 0 -E 3) and a quality control threshold (-q 20) based on the recommended parameter61 to improve ancient DNA mapping”.

      Regarding “More rigor is needed to indicate that these genetic patterns are consistent with selection”: could you please be more specific about which method or approach we should use here? For example, are there methods from specific publications that could be referenced, or a specific tool that could be used?

      “This analysis may also be affected by the pooling of the Bolivian archaeological samples.” As we could not prove these two seeds came from two different individual plants, we do not think this analysis was affected by the pooling of the Bolivian archaeological samples.

      (4) Evidence of selection in modern vs. ancient maize: In this analysis, samples were pooled into modern and ancient samples and compared using the XP-EHH statistic. One gene related to ovule development was identified as being targeted by selection, likely during modern improvement. Once again, ancient samples span many millennia and both South, Central, and North America. These, and the modern samples included, do not represent meaningfully cohesive populations, likely explaining the extremely small number of loci differentiating the groups. This analysis is also complicated by the pooling of the Bolivian archaeological samples. 

      Yes, it is possible that ovule development might be a modern improvement. We have rewritten this part in our revised manuscript.

      Reviewer #1 (Recommendations for the authors): 

      My suggestion is to address the comments that outline why the methods used or results obtained are not sufficient to support your conclusions. Overall, I suggest limiting the narrative of Inca influence and framing it as speculation in the discussion section. Presenting conclusions of Inca influence in the title and abstract is not appropriate, given the very questionable evidence. 

      We agree and have changed the title to “Fifteenth century CE Bolivian maize reveals genetic affinities with ancient Peruvian maize”.

      Reviewer #2 (Recommendations for the authors): 

      (1) Line 74: Mexicana is another subspecies of teosinte; the distinction is between ssp. mexicana and ssp. parviglumis (Balsas teosinte), not mexicana and teosinte. 

      We have corrected this in our revised manuscript.

      (2) Line 100-102: This is a bit confusing, it cannot have been a symbol of empire "since its first introduction", since its introduction long predates the formation of imperial politics in the region. Reference 17 only treats the late precolonial Inca context, while ref 22 (which cites maize cultivation at 2450 BC, not 3000 BC) makes no reference to ritual/feasting contexts; it simply documents early phytolith evidence for maize cultivation. As such, this statement is not supported by the references offered.

      Lines 100-102: This point is well taken; the original was poor prose on our part. We have modified this discussion to address the confusing statement, and we have corrected our mistake in the age given for reference 22. The associated prose has been modified accordingly.

      We have corrected them to “Indeed, in the Andes, previous research showed that under the Inca empire, maize was fulfilled multiple contextual roles. In some cases, it operated as a sacred crop” and “…since its first introduction to the region around 2500 BC”.

      (3) Line 161: IntCal is likely not the appropriate calibration curve for this region; dates should probably be calibrated using SHCal.  

      We greatly appreciate this important (and correct) observation. We have completely recalibrated the maize AMS result based on the southern hemisphere calibration curve, discussed the new calibrations, and, for comparison, drawn on two other AMS dates on associated material that were subjected to the same southern hemisphere calibration. We are confident in a 15th century AD/CE age for the maize, most likely mid- to late 15th century.

      (4) Lines 167-169: The increase of G and A residues shown in Supplementary Figure S1a is just before the 5' end of the read within the reference genome context, and is related to fragmentation bias - a different process from postmortem deamination. Deamination leads to 5' C->T and 3' G->A, resulting in increased T at 5' ends and increased A at 3' ends, and the diagnostic damage curve. The reduction of C/T just before reads begin is not a result of deamination. 

      We have removed the sentence “Both features are indicative of postmortem deamination patterns” from our revised manuscript.

      (5) Lines 187-196 This section presents a lot of important external information establishing hypotheses, and needs some references.  

      We have added the related references here.

      (6) Line 421: This makes it sound like damage masking was done BEFORE read mapping. However, this conflicts with the previous paragraph about map Damage, and Supplementary Figure 1 still shows a slight but perceptible damage curve, which is impossible if all terminal Ts and As are hard-masked. This should be reconciled.  

      Supplementary Figure 1 shows the raw ancient maize DNA sample before damage masking. Specifically, in Step 1 we used mapDamage to check whether damage was present and to estimate it, and we produced Supplementary Figure 1 from that output. In Step 2 we used our own code to hard-mask the damaged bases and then performed read mapping.

      The purpose of Supplementary Figure 1 is to show the authenticity of aBM as archaeological maize. Therefore, it should show a slight but perceptible damage curve.
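
      As an illustration only (this is not the code used in the study), the hard-masking rule quoted from the Methods, in which 5′ thymine and 3′ adenine residues within 5 nt of the read ends are masked, could be sketched in Python as follows. The function name `mask_terminal_damage`, the default window of 5 and the choice of `N` as the mask character are assumptions made for this example.

```python
def mask_terminal_damage(read: str, window: int = 5, mask: str = "N") -> str:
    """Mask T within `window` nt of the 5' end and A within `window` nt of the 3' end.

    Hypothetical helper for illustration; real pipelines typically operate on
    FASTQ/BAM records and must also handle base qualities and strand.
    """
    bases = list(read.upper())
    n = len(bases)
    # 5' end: C->T deamination shows up as excess T, so mask terminal T.
    for i in range(min(window, n)):
        if bases[i] == "T":
            bases[i] = mask
    # 3' end: G->A deamination shows up as excess A, so mask terminal A.
    for i in range(max(n - window, 0), n):
        if bases[i] == "A":
            bases[i] = mask
    return "".join(bases)


masked = mask_terminal_damage("TTACGTACGTACGAA")
```

      Because masking is applied to the reads before mapping, a damage profile computed from the unmasked reads would still show the terminal C->T and G->A excess, which is consistent with the two-step order described above.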

      (7) Line 460: PCA method is not given (just the LD pruning and the plotting).  

      The merged dataset of SNPs for archaeological and modern maize was used for PCA with “plink --pca”.

      (8) "tropicalis" maize is not common usage, it is not clear to me what this refers to. 

      We have changed all instances of “tropicalis maize” to “tropical maize” in our revised manuscript.

      (9) The Figure 4 color palette is not accessible for colorblind/color-deficient vision.  

      We have changed the color palette of Figure 4. Please find the new colors in our uploaded Figure 4.

      (10) Datasets S2 and S3 are not included with this submission. 

      Thank you for letting us know and your suggestion. We have included Datasets S2 and S3 here.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      We thank the Reviewers for their thorough attention to our paper and the interesting discussion about the findings. Before responding to more specific comments, here are some general points we would like to clarify:

      (1) Ecological niche models are indeed correlative models, and we used them to highlight environmental factors associated with HPAI outbreaks within two host groups. We will further revise the terminology that could still unintentionally suggest causal inference. The few remaining ambiguities were mainly in the Discussion section, where our intent was to interpret the results in light of the broader scientific literature. Particularly, we will change the following expressions:

      -  “Which factors can explain…” to  “Which factors are associated with…” (line 75);

      -  “the environmental and anthropogenic factors influencing” to “the environmental and anthropogenic factors that are correlated with” (line 273);

      -  “underscoring the influence” to “underscoring the strong association” (line 282).

      (2) We respectfully disagree with the suggestion that an ecological niche modelling (ENM) approach is not appropriate for this work and the research question addressed therein. Ecological niche models are specifically designed to estimate the spatial distribution of the environmental suitability of species and pathogens, making them well suited to our research questions. In our study, we have also explicitly detailed the known limitations of ecological niche models in the Discussion section, in line with prior literature, to ensure their appropriate interpretation in the context of HPAI.

      (3) The environmental layers used in our models were restricted to those available at a global scale, as listed in Supplementary Information Resources S1 (https://github.com/sdellicour/h5nx_risk_mapping/blob/master/Scripts_%26_data/SI_Resource_S1.xlsx). Naturally, not all potentially relevant environmental factors could be included, but the selected layers are explicitly documented and only these were assessed for their importance. Despite this limitation, the performance metrics indicate that the models performed well, suggesting that the chosen covariates capture meaningful associations with HPAI occurrence at a global scale.

      Reviewer #1 (Public review):

      The authors aim to predict ecological suitability for transmission of highly pathogenic avian influenza (HPAI) using ecological niche models. This class of models identify correlations between the locations of species or disease detections and the environment. These correlations are then used to predict habitat suitability (in this work, ecological suitability for disease transmission) in locations where surveillance of the species or disease has not been conducted. The authors fit separate models for HPAI detections in wild birds and farmed birds, for two strains of HPAI (H5N1 and H5Nx) and for two time periods, pre- and post-2020. The authors also validate models fitted to disease occurrence data from pre-2020 using post-2020 occurrence data. I thank the authors for taking the time to respond to my initial review and I provide some follow-up below.

      Detailed comments:

      In my review, I asked the authors to clarify the meaning of "spillover" within the HPAI transmission cycle. This term is still not entirely clear: at lines 409-410, the authors use the term with reference to transmission between wild birds and farmed birds, as distinct to transmission between farmed birds. It is implied but not explicitly stated that "spillover" is relevant to the transmission cycle in farmed birds only. The sentence, "we developed separate ecological niche models for wild and domestic bird HPAI occurrences ..." could have been supported by a clear sentence describing the transmission cycle, to prime the reader for why two separate models were necessary.

      We respectfully disagree that the term “spillover” is unclear in the manuscript. In both the Methods and Discussion sections (lines 387-391 and 409-414), we explicitly define “spillover” as the introduction of HPAI viruses from wild birds into domestic poultry, and we distinguish this from secondary farm-to-farm transmission. Our use of separate ecological niche models for wild and domestic outbreaks reflects not only the distinction between primary spillover and secondary transmission, but also the fundamentally different ecological processes, surveillance systems, and management implications that shape outbreaks in these two groups. We will clarify this choice in the revised manuscript when introducing the separate models. Furthermore, on line 83, we will add “as these two groups are influenced by different ecological processes, surveillance biases, and management contexts”.

      I also queried the importance of (dead-end) mammalian infections to a model of the HPAI transmission risk, to which the authors responded: "While spillover events of HPAI into mammals have been documented, these detections are generally considered dead-end infections and do not currently represent sustained transmission chains. As such, they fall outside the scope of our study, which focuses on avian hosts and models ecological suitability for outbreaks in wild and domestic birds." I would argue that any infections, whether they are in dead-end or competent hosts, represent the presence of environmental conditions to support transmission so are certainly relevant to a niche model and therefore within scope. It is certainly understandable if the authors have not been able to access data of mammalian infections, but it is an oversight to dismiss these infections as irrelevant.

      We understand the Reviewer’s point, but our study was designed to model HPAI occurrence in avian hosts only. We therefore restricted our analysis to wild birds and domestic poultry, which represent the primary hosts for HPAI circulation and the focus of surveillance and control measures. While mammalian detections have been reported, they are outside the scope of this work.

      Correlative ecological niche models, including BRTs, learn relationships between occurrence data and covariate data to make predictions, irrespective of correlations between covariates. I am not convinced that the authors can make any "interpretation" (line 298) that the covariates that are most informative to their models have any "influence" (line 282) on their response variable. Indeed, the observation that "land-use and climatic predictors do not play an important role in the niche ecological models" (line 286), while "intensive chicken population density emerges as a significant predictor" (line 282) begs the question: from an operational perspective, is the best (e.g., most interpretable and quickest to generate) model of HPAI risk a map of poultry farming intensity?

      We agree that poultry density may partly reflect reporting bias, but we also consider it a meaningful predictor of HPAI risk. Its importance in our models is therefore expected. Importantly, our BRT framework does more than reproduce poultry distribution: it captures non-linear relationships and interactions with other covariates, allowing a more nuanced characterisation of risk than a simple poultry density map. Note also that our models distinguish between intensive and extensive chicken densities as well as duck density. Therefore, the output is not a “map of poultry farming intensity”.

      At line 282, we used the word “influence” while fully recognising that correlative models cannot establish causality. Indeed, in our analyses, “relative influence” refers to the importance metric produced by the BRT algorithm (Ridgeway, 2020), which measures correlative associations between environmental factors and outbreak occurrences. These scores are interpreted in light of the broader scientific literature; our interpretations therefore build on both our results and existing evidence, rather than on our models alone. However, in the next version of the paper, we will revise the sentence as: “underscoring the strong association of poultry farming practices with HPAI spread (Dhingra et al., 2016)”.

      I have more significant concerns about the authors' treatment of sampling bias: "We agree with the Reviewer's comment that poultry density could have potentially been considered to guide the sampling effort of the pseudo-absences to consider when training domestic bird models. We however prefer to keep using a human population density layer as a proxy for surveillance bias to define the relative probability to sample pseudo-absence points in the different pixels of the background area considered when training our ecological niche models. Indeed, given that poultry density is precisely one of the predictors that we aim to test, considering this environmental layer for defining the relative probability to sample pseudo-absences would introduce a certain level of circularity in our analytical procedure, e.g. by artificially increasing the influence of that particular variable in our models." The authors have elected to ignore a fundamental feature of distribution modelling with occurrence-only data: if we include a source of sampling bias as a covariate and do not include it when we sample background data, then that covariate would appear to be correlated with presence. They acknowledge this later in their response to my review: "...assuming a sampling bias correlated with poultry density would result in reducing its effect as a risk factor." In other words, the apparent predictive capacity of poultry density is a function of how the authors have constructed the sampling bias for their models. A reader of the manuscript can reasonably ask the question: to what degree is the model a model of HPAI transmission risk, and to what degree is the model a model of the observation process? 
The sentence at lines 474-477 is a helpful addition, however the preceding sentence, "Another approach to sampling pseudo-absences would have been to distribute them according to the density of domestic poultry," (line 474) is included without acknowledgement of the flow-on consequence to one of the key findings of the manuscript, that "...intensive chicken population density emerges as a significant predictor..." (line 282). The additional context on the EMPRES-i dataset at line 475-476 ("the locations of outbreaks ... are often georeferenced using place name nomenclatures") is in conflict with the description of the dataset at line 407 ("precise location coordinates"). Ultimately, the choices that the authors have made are entirely defensible through a clear, concise description of model features and assumptions, and precise language to guide the reader through interpretation of results. I am not satisfied that this is provided in the revised manuscript.

      We thank the Reviewer for this important point. To address it, we compared model predictive performance and covariate relative influences obtained when pseudo-absences were weighted by poultry density versus human population density (Author response table 1). The results show that differences between the two approaches are marginal, both in predictive performance (ΔAUC ranging from -0.013 to +0.002) and in the ranking of key predictors (see below Author response images 1 and 2). For instance, intensive chicken density consistently emerged as an important predictor regardless of the bias layer used.

      Note: the comparison was conducted using a simplified BRT configuration for computational efficiency (fewer trees, fixed 5-fold random cross-validation, and standardised parameters). Therefore, absolute values of AUC and variable importance may differ slightly from those in the manuscript, but the relative ranking of predictors and the overall conclusions remain consistent.

      Given these small differences, we retained the approach using human population density. We agree that poultry density partly reflects surveillance bias as well as true epidemiological risk, and we will clarify this in the revised manuscript by noting that the predictive role of poultry density reflects both biological processes and surveillance systems. Furthermore, on line 289, we will add “We note, however, that intensive poultry density may reflect both surveillance intensity and epidemiological risk, and its predictive role in our models should be interpreted in light of both processes”.
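
      To make the comparison above concrete, the following minimal Python sketch shows the general idea of drawing pseudo-absences with a probability proportional to a bias layer and of computing a rank-based AUC, so that a ΔAUC between two bias layers can be obtained. It is an illustration under stated assumptions and not the analysis code of the study: the function names `sample_pseudo_absences` and `auc`, the flat list of per-pixel weights, and sampling with replacement are all hypothetical choices made for this example.

```python
import random


def sample_pseudo_absences(weights, n, seed=0):
    """Draw n background pixel indices with probability proportional to `weights`.

    `weights` stands in for a bias layer (e.g. human population density or
    poultry density per pixel). Sampling is with replacement (an assumption).
    """
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n)


def auc(presence_scores, absence_scores):
    """Rank-based AUC: chance that a presence scores above a pseudo-absence."""
    wins = 0.0
    for p in presence_scores:
        for q in absence_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(presence_scores) * len(absence_scores))


# Toy example: pixels with zero weight are never drawn as pseudo-absences.
drawn = sample_pseudo_absences([0.0, 0.0, 1.0], n=4)
# Difference in AUC between two hypothetical sets of pseudo-absence scores.
delta_auc = auc([0.9, 0.4], [0.5, 0.1]) - auc([0.9, 0.4], [0.3, 0.1])
```

      In such a setup, changing the bias layer changes which background pixels the presences are contrasted against, which is why the relative influence of a covariate correlated with the bias layer can shift between the two weighting schemes.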

      Author response table 1.

      Comparison of model predictive performances (AUC) between pseudo-absence sampling weighted by poultry density and by human population density across host groups, virus types, and time periods. Differences in AUC values are shown as the value for poultry-weighted minus human-weighted pseudo-absences.

      Author response image 1.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for domestic bird outbreaks. Results are shown for four datasets: H5N1 (<2020), H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      Author response image 2.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for wild bird outbreaks. Results are shown for three datasets: H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      The authors have slightly misunderstood my comment on "extrapolation": I referred to "environmental extrapolation" in my review without being particularly explicit about my meaning. By "environmental extrapolation", I meant to ask whether the models were predicting to environments that are outside the extent of environments included in the occurrence data used in the manuscript. The authors appear to have understood this to be a comment on geographic extrapolation, or predicting to areas outside the geographic extent included in occurrence data, e.g.: "For H5Nx post-2020, areas of high predicted ecological suitability, such as Brazil, Bolivia, the Caribbean islands, and Jilin province in China, likely result from extrapolations, as these regions reported few or no outbreaks in the training data" (lines 195-197). Is the model extrapolating in environmental space in these regions? This is unclear. I do not suggest that the authors should carry out further analysis, but the multivariate environmental similarity surface (MESS; see Elith et al., 2010) is a useful tool to visualise environmental extrapolation and aid model interpretation.

      On the subject of "extrapolation", I am also concerned by the additions at lines 362-370: "...our models extrapolate environmental suitability for H5Nx in wild birds in areas where few or no outbreaks have been reported. This discrepancy may be explained by limited surveillance or underreporting in those regions." The "discrepancy" cited here is a feature of the input dataset, a function of the observation distribution that should be captured in pseudo-absence data. The authors state that Kazakhstan and Central Asia are areas of interest, and that the environments in this region are outside the extent of environments captured in the occurrence dataset, although it is unclear whether "extrapolation" is informed by a quantitative tool like a MESS or judged by some other qualitative test. The authors then cite Australia as an example of a region with some predicted suitability but no HPAI outbreaks to date, however this discussion point is not linked to the idea that the presence of environmental conditions to support transmission need not imply the occurrence of transmission (as in the addition, "...spatial isolation may imply a lower risk of actual occurrences..." at line 214). Ultimately, the authors have not added any clear comment on model uncertainty (e.g., variation between replicated BRTs) as I suggested might be helpful to support their description of model predictions.

      Many thanks for the clarification. Indeed, we interpreted your previous comments in terms of geographic extrapolations. We thank the Reviewer for these observations. We will adjust the wording to further clarify that predictions of ecological suitability in areas with few or no reported outbreaks (e.g., Central Asia, Australia) are not model errors but expected extrapolations, since ecological suitability does not imply confirmed transmission (for instance, on Line 362: “our models extrapolate environmental suitability” will be changed to “Interestingly, our models extrapolate geographical”). These predictions indicate potential environments favorable to circulation if the virus were introduced.

      In our study, model uncertainty is formally assessed when comparing the predictive performances of our models (Fig. S3, Table S1), the relative influence (Table S3) and response curves (Fig. 2) associated with each environmental factor (Table S2). All of these results confirm good convergence between the replicates. Finally, we indeed did not use a quantitative tool such as a MESS to assess extrapolation but relied on a qualitative interpretation of model outputs.
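
      For readers unfamiliar with the MESS mentioned by the Reviewer, the sketch below illustrates how such a score could be computed for a single prediction point, following the definition in Elith et al. (2010) as we understand it. It was not part of the analyses reported here; the function names, the dictionary-based data layout and the toy covariates `t` and `r` are assumptions for illustration, and each covariate is assumed to vary within the reference set.

```python
def mess_variable(p, ref):
    """Similarity of value `p` to the reference values `ref` for one covariate."""
    lo, hi = min(ref), max(ref)
    span = hi - lo  # assumed > 0 (the covariate varies in the reference set)
    # f: percentage of reference values strictly smaller than p
    f = 100.0 * sum(1 for r in ref if r < p) / len(ref)
    if f == 0:
        return 100.0 * (p - lo) / span  # negative when p is below the range
    if f <= 50:
        return 2.0 * f
    if f < 100:
        return 2.0 * (100.0 - f)
    return 100.0 * (hi - p) / span      # negative when p is above the range


def mess(point, reference):
    """MESS of one point: the minimum similarity over all covariates.

    `point` maps covariate name -> value; `reference` is a list of such dicts
    (e.g. covariate values at the training occurrences and pseudo-absences).
    """
    return min(
        mess_variable(point[k], [row[k] for row in reference]) for k in point
    )


reference = [{"t": i, "r": i} for i in range(5)]
inside = mess({"t": 2, "r": 2}, reference)    # within the sampled range
outside = mess({"t": 2, "r": -1}, reference)  # "r" lies below the sampled range
```

      Negative values indicate that at least one covariate lies outside the range covered by the reference data, i.e. environmental rather than purely geographic extrapolation.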

      All of my criticisms are, of course, applied with the understanding that niche modelling is imperfect for a disease like HPAI, and that data may be biased/incomplete, etc.: these caveats are common across the niche modelling literature. However, if language around the transmission cycle, the niche, and the interpretation of any of the models is imprecise, which I find it to be in the revised manuscript, it undermines all of the science that is presented in this work.

      We respectfully disagree with this comment. The scope of our study and the methods employed are clearly defined in the manuscript, and the limitations of ecological niche modelling in this context are explicitly acknowledged in the Discussion section. While we appreciate the Reviewer’s concern, the comment does not provide specific examples of unclear or imprecise language regarding the transmission cycle, niche, or interpretation of the models. Without such examples, it is difficult to identify further revisions that would improve clarity.

      Reviewer #2 (Public review):

      The geographic range of highly pathogenic avian influenza cases changed substantially around the period 2020, and there is much interest in understanding why. Since 2020 the pathogen irrupted in the Americas and the distribution in Asia changed dramatically. This study aimed to determine which spatial factors (environmental, agronomic and socio-economic) explain the change in numbers and locations of cases reported since 2020 (2020--2023). That's a causal question which they address by applying a correlative environmental niche modelling (ENM) approach to the avian influenza case data before (2015--2020) and after 2020 (2020--2023) and separately for confirmed cases in wild and domestic birds. To address their questions they compare the outputs of the respective models, and those of the first global model of the HPAI niche published by Dhingra et al 2016.

      We do not agree with this comment. In the manuscript, it is well established that we are quantitatively assessing factors that are associated with occurrence data before and after 2020. We do not claim to determine causality. One sentence of the Introduction section (lines 75-76) could be confusing, so we intend to modify it in the final revision of our manuscript.

      ENM is a correlative approach useful for extrapolating understandings based on sparse geographically referenced observational data over un- or under-sampled areas with similar environmental characteristics in the form of a continuous map. In this case, because the selected covariates about land cover, use, population and environment are broadly available over the entire world, modelled associations between the response and those covariates can be projected (predicted) back to space in the form of a continuous map of the HPAI niche for the entire world.

      We fully agree with this assessment of ENM approaches.

      Strengths:

      The authors are clear about expected bias in the detection of cases, such as geographic variation in surveillance effort (testing of symptomatic or dead wildlife, testing domestic flocks) and in general more detections near areas of higher human population density (because if a tree falls in a forest and there is no-one there, etc), and take steps to ameliorate those. The authors use boosted regression trees to implement the ENM, which typically feature among the best performing models for this application (also known as habitat suitability models). They ran replicate sets of the analysis for each of their model targets (wild/domestic x pathogen variant), which can help produce stable predictions. Their code and data are provided, though I did not verify that the work was reproducible.

      The paper can be read as a partial update to the first global model of H5Nx transmission by Dhingra and others published in 2016 and explicitly follows many methodological elements. Because they use the same covariate sets as used by Dhingra et al 2016 (including the comparisons of the performance of the sets in spatial cross-validation) and for both time periods of interest in the current work, comparison of model outputs is possible. The authors further facilitate those comparisons with clear graphics and supplementary analyses and presentation. The models can also be explored interactively at a weblink provided in text, though it would be good to see the model training data there too.

      The authors' comparison of ENM model outputs generated from the distinct HPAI case datasets is interesting and worthwhile, though for me, only as a response to differently framed research questions.

      Weaknesses:

      This well-presented and technically well-executed paper has one major weakness to my mind. I don't believe that ENM models were an appropriate tool to address their stated goal, which was to identify the factors that "explain" changing HPAI epidemiology.

      Here is how I understand and unpack that weakness:

      (1) Because of their fundamentally correlative nature, ENMs are not a strong candidate for exploring or inferring causal relationships.

      (2) Generating ENMs for a species whose distribution is undergoing broad scale range change is complicated and requires particular caution and nuance in interpretation (e.g., Elith et al, 2010). An important general assumption of environmental niche models is that the target species is at some kind of distributional equilibrium (at time scales relevant to the model application). In practice that means the species has had an opportunity to reach all suitable habitats and therefore its absence from some can be interpreted as either unfavourable environment or interactions with other species. Here data sets for the response (H5N1 or H5Nx case data in domestic or wild birds) were divided into two periods, 2015--2020 and 2020--2023, based on the rationale that the geographic locations and host-species profile of cases detected in the latter period were suggestive of changed epidemiology. In comparing outputs from multiple ENMs for the same target from distinct time periods the authors are expertly working in, or even dancing around, what is a known grey area, and they need to make the necessary assumptions and caveats obvious to readers.

      We thank the Reviewer for this observation. First, we constrained pseudo-absence sampling to countries and regions where outbreaks had been reported, reducing the risk of interpreting non-affected areas as environmentally unsuitable. Second, we deliberately split the outbreak data into two periods (2015-2020 and 2020-2023) because we do not assume a single stable equilibrium across the full study timeframe. This division reflects known epidemiological changes around 2020 and allows each period to be modeled independently. Within each period, ENM outputs are interpreted as associations between outbreaks and covariates, not as equilibrium distributions. Finally, by testing prediction across periods, we assessed both niche stability and potential niche shifts. These clarifications will be added to the manuscript to make our assumptions and limitations explicit.
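
      The cross-period test described above can be illustrated with a small sketch. The response does not state which metric was used, so the rank-based AUC below is an assumption, and the suitability scores are hypothetical values standing in for predictions of a pre-2020 model at post-2020 presence and pseudo-absence points.

```python
def auc(scores_presence, scores_absence):
    """Rank-based AUC: probability that a presence point receives a higher
    suitability score than a pseudo-absence point (ties count as 0.5)."""
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))


# Hypothetical scores from a model trained on 2015-2020 data,
# evaluated at 2020-2023 presence and pseudo-absence locations.
post_presence = [0.9, 0.8, 0.4]
post_absence = [0.7, 0.3, 0.2]

cross_period_auc = auc(post_presence, post_absence)
print(round(cross_period_auc, 3))
```

      A cross-period value close to the within-period value would be consistent with a stable set of associations, while a marked drop would point towards a shift; either reading remains correlative.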

      Line 66, we will add: “Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution. To account for this, we analysed two distinct time periods (2015-2020 and 2020-2023).”

      Line 123, we will revise “These findings underscore the ability of pre-2020 models in forecasting the recent geographic distribution of ecological suitability for H5Nx and H5N1 occurrences” to “These results suggest that pre-2020 models captured broad patterns of suitability for H5Nx and H5N1 outbreaks, while post-2020 models provided a closer fit to the more recent epidemiological situation”.

      (3) To generate global prediction maps via ENM, only variables that exist at appropriate resolution over the desired area can be supplied as covariates. What processes could influence changing epidemiology of a pathogen and are there covariates that represent them? Introduction to a new geographic area (continent) with naive population, immunity in previously exposed populations, control measures to limit spread such as vaccination or destruction of vulnerable populations or flocks? Might those control measures be more or less likely depending on the country as a function of its resources and governance? There aren't globally available datasets that speak to those factors, so the question is not why were they omitted but rather was the authors' decision to choose ENMs given their question justified? How valuable are insights based on patterns of correlation change when considering different temporal sets of HPAI cases in relation to a common and somewhat anachronistic set of covariates?

      We agree that the ecological niche models trained in our study are limited to environmental and host factors, as described in the Methods section with the selection of predictors. While such models cannot capture causality or represent processes such as immunity, control measures, or governance, they remain a useful tool for identifying broad associations between outbreak occurrence and environmental context. Our study cannot infer the full mechanisms driving changes in HPAI epidemiology, but it does provide a globally consistent framework to examine how associations with available covariates vary across time periods.

      (4) In general the study is somewhat incoherent with respect to time. Though the case data come from different time periods, each response dataset was modelled separately using exactly the same covariate dataset that predated both sets. That decision should be understood as a strong assumption on the part of the authors that conditions the interpretation: the world (as represented by the covariate set) is immutable, so the model has to return different correlative associations between the case data and the covariates to explain the new data. While the world represented by the selected covariates *may* be relatively stable (could be statistically confirmed), what about the world not represented by the covariates (see point 3)?

      We used the same covariate layers for both periods, which indeed assumes that these environmental and host factors are relatively stable at the global scale over the short timeframe considered. We believe this assumption is reasonable, as poultry density, land cover, and climate baselines do not change drastically between 2015 and 2023 at the resolution of our analysis. We agree, however, that unmeasured processes such as control measures, immunity, or governance may have changed during this time and are not captured by our covariates.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      - Line 400-401: "over the 2003-2016 periods" has an extra "s"; "two host species" (with reference to wild and domestic birds) would be more precise as "two host groups".

      - Remove comma line 404

      Many thanks for these comments, we have modified the text accordingly.

      Reviewer #2 (Recommendations for the authors):

      Most of my work this round is encapsulated in the public part of the review.

      The authors responded positively to the review efforts from the previous round, but I was underwhelmed with the changes to the text that resulted. Particularly in regard to limiting assumptions - the way that they augmented the text to refer to limitations raised in review downplayed the importance of the assumptions they've made. So they acknowledge the significance of the limitation in their rejoinder, but in the amended text merely note the limitation without giving any sense of what it means for their interpretation of the findings of this study.

      The abstract and findings are essentially unchanged from the previous draft.

      I still feel the near causal statements of interpretation about the covariates are concerning. These models really are not a good candidate for supporting the inference that they are making and there seem to be very strong arguments in favour of adding covariates that are not globally available.

      We never claimed causal interpretation, and we have consistently framed our analyses in terms of associations rather than mechanisms. We acknowledge that one phrasing in the research questions (“Which factors can explain…”) could be misinterpreted, and we are correcting this in the revised version to read “Which factors are associated with…”. Our approach follows standard ecological niche modelling practice, which identifies statistical associations between occurrence data and covariates. As noted in the Discussion section, these associations should not be interpreted as direct causal mechanisms. Finally, all interpretive points in the manuscript are supported by published literature, and we consider this framing both appropriate and consistent with best practice in ecological niche modelling (ENM) studies.

      We assessed predictor contributions using the “relative influence” metric, the terminology reported by the R package “gbm” (Ridgeway, 2020). This metric quantifies the contribution of each variable to model fit across all trees, rescaled to sum to 100%, and should be interpreted as an association rather than a causal effect.
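
      As a minimal sketch of the rescaling step mentioned above, and not a reproduction of the "gbm" internals, the example below assumes that each predictor has a raw total of split improvements summed over all trees and rescales those totals to sum to 100. The predictor names and raw values are hypothetical.

```python
def relative_influence(raw_improvements):
    """Rescale per-predictor improvement totals so that they sum to 100."""
    total = sum(raw_improvements.values())
    if total <= 0:
        raise ValueError("total improvement must be positive")
    return {name: 100.0 * value / total
            for name, value in raw_improvements.items()}


# Hypothetical raw totals (summed over all trees of a boosted model).
raw = {
    "human_population_density": 12.0,
    "chicken_density": 6.0,
    "open_water": 2.0,
}

influence = relative_influence(raw)
for name, value in influence.items():
    print(f"{name}: {value:.1f}%")
```

      Because the values are shares of model fit, a high percentage indicates that a predictor was used often and effectively in the trees, not that it drives outbreaks.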

      L65-66 The general difficulty of interpreting ENM output with range-shifting species should be cited here to alert readers that they should not blithely attempt what follows at home.

      I believe that their analysis is interesting and technically very well executed, so it has been a disappointment and hard work to write this assessment. My rough-cut last paragraph of a reframed intro would go something like - there are many reasons in the literature not to do what we are about to do, but here's why we think it can be instructive and informative, within certain guardrails.

      To acknowledge this comment and the previous one, we revised lines 65-66 to: “However, recent outbreaks raise questions about whether earlier ecological niche models still accurately predict the current distribution of areas ecologically suitable for the local circulation of HPAI H5 viruses. Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution.”

      We respectfully disagree with the Reviewer’s statement that “there are many reasons in the literature not to do what we are about to do”. All modeling approaches, including mechanistic ones, have limitations, and the literature is clear on both the strengths and constraints of ecological niche models. Our manuscript openly acknowledges these limits and frames our findings accordingly. We therefore believe that our use of an ENM approach is justified and contributes valuable insights within these well-defined boundaries.

      Reference: Ridgeway, G. (2007). Generalized Boosted Models: A guide to the gbm package. Update, 1(1), 2007.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      I am concerned by the authors' conceptualisation of "niche" within the manuscript. Is the "niche" we are modelling the niche of the pathogen itself? The niche of the (wild) bird host species as a group? The niche of HPAI transmission within (wild) bird host species (i.e., an intersection of pathogen and bird niches)? Or the niche of HPAI transmission in poultry? The precise niche being modelled should be clarified in the Introduction or early in the Methods of the manuscript. The first two definitions of niche listed above are relevant, but separate from the niche modelled in the manuscript - this should be acknowledged.

      We acknowledge that these concepts were not defined clearly enough in the previous version of our manuscript, and we have now included an explicit definition in the fourth paragraph of the Introduction section: “We developed separate ecological niche models for wild and domestic bird HPAI occurrences, these models thus predicting the ecological suitability for the risk of local viral circulation leading to the detection of HPAI occurrences within each host group (rather than the niche of the virus or the host species alone).”

      The authors should consider the precise transmission cycle involved in each HPAI case: "index cases" in farmed poultry, caused by "spillover" from wild birds, are relevant to the wildlife transmission cycle, while the ecological conditions coinciding with subsequent transmission in farmed poultry are likely to be fundamentally different. (For example, subsequent transmission is not conditional on the presence of wild birds.) Modelling these two separate, but linked, transmission cycles together may omit important nuances from the modelling framework.

      We thank the Reviewer for highlighting the distinction between primary (wild-to-domestic) and secondary (farm-to-farm) transmission cycles. Our modelling framework was designed to assess the ecological suitability of HPAI occurrences in wild and domestic birds separately. In the domestic poultry models, the response variable is the set of confirmed outbreak records, which does not distinguish between cases resulting from primary or secondary infections.

      One of the aims of the study is to evaluate the spatial distribution of areas ecologically suitable for local H5N1/x circulation leading to either domestic or wild bird cases, i.e. to identify environmental conditions where the virus may have persisted or spread, whether as a result of introduction by wild birds or farm-to-farm transmission. Introducing mechanistic distinctions in the response variable would not necessarily improve or affect the ecological suitability maps, since each type of transmission is likely to be associated with different covariates that are included in the models.

      Also, the EMPRES-i database does not indicate whether each record corresponds to an index case or a secondary transmission event, so in practice it would not be possible to produce two different models. However, we agree that distinguishing between types of transmission is an interesting perspective for future research. This could be explored, for example, by mapping interfaces between wild and domestic bird populations or by inferring outbreak transmission trees using genomic data when available.

      To avoid confusion, we now explicitly clarify this aspect in the Materials and Methods section: “It is important to note that the EMPRES-i database does not distinguish between index cases (e.g., primary spillover from wild birds) and secondary farm-to-farm transmissions. As such, our ecological niche models are trained on confirmed HPAI outbreaks in poultry that may result from different transmission dynamics — including both initial introduction events influenced by environmental factors and subsequent spread within poultry systems.”

      We now also address this limitation in the Discussion section: “Finally, our models for domestic poultry do not distinguish between primary introduction events (e.g., spillover from wild birds) and secondary transmission between farms due to limitations in the available surveillance data. While environmental factors likely influence the risk of initial spillover events, secondary spread is more often driven by anthropogenic factors such as biosecurity practices and poultry trade, which are not included in our current modelling framework.”

      The authors should clarify the meaning of "spillover" within the HPAI transmission cycle: if spillover transmission is from wild birds to farmed poultry, then subsequent transmission in poultry is separate from the wildlife transmission cycle. This is particularly relevant to the Discussion paragraph beginning at line 244: does "farm to farm transmission" have a distinct ecological niche to transmission between wild birds, and transmission between wild birds and farmed birds? And while there has been a spillover of HPAI to mammals, could the authors clarify that these detections are dead-end? And not represented in the dataset? Dhingra et al., 2016 comment on the contrast between models of "directly transmitted" pathogens, such as HPAI, and vector-borne diseases: for vector-borne diseases, "clear eco-climatic boundaries of vectors can be mapped", whereas "HPAI is probably not as strongly environmentally constrained". This is an important piece of nuance in their Discussion and a comment to a similar effect may be of use in this manuscript.

      Following the Reviewer’s previous comment, we have now added clarifications in the Methods and Discussion sections defining spillover as the transmission of HPAI viruses from wild birds to domestic poultry (index cases), and secondary transmission as onward spread between farms. As mentioned in our answer above, we now emphasise that our models do not distinguish these dynamics, which are likely to be influenced by different drivers — ecological in the case of spillover, and often anthropogenic (e.g., poultry trade movement, biosecurity) in the case of farm-to-farm transmission.

      The discussion regarding farm-to-farm transmission and spillovers is indeed an interpretation derived from the covariates analysis (see the second paragraph in the Discussion section). Specifically, we observed a stronger association between HPAI occurrences and domestic bird density after 2020, which may suggest that secondary infections (e.g., farm-to-farm transmission) became more prominent or more frequently reported. We however acknowledge that our data do not allow us to distinguish primary introductions from secondary transmission events, and we have added a sentence to explicitly clarify this: “However, this remains an interpretation, as the available data do not allow us to distinguish between index cases and secondary transmission events.”

      We thank the Reviewer for raising the point of mammalian infections. While spillover events of HPAI into mammals have been documented, these detections are generally considered dead-end infections and do not currently represent sustained transmission chains. As such, they fall outside the scope of our study, which focuses on avian hosts and models ecological suitability for outbreaks in wild and domestic birds. However, we agree that future work could explore the spatial overlap between mammalian outbreak detections and ecological suitability maps for wild birds to assess whether such spillovers may be linked to localised avian transmission dynamics.

      Finally, we have added a comment about the differences between pathogens strongly constrained by the environment and HPAI: “This suggests that HPAI H5Nx is not as strongly environmentally constrained as vector-borne pathogens, for which clear eco-climatic boundaries (e.g., vector borne diseases) can be mapped (Dhingra et al., 2016).” This aligns with the interpretation provided by Dhingra and colleagues (2016) and helps contextualise the predictive limitations of ecological niche models for directly transmitted pathogens like HPAI.

      There are several places where some simple clarification of language could answer my questions related to ecological niches. For example, on line 74, "the ecological niche" should be followed by "of the pathogen", or "of HPAI transmission in wild birds", or some other qualifier that is most appropriate to the Authors' conceptualisation of the niche modelled in the manuscript. Similarly, in the following sentence, "areas at risk" could be followed by "of transmission in wild birds", to make the transmission cycle that is the subject of modelling clear to the reader. On line 83, it is not clear who or what is the owner of "their ecological niches": is this "poultry and wild birds", or the pathogen?

      We agree with that suggestion and have now modified the related part of the text accordingly (e.g., “areas at risk for local HPAI circulation” and “of HPAI in wild or domestic birds”).

      I am concerned by the authors' treatment of sampling bias in their BRT modelling framework. If we are modelling the niche of HPAI transmission, we would expect places that are more likely to be subject to disease surveillance to be represented in the set of locations where the disease has been detected. I do not agree that pseudo-absence points are sampled "to account for the lack of virus detection in some areas" - this description is misleading and does not match the following sentence ("pseudo-absence points sampled ... to reflect the greater surveillance efforts ..."). The distribution of pseudo-absences should aim to capture the distribution of probable disease surveillance, as these data act as a stand-in for missing negative surveillance records. It is sensible that pseudo-absences for disease detection in wild birds are sampled proportionately to human population density, as the disease is detected in dead wild birds, which are more likely to be identified close to areas of human occupation (as stated on line 163). However, I do not agree that the same applies to poultry - the density of farmed poultry is likely to be a better proxy for surveillance intensity in farmed birds. Human population density and farmed poultry density may be somewhat correlated (i.e., both are low in remote areas), but poultry density is likely to be higher in rural areas, which are assumed to have relatively lower surveillance intensity under the current approach. The authors allude to this in the Discussion: "monitoring areas with high intensive chicken densities ... remains crucial for the early detection and management of HPAI outbreaks".

      We agree with the Reviewer's comment that poultry density could have been used to guide the sampling of pseudo-absences when training the domestic bird models. We however prefer to keep using a human population density layer as a proxy for surveillance bias to define the relative probability of sampling pseudo-absence points in the different pixels of the background area considered when training our ecological niche models. Indeed, given that poultry density is precisely one of the predictors that we aim to test, using this environmental layer to define the relative probability of sampling pseudo-absences would introduce a certain level of circularity in our analytical procedure, e.g. by artificially increasing the influence of that particular variable in our models.
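
      To make the density-weighted sampling concrete, here is a minimal sketch in which each background pixel is drawn with probability proportional to its human population density. The pixel identifiers, density values, and function name are hypothetical; sampling with replacement is a simplification, and the spatial filtering described elsewhere in this response is not implemented.

```python
import random


def sample_pseudo_absences(pixel_density, n_points, seed=0):
    """Draw pixel ids with probability proportional to human population
    density (sampling with replacement, for simplicity)."""
    rng = random.Random(seed)
    pixel_ids = list(pixel_density)
    weights = [pixel_density[p] for p in pixel_ids]
    return rng.choices(pixel_ids, weights=weights, k=n_points)


# Hypothetical background pixels with population density (people per km2).
density = {"pixel_a": 0.0, "pixel_b": 50.0, "pixel_c": 950.0}

points = sample_pseudo_absences(density, n_points=1000)
print({p: points.count(p) for p in density})
```

      Substituting a poultry density layer for the weights would give the alternative raised by the Reviewer, which is where the circularity concern arises if poultry density is also a tested predictor.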

      Furthermore, it is also worth noting that, to better account for variations in surveillance intensity, we also adjusted the sampling effort by allocating pseudo-absences in proportion to the number of confirmed outbreaks per administrative unit (country or sub-national regions for Russia and China). This approach aimed to reduce bias caused by uneven reporting and surveillance efforts between regions. Additionally, we restricted model training to countries or regions with a minimum surveillance threshold (at least five confirmed outbreaks per administrative unit). Therefore, both presence and pseudo-absence points originated from areas with more consistent surveillance data.
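
      The allocation rule described above can be sketched as follows, assuming a fixed total number of pseudo-absences to distribute. The unit names, outbreak counts, and function name are hypothetical; only the threshold of five confirmed outbreaks comes from the text.

```python
def allocate_pseudo_absences(outbreaks_per_unit, total_points, min_outbreaks=5):
    """Split a pseudo-absence budget across administrative units in
    proportion to confirmed outbreaks, skipping units below the threshold."""
    eligible = {unit: n for unit, n in outbreaks_per_unit.items()
                if n >= min_outbreaks}
    total = sum(eligible.values())
    if total == 0:
        return {}
    # Rounding means the allocated total can differ slightly from the budget.
    return {unit: round(total_points * n / total)
            for unit, n in eligible.items()}


# Hypothetical confirmed outbreak counts per country or admin-1 unit.
counts = {"unit_a": 60, "unit_b": 40, "unit_c": 3}

allocation = allocate_pseudo_absences(counts, total_points=1000)
print(allocation)
```

      Here unit_c is excluded because it falls below the threshold, so neither presences nor pseudo-absences from that unit would enter model training.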

      We acknowledge in the Materials and Methods section that the approach proposed by the Reviewer could have been used: “Another approach to sampling pseudo-absences would have been to distribute them according to the density of domestic poultry.” Finally, our approach is also justified in our response to the next comment of the Reviewer.

      Having written my review, including the paragraph above, I briefly scanned Dhingra et al., and found that they provide justification for the use of human population density to sample pseudoabsences in farmed birds: "the Empres-i database compiles outbreak locations data from very heterogeneous sources and in the absence of explicit GPS location data, the geo-referencing of individual cases is often through the use of place name gazetteers that will tend to force the outbreak location populated place, rather in the exact location of the farm where the disease was found, which would introduce a bias correlated with human population density." This context is entirely missing from the manuscript under review, however, I maintain the comment in the paragraph above - have the Authors trialled sampling pseudo-absences from poultry density layers?

      We agree with the Reviewer’s comment and have now added this precision in the Materials and Methods section (in the third paragraph dedicated to ecological niche modelling): “However, as pointed out by Dhingra and colleagues (2016), the locations of outbreaks in the EMPRES-i database are often georeferenced using place name nomenclatures due to a lack of accurate GPS data, which could introduce a spatial bias towards populated areas.”

      The authors indirectly acknowledge the role of sampling bias in model predictions at line 163, however, this point could be clearer: there is sampling bias in the set of locations where HPAI has been observed and failure to adequately replicate this sampling bias in pseudo-absence data could lead covariates that are correlated with the observation distribution to appear to be correlated with the target distribution. This point is alluded to but should be clearly acknowledged to allow the reader to appropriately interpret your results. I understand the point being made on line 163 is that surveillance of HPAI in wild birds has become more structured and less opportunistic over time - if this is the case, a statement to this effect could replace "which could influence earlier data sets", which is a little ambiguous. The Authors acknowledge the role of sampling bias in lines 241-242 - this may be a good place to remind the reader that they have attempted to incorporate sampling bias through the selection of their pseudoabsence dataset, particularly for wild bird models.

      We thank the Reviewer for this comment. We have now clarified in the text that observed data on HPAI occurrence are inherently influenced by heterogeneous surveillance efforts and that failure to replicate this bias in pseudo-absence sampling could effectively lead to misleading correlations with covariates associated with surveillance effort rather than true ecological suitability. We have now rephrased the related sentence as follows: “This decline may indicate a reduced bias in observation data: typically, dead wild birds are more frequently found near human-populated areas due to opportunistic detections, whereas more recent surveillance efforts have become increasingly proactive (Giacinti et al., 2024).”

      Dhingra et al. aimed to account for the effect of mass vaccination of birds in China. This does not appear to be included in the updated models - is this a relevant covariate to consider in updated models? Are the models trained on pre-2020 data predicting to post-2020 given the same presence dataset as previous models? It may be helpful to provide a comment on this if we consider the pre-2020 models in this work to be representative of pre-2020 models as a cohort. Given the framing of the manuscript as an update to Dhingra et al., it may be useful for the authors to briefly summarise any differences between the existing models and updated models. Dhingra et al., also examine spatial extrapolation, which is not addressed here. Environmental extrapolation may be a useful metric to consider: are there areas where models are extrapolating that are predicted to be at high risk of HPAI transmission? Finally, they also provide some inset panels on global maps of model predictions - something similar here may also be useful.

      We thank the Reviewer for these comments. Vaccination coverage is indeed a relevant covariate for HPAI suitability in domestic birds. However, we did not include this variable in our updated models for two reasons. First, comprehensive vaccination data were only available for China, so it is not possible to include this variable in a global model. Second, available data were outdated and vaccination strategies can vary substantially over time.

      We however agree with the Reviewer that the Materials and Methods section did not clearly describe the differences with Dhingra et al. (2016), and we now detail these differences at the beginning of the Materials and Methods section: “Our approach is similar to the one implemented by Dhingra and colleagues (2016). While Dhingra et al. (2016) developed their models only for domestic birds over the 2003-2016 periods, our models were developed for two host species separately (wild and domestic birds) and for two time periods (2016-2020 and 2020-2023).”

      We also detail the main difference concerning the pseudo-absence sampling: Dhingra and colleagues (2016) used human population density to sample pseudo-absences to reflect potential surveillance bias and also account for spatial filtering (min/max distances from presence). We adopted a similar strategy but also incorporated outbreak count per country or province (in the case of China and Russia) into the pseudo-absence sampling process to further account for within-country surveillance heterogeneity. We have now added these specifications in the Materials and Methods section: “To account for heterogeneity in AIV surveillance and minimise the risk of sampling pseudo-absences in poorly monitored regions, we restricted our analysis to countries (or administrative level 1 units in China and Russia) with at least five confirmed outbreaks. Unlike Dhingra et al. (2016), who sampled pseudo-absences across a broader global extent, our sampling was limited to regions with demonstrated surveillance activity. In addition, we adjusted the density of pseudo-absence points according to the number of reported outbreaks in each country or admin-1 unit, as a proxy for surveillance effort, an approach not implemented in this previous study.”

      We have now also provided a comparison between the different outputs, particularly in the Results section: “Our findings were overall consistent with those previously reported by Dhingra and colleagues (Dhingra et al., 2016), who used data from January 2004 to March 2015 for domestic poultry. However, some differences were noted: their maps identified higher ecological suitability for H5 occurrences before 2016 in North America, West Africa, eastern Europe, and Bangladesh, while our maps mainly highlight ecologically suitable regions in China, South-East Asia, and Europe (Fig. S5). In India, analyses consistently identified high ecologically suitable areas for the risk of local H5Nx and H5N1 circulation for the three time periods (pre-2016, 2016-2020, and post-2020). Similar to the results reported by Dhingra and colleagues, we observed an increase in the ecological suitability estimated for H5N1 occurrence in South America's domestic bird populations post-2020. Finally, Dhingra and colleagues identified high suitability areas for H5Nx occurrence in North America, which are predicted to be associated with a low ecological suitability in the 2016-2020 models.”

      We acknowledge that some regions predicted as highly suitable correspond to areas where extrapolation likely occurs due to limited or no recorded outbreaks. We have now added these specifications when discussing the resulting suitability maps obtained for domestic birds: “For H5Nx post-2020, areas of high predicted ecological suitability, such as Brazil, Bolivia, the Caribbean islands, and Jilin province in China, likely result from extrapolations, as these regions reported few or no outbreaks in the training data”, and, for wild birds: “Some of the areas with high predicted ecological suitability reflect the result of extrapolations. This is particularly the case in coastal regions of West and North Africa, the Nile Basin, Central Asia (Kyrgyzstan, Tajikistan, Uzbekistan), Brazil (including the Amazon and coastal areas), southern Australia, and the Caribbean, where ecological conditions are similar to those in areas where outbreaks are known to occur but where records of outbreaks are still rare.”

      For wild birds (H5Nx, post-2020), high ecological suitability was predicted along the West and North African coasts, the Nile basin, Central Asia (e.g., Kyrgyzstan, Tajikistan, Uzbekistan), the Brazilian coast and Amazon region, Caribbean islands, southern Australia, and parts of Southeast Asia. Ecological suitability estimated in these regions may directly result from extrapolations and should therefore be interpreted cautiously.

      We also added a discussion of the extrapolation for wild birds (in the Discussion section): “Interestingly, our models extrapolate environmental suitability for H5Nx in wild birds in areas where few or no outbreaks have been reported. This discrepancy may be explained by limited surveillance or underreporting in those regions. For instance, there is significant evidence that Kazakhstan and Central Asia play a role as a centre for the transmission of avian influenza viruses through migratory birds (Amirgazin et al., 2022; FAO, 2005; Sultankulova et al., 2024). However, very few wild bird cases are reported in EMPRES-i. In contrast, Australia appears environmentally suitable in our models, yet no incursion of HPAI H5N1 2.3.4.4b has occurred despite the arrival of millions of migratory shorebirds and seabirds from Asia and North America. Extensive surveillance in 2022 and 2023 found no active infections nor evidence of prior exposure to the 2.3.4.4b lineage (Wille et al., 2024; Wille and Klaassen, 2023).”

      We agree that inset panels can be helpful for visualising global patterns. However, all resulting maps are available on the MOOD platform (https://app.mood-h2020.eu/core), which provides an interactive interface allowing users to zoom in and out, identify specific locations using a background map, and explore the results in greater detail. This resource is referenced in the manuscript to guide readers to the platform.

      Related to my review of the manuscript's conceptualisation above, there are several inconsistencies in terminology in the manuscript - clearing these up may help to make the methods and their justification clearer to the reader. The "signal" that the models are estimating is variously described as "susceptibility" and "risk" (lines 179-180), "HPAI H5 ecological suitability" (line 78), "likelihood of HPAI occurrences" (line 139), "risk of HPAI circulation" (line 187), "distribution of occurrence data" (line 428). Each of these quantities has slightly different meanings and it is confusing to the reader that all of these descriptors are used for model output. "Likelihood of HPAI occurrences" is particularly misleading: ecological niche models predict high suitability for a species in areas that are similar to environments where it has previously been identified, without imposing constraints on species movement. It is intuitively far more likely that there will be HPAI occurrences in areas where the disease is already established than in areas where an introduction event is required, however, the niche models in this work do not include spatial relationships in their predictions.

      We agree with the Reviewer’s comments. We have now modified the text so that the Results section refers to ecological suitability when describing the outputs of the models. In the Discussion section, we then interpret this ecological suitability in terms of risk, with areas of high ecological suitability being more likely to support local HPAI outbreaks.

      I also caution the authors in their interpretation of the results of BRTs, which are correlative models, so therefore do not tell us what causes a response variable, but rather what is correlated with it. On Line 31, "correlated with" may be more appropriate than "influenced by". On Line 82, "correlated with" is more appropriate than "driving". This is particularly true given the authors' treatment of sampling bias.

      We agree with the Reviewer’s comment and have now rephrased these sentences as follows: “The spatial distribution of HPAI H5 occurrences in wild birds appears to be primarily correlated with urban areas and open water regions” and “Our results provide a better understanding of HPAI dynamics by identifying key environmental factors correlated with the increase in H5Nx and H5N1 cases in poultry and wild birds, investigating potential shifts in their ecological niches, and improving the prediction of at-risk areas.”

      The following sentences in line 201 are ambiguous: "For both H5Nx and H5N1, however, isolated areas on the risk map should be interpreted with caution. These isolated areas may result from sparse data, model limitations, or local environmental conditions that may not accurately reflect true ecological suitability." By "isolated", do the authors mean remote? Or ecologically dissimilar from the set of locations where HPAI has been detected? Or ecologically dissimilar from the set of locations in the joint set of HPAI detection locations and pseudo-absences? Or ecologically similar to the set of locations where HPAI has been detected but spatially isolated? These four descriptors are each slightly different and change the meaning of the sentences. "Model limitations" are also ambiguous - could the authors clarify which specific model limitations they are referring to here? Ultimately, the point being made is probably that a model may predict high ecological suitability for HPAI transmission in areas where the disease has not yet been identified, or where a model is extrapolating in environmental space, however, uncertainty in these predictions may be greater than uncertainty in predictions in areas that are represented in surveillance data. A clear comment on model uncertainty and how it is related to the surveillance dataset and the covariate dataset is currently missing from the manuscript and would be appropriate in this paragraph.

      We understand the Reviewer’s concerns regarding these potential ambiguities, and have now rephrased these sentences as follows: “For both H5Nx and H5N1, certain areas of predicted high ecological suitability appear spatially isolated, i.e. surrounded by regions of low predicted ecological suitability. These areas likely meet the environmental conditions associated with past HPAI occurrences, but their spatial isolation may imply a lower risk of actual occurrences, particularly in the absence of nearby outbreaks or relevant wild bird movements.”

      I am concerned by the wording of the following sentence: "The risk maps reveal that high-risk areas have expanded after 2020" (line 203). This statement could be supported by an acknowledgement of the assumptions the models make of the HPAI niche: are we saying that the niche is unchanged in environmental space and that there are now more geographic areas accessible to the pathogen, or that the niche has shifted or expanded, and that there are now more geographic areas accessible to the pathogen? The authors should review the sentence beginning on line 117: if models trained on data from the old timepoint predicting to the new timepoint are almost as good as models trained on data from the new timepoint predicting to the new timepoint, doesn't this indicate that the niche, as the models are able to capture it, has not changed too much?

      We thank the Reviewer for this comment. The statement that "high-risk areas have expanded after 2020" indeed refers to an increase in the geographic extent of areas predicted to have high ecological suitability in models trained on post-2020 data. This expansion likely reflects new outbreak data from regions that had not previously reported cases, which in turn influenced model training.

      However, models trained on pre-2020 data retain reasonable predictive performance when applied to post-2020 data (see the AUC results reported in Table S1). The models therefore point to an expansion in ecological suitability but do not provide definitive evidence of a shift in the ecological niche. We have now added a statement at the end of this paragraph to clarify this point: “However, models trained on pre-2020 data maintained reasonable predictive performance when tested on post-2020 data, suggesting that the overall ecological niche of HPAI did not drastically shift over time.”

      The final two paragraphs of the Results might be more helpful to include at the beginning of the Results, as the data discussed there are inputs to the models. Is it possible that the "rise in Shannon index for sea birds" that "suggests a broadening of species diversity within this category from 2020 onwards" is caused by the increasingly structured surveillance of HPAI in wild birds alluded to earlier in the Results? Is the "prevalence" discussed in line 226 the frequency of the families Laridae and Sulidae being represented in HPAI detection data? Or the abundance of the bird species themselves? The language here is a little ambiguous. Discussion of particular values of Shannon/Simpson indices is slightly out of context as the meanings of the indices are in the Methods - perhaps a brief explanation of the uses of Shannon/Simpson indices may be helpful to the reader here. It may also be helpful to readers who are not acquainted with avian taxonomy to provide common names next to formal names (for example, in brackets) in the body of the text, as this manuscript is published in an interdisciplinary journal.

      We thank the Reviewer for these comments. First, we acknowledge that the paragraphs on species diversity and Shannon/Simpson indices describe important data, but we have chosen to present them after the main modelling results in order to maintain a logical narrative flow. Our manuscript first presents the ecological niche models and their predictive performance, followed by interpretations of the observed patterns, including changes in avian host diversity. Diversity indices were used primarily to support and contextualise the patterns observed in the modelling results.

      For clarity, we have revised the relevant paragraphs in the Results (i) to briefly remind readers of the interpretation of the Shannon and Simpson indices (“Note that these indices reflect the diversity of bird species detected in outbreak records, not necessarily their abundance in the wild”) and (ii) to clarify that “prevalence” refers to the frequency of HPAI detection in wild bird species of the Laridae (gulls) and Sulidae (boobies and gannets) families, and not their total abundance. Because a bird family includes several species, the common name of a family can sometimes also refer to species from other families. We have nevertheless added the common names for each family in the manuscript (while acknowledging that a name such as “penguins” can be ambiguous).
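
      For readers unfamiliar with these indices, the following minimal sketch shows how they can be computed from detection records. The records are hypothetical, and the Gini-Simpson form (1 minus the sum of squared proportions) is an assumption, since the exact Simpson variant used in the manuscript is not stated here.

```python
import math
from collections import Counter


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over categories with p_i > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2); assumed variant, see lead-in."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical detection records (one bird family per infected-bird record).
records = ["Laridae", "Laridae", "Sulidae", "Anatidae"]
family_counts = list(Counter(records).values())

shannon = shannon_index(family_counts)
simpson = simpson_index(family_counts)
print(f"Shannon: {shannon:.3f}, Simpson: {simpson:.3f}")
```

      Both indices are computed from the proportions of detections per category, which is why they describe diversity among detected cases rather than abundance in the wild.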

      In the Methods, it is stated: "To address the heterogeneity of AIV surveillance efforts and to avoid misclassifying low-surveillance areas as unsuitable for virus circulation, we trained the ecological niche models only considering countries in which five or more cases have been confirmed." However, it is not clear how this processing step prevents low-surveillance areas from being misclassified. If pseudo-absences are appropriately sampled, low-surveillance areas should be less represented in the pseudo-absence dataset, which should lead the models to be uncertain in their predictions of these areas. Perhaps "To address the heterogeneity of AIV surveillance efforts and to avoid sampling pseudo-absence data in realistically low-surveillance areas" is a more accurate introduction to the paragraph. I am not entirely convinced that it is appropriate to remove detection data where the national number of cases is low. This may introduce further sampling bias into the dataset.

      We take the opportunity of the Reviewer’s comment to further clarify this important step aiming to mitigate bias associated with countries with substantial uncertainty in reporting and/or potentially insufficient HPAI surveillance data. While we indeed acknowledge that this procedure may exclude countries that had effective surveillance but low virus detection, we argue that it constitutes a relevant conservative approach to minimising the risk of sampling a significant number of pseudo-absence points in areas associated with relatively high yet undetected local HPAI circulation due to insufficient surveillance. Furthermore, given that five cases over two decades is a relatively low threshold — particularly for a highly transmissible virus such as AIV — non-detection or non-reporting remains a more plausible explanation than true absence.

      To improve clarity, we have now revised the related sentence as follows: “To account for heterogeneity in AIV surveillance and minimise the risk of sampling pseudo-absences in poorly monitored regions, we restricted our analysis to countries (or administrative level 1 units in China and Russia) with at least five confirmed outbreaks.”

      The reporting of spatial and temporal resolution of data in the manuscript could be significantly clearer. Is there a reason why human population density is downscaled to 5 arcminutes (~10km at the equator) while environmental covariate data has a resolution of 1km? The projection used is not reported. The authors should clarify the time period/resolution of the covariate data assigned to the occurrence dataset, for example, does "day LST annual mean" represent a particular year pre- or post-2020? Or an average over a number of years? Given that disease detections are associated with observation and reporting dates, and that there may be seasonal patterns in HPAI occurrence, it would be helpful to the reader to include this information when the eco-climatic indices are described. It would also be helpful to the reader to summarise the source, spatial and temporal resolution of all covariates in a table, as in Dhingra et al. Could the Authors clarify whether the duck density layer is farmed ducks or wild ducks?

      The projection is WGS 84 (EPSG:4326) and the resolution of the output maps is around 0.0833 x 0.0833 decimal degrees (i.e. 5 arcmin, or approximately 10 km at the equator). We have now added these specifications in the text: “All maps are in a WGS84 projection with a spatial resolution of 0.0833 decimal degrees (i.e. 5 arcmin, or approximately 10 km at the equator).” In addition, we have now specified in the text that duck refers to domestic duck for clarity. 
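
      The stated resolution can be checked with a short calculation. This sketch assumes the WGS 84 equatorial radius (6378.137 km) and measures distance along the equator only; cells become narrower in the east-west direction at higher latitudes.

```python
import math

EARTH_EQUATORIAL_RADIUS_KM = 6378.137  # WGS 84 semi-major axis


def arcmin_to_degrees(arcmin):
    """Convert arcminutes to decimal degrees (60 arcminutes per degree)."""
    return arcmin / 60.0


def degrees_to_km_at_equator(degrees):
    """Arc length along the equator for a given angle in decimal degrees."""
    return math.radians(degrees) * EARTH_EQUATORIAL_RADIUS_KM


resolution_deg = arcmin_to_degrees(5)
resolution_km = degrees_to_km_at_equator(resolution_deg)
print(f"{resolution_deg:.4f} degrees, about {resolution_km:.1f} km at the equator")
```

      The result is slightly below 10 km, consistent with the "approximately 10 km" quoted in the text.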

      The environmental variables retrieved for our analyses were available as values averaged over distinct periods of time (for further detail, see Supplementary Information Resources S1, which gives the description and source of each environmental variable included in the original sets of variables and is available at https://github.com/sdellicour/h5nx_risk_mapping). In future work, it would indeed be interesting to associate occurrences with a specific season and with correspondingly matched variables, especially for viruses such as HPAI whose occurrence has been found to correlate with season. However, we did not conduct this type of analysis in the present study, in which occurrences are associated with averaged values of environmental data only.

      In line 407, the authors state a number of pseudo-absence points used in modelling, relative to the number of presence points, without clear justification. Note that relative weights can be assigned to occurrence data in most ECN software (e.g., R package gbm), to allow many pseudo-absence points to be sampled to represent the full extent of probable surveillance effort and subsequently down-weighted.

      We thank the Reviewer for this suggestion. We acknowledge that alternative approaches such as down-weighting pseudo-absence points could offer a certain degree of flexibility in representing surveillance effort. However, we opted for a fixed 1:3 ratio of pseudo-absences to presence points within each administrative unit to ensure a consistent and conservative sampling distribution. This approach aimed to limit overrepresentation of pseudo-absences in areas with sparse presence data, while still reflecting areas of likely surveillance.

      There are a number of typographical errors and phrasing issues in the manuscript. A non-exhaustive list is provided below.

      - Line 21: "its" should be "their"

      - Line 25: "HPAI cases"

      These modifications have been made.

      - Line 63: sentence beginning "However" is somewhat out of context - what is it (briefly) about recent outbreaks that challenge existing models?

      We have now edited that sentence as follows: “However, recent outbreaks raise questions about whether earlier ecological niche models still accurately predict the current distribution of areas ecologically suitable for the local circulation of HPAI H5 viruses.”

      - Lines 71 and 390: "AIV" is not defined in the text

      - Line 73: "do" ("are" and "what" are not capitalised)

      These modifications have been made.

      - Line 115: "predictability" should be "predictive capacity"

      We have now replaced “predictability” with “predictive performance”.

      - Line 180: omit "pinpointing"

      - Line 192 sentence beginning "In India," should be re-worded: is the point that there are detections of HPAI here and the model predicts high ecological suitability?

      - Line 195 sentence beginning "Finally," phrasing could be clearer: Dhingra et al. find high suitability areas for H5Nx in North America which are predicted to be low suitability in the new model.

      - Line 237: omit "the" in "with the those"

      - Line 374: missing "."

      - Line 375: "and" should be "to" (the same goes for line 421)

      - Line 448: Rephrase "Simpson index goes" to "The Simpson index ranges"

      These modifications have been made.

      Reviewer #2 (Public Review):

      What is the justification for separating the dataset at 2020? Is it just the gap in-between the avian influenza outbreaks?

      We chose 2020 as a cut-off based on a well-documented shift in HPAI epidemiology, notably the emergence and global spread of clade 2.3.4.4b, which may affect host dynamics and geographic patterns. We have now added this clarification in the Materials and Methods section: “We selected 2020 as a cut-off point to reflect a well-documented shift in HPAI epidemiology, notably the emergence and global spread of clade 2.3.4.4b. This event marked a turning point in viral dynamics, influencing both the range of susceptible hosts and the geographical distribution of outbreaks.”

      If the analysis aims to look at changing case numbers and distribution over time, surely the covariate datasets should be contemporaneous with the response?

      Thank you for raising this important point. While we acknowledge that, ideally, covariates should match the response temporally, such high-resolution spatiotemporal environmental data were not available for most environmental factors considered in our ecological niche modelling analyses. While we used predictors (e.g., land-use variables, poultry density) that reflect long-term ecological suitability, we acknowledge that considering short-term seasonal variation instead could be an interesting direction for future work, which is now explicitly stated in the Discussion section: “In addition, aligning outbreak occurrences with seasonally matched environmental variables could further refine predictions of HPAI risk linked to migratory dynamics.”

      I would expect quite different immunity dynamics between domestic and wild birds as a function of lifespan and birth rates - though no obvious sign of that in the raw data. A statement on assumptions in that respect would be good.

      Thank you for the comment. We agree that domestic and wild birds likely exhibit different immunity dynamics due to differences in lifespan, turnover rates, and exposure. However, our analyses did not explicitly model immunity processes, and the data did not show a clear signal of these differences.

      Decisions and analytical tactics from Dhingra et al are adopted here in a way that doesn't quite convey the rationale, or justify its use here.

      We thank the Reviewer for this observation. However, we do not agree with the notion that the rationale for using Dhingra et al.’s analytical framework is insufficiently conveyed. We adapted key components of their ecological niche modelling approach — such as the use of a boosted regression tree methodology and pseudo-absences sampling procedure — to ensure comparability with their previous findings, while also extending the analysis to additional time periods and host categories (wild vs. domestic birds). This framework aligns with the main objective of our study, which is to assess shifts in ecological suitability for HPAI over time and across host species, in light of changing viral dynamics.  

      Please go over the manuscript and harmonise the language about the model target - it is usually referred to as cases, but sometimes the pathogen, and others the wild and domestic birds where the cases were discovered.

      We agree and we have now modified the text to only use the “cases” or “occurrences” terminology when referring to the model inputs.

      Is the reporting of your BRT implementation correct? The text suggests that only 10 trees were run per replicate (of which there were 10 per response (domestic/wild x H5N1 / H5Nx) x distinct covariate set), but this would suggest that the authors were scarcely benefiting from the 'boosting' part of the BRTs that allow them to accurately estimate curvilinear functions. As additional trees are added, they should still be improving the loss function, and dramatically so in the early stages. The authors seem heavily guided by Elith et al's excellent paper[1] explaining BRTs and the companion tutorial piece, but in that work, the recommended approach is to run an initial model with a relatively quick learning rate that achieves the best fit to the held-out data at somewhere over 1000 trees, and then to refine the model to that number of trees with a slower learning rate. If the authors did indeed run only 10 trees I think that should be explained.

      For each model, we used the “gbm.step” function to fit boosted regression trees, initiating the process with 10 trees and allowing up to 10,000 trees in steps of 5. The optimal number of trees was automatically determined by minimising the cross-validated deviance, following the recommended approach of Elith and colleagues (2008, J. Anim. Ecol.). This setup allows the boosting algorithm to iteratively improve model performance while avoiding overfitting. These aspects are now further clarified in the Materials and Methods section: “All BRT analyses were run and averaged over 10 cross-validated replicates, with a tree complexity of 4, a learning rate of 0.01, a tolerance parameter of 0.001, and while considering 5 spatial folds. Each model was initiated with 10 trees, and additional trees were incrementally added (in steps of 5) up to a maximum of 10,000, with the optimal number selected based on cross-validation tests.”
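
      The tree-count selection described above can be sketched as follows. This is a simplified illustration, not the “gbm.step” implementation: the cross-validated deviance is replaced by a hypothetical stand-in curve, and the sketch scans every candidate tree count up to the maximum, whereas “gbm.step” has its own stopping rule.

```python
def select_n_trees(cv_deviance, n_start=10, step=5, max_trees=10_000):
    """Return the tree count (and deviance) that minimises cv_deviance.

    Candidate counts start at n_start and grow in increments of `step`
    up to max_trees, mirroring the settings quoted in the text.
    """
    best_n, best_dev = None, float("inf")
    for n_trees in range(n_start, max_trees + 1, step):
        dev = cv_deviance(n_trees)
        if dev < best_dev:
            best_n, best_dev = n_trees, dev
    return best_n, best_dev


def toy_cv_deviance(n_trees):
    """Hypothetical U-shaped deviance curve: too few trees underfit,
    too many overfit. A real curve would come from cross-validation."""
    return 0.5 + ((n_trees - 1500) / 1500.0) ** 2


best_n, best_dev = select_n_trees(toy_cv_deviance)
print(best_n, best_dev)
```

      The point of the sketch is that the initial 10 trees are only a starting value: the number of trees retained is whichever candidate gives the lowest held-out deviance.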

      I'm uncomfortable with the strong interpretation of changes in indices such as those for diversity in the case of bird species with detected cases of avian influenza, and the relative influence of covariates in the environmental niche models. In the former case, if surveillance effort is increasing it might be expected that more species will be found to be infected. In the latter, I'm just not convinced that these fundamentally correlative models can support the interpretation of changing epidemiology as asserted by authors. This strikes me as particularly problematic in light of static and in some cases anachronistic predictor sets.

      We thank the Reviewer for drawing attention to how changes in surveillance intensity might influence our diversity estimates. We have now integrated a new analysis to evaluate the increase in the number of wild birds tested and discussed the potential impact of this increase on the comparison of the bird species diversity metrics presented in our study, which is now interpreted with more caution: “To evaluate whether the post-2020 increase in species diversity estimated for infected wild birds could result from an increase in the number of tests performed on wild birds, we compared European annual surveillance test counts (EFSA et al., 2025, 2019) before and after 2020 using a Wilcoxon rank-sum test. We relied on European data because it was readily accessible and offered standardised and systematically collected metrics across multiple years, making it suitable for a comparative analysis. Although borderline significant (p-value = 0.063), the Wilcoxon rank-sum test indeed highlighted a recent increase in the number of wild bird tests (on average >11,000/year pre-2020 and >22,000 post-2020), which indicates that the comparison of bird species diversity metrics should be interpreted with caution. However, such an increase in the number of tests conducted in the context of a passive surveillance framework would thus also be in line with an increase in the number of wild birds found dead and thus tested. Therefore, while the increase in the number of tests could indeed impact species diversity metrics such as the Shannon index, it can also reflect an absolute higher wild bird mortality in line with a broadened range of infected bird species.”
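
      As an illustration of the test mentioned above, the sketch below computes an exact two-sided Wilcoxon rank-sum p-value by enumerating all possible rank assignments. The annual test counts are hypothetical placeholders, not the EFSA figures, and the exact enumeration is one of several ways to obtain a p-value (statistical packages may use a normal approximation or continuity correction instead).

```python
from itertools import combinations


def midranks(values):
    """1-based ranks of `values`, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_exact(x, y):
    """Rank sum of sample x in the pooled data and exact two-sided p-value."""
    ranks = midranks(list(x) + list(y))
    n_x = len(x)
    observed = sum(ranks[:n_x])
    expected = n_x * (len(ranks) + 1) / 2  # mean rank sum under the null
    observed_dev = abs(observed - expected)
    # Enumerate every way of assigning n_x of the pooled ranks to sample x.
    sums = [sum(c) for c in combinations(ranks, n_x)]
    extreme = sum(1 for s in sums if abs(s - expected) >= observed_dev - 1e-12)
    return observed, extreme / len(sums)


# Hypothetical annual test counts, for illustration only.
pre_2020 = [9000, 10500, 12000, 13000]
post_2020 = [15000, 21000, 26000]
w_stat, p_value = rank_sum_exact(post_2020, pre_2020)
print(w_stat, p_value)
```

      With only a handful of years per period, even fully separated samples cannot reach very small p-values, which is one reason a borderline result is plausible for this kind of comparison.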

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study by Li and coworkers addresses the important and fundamental question of replication initiation in Escherichia coli, which remains open, despite many classic and recent works. It leverages single-cell mRNA-FISH experiments in strains with titratable DnaA and novel DnaA activity reporters to monitor DnaA activity peaks versus size. The authors find oscillations in DnaA activity and show that their peaks correlate well with the estimated population-average replication initiation volume across conditions and imposed dnaA transcription levels. The study also proposes a novel extrusion model where DNA-binding proteins regulate free DnaA availability in response to biomass-DNA imbalance. Experimental perturbations of H-NS support the model validity, addressing key gaps in current replication control frameworks.

      Strengths:

      I find the study interesting and well conducted, and I think its main strong points are:

      (1) the novel reporters obtained with systematic synthetic biology methods, and combined with a titratable dnaA strain.

      (2) the interesting perturbations (titration, production arrest, and H-NS).

      (3) the use of single-cell mRNA FISH to monitor transcripts directly.

      The proposed extrusion model is also interesting, though not fully validated, and I think it will contribute positively to the future debate.

      We thank the reviewer for acknowledging the strengths of our study.

      Weaknesses and Limitations:

      (1) A relevant limitation in novelty is that DnaA activity and concentration oscillations have been reported by the cited Iuliani and coworkers previously by dynamic microscopy, and to a smaller extent by the other cited study by Pountain and coworkers using mRNA FISH.

      (2) An important limitation is that the study is not dynamic. While monitoring mRNA is interesting and relevant, the current study is based on concentrations and not time variations (or nascent mRNA). Conversely, the study by Iuliani and coworkers, while having the drawback of monitoring proteins, can directly assess production rates. It would be interesting for future studies or revisions to monitor the strains and reporters dynamically, as well as using (as a control) the technique of this study on the chromosomal reporters used by Iuliani et al.

      We acknowledge the value of dynamic measurements and clarify our methodological rationale.

      While Iuliani et al. provided valuable temporal resolution through protein dynamics, our mRNA FISH approach achieves direct decoupling of transcriptional vs. post-translational regulation (Fig 4F-H), and condition flexibility across 7 growth rates (30-66 min doubling times). This trade-off sacrifices temporal resolution for enhanced population-scale resolution and perturbation flexibility. To directly address temporal coupling, future work will implement dual-color live imaging of DnaA activity concurrent with replication initiation events.

      (3) Regarding the mathematical models, a lot of details are missing regarding the definitions and the use of such models, which are only presented briefly in the Methods section. The reader is not given any tools to understand the predictions of different models, and no analytical estimates are used. The falsification procedures are not clear. More transparency and depth in the analysis are needed, unless the models are just used as a heuristic tool for qualitative arguments (but this would weaken the claims). The Berger model, for example, has many parameters and many regimes and behaviors. When models are compared to data (e.g., in Figure 2G), it is not clear which parameters were used, how they were fixed, and whether and how the model prediction depends on parameters.

      We agree that model transparency is essential for quantitative validation. To address this, all model parameters (DnaA synthesis rate, activation/deactivation rates, etc.) are explicitly tabulated in Supplementary Information Table S6. For the titration (Hansen et al. 1991) and extrusion models, we derive analytical expressions for initiation mass (IM) sensitivity to DnaA expression in Supplementary Note 1. For Figure 2G/S6, we used published parameters (Berger & Wolde 2022 SI Table 2) with the experimental growth conditions (μ = 1.54 h⁻¹).

      The extrusion model's validation relies primarily on its ability to resolve paradoxical initiation events under dnaA shutdown (Fig 6C), a test where other models fail categorically. While the Berger titration-switch hybrid can fit steady-state IM trends (Fig S6A), it cannot reproduce post-shutdown dynamics without ad hoc modifications (Fig S6B). We acknowledge that comprehensive analysis of all model regimes exceeds this study's scope but provide full simulation code for independent verification: https://github.com/BaiYangBqdq/dynamics_of_biomass_DNA_coordination

      (4) Importantly, the main statement about tight correlations of peak volumes and average estimated initiation volume does not establish coincidence, and some of the claims by the authors are unclear in these respects (e.g., when they say "we resolve a 1:1 coupling between DnaA activity thresholds and replication initiation", the statement could be correct but is ambiguous). Crucially, the data rely on average initiation volumes (on which there seems to be an eternally open debate, also involving the authors), and the estimate procedure relies on assumptions that could lead to biases and uncertainties added to the population variability (in any case, error bars are not provided).

      We acknowledge the limitations of population-level inference and have refined our claims: "Replication initiation volume scales proportionally with peak DnaA activity volume with a slope of 1.0 (R<sup>2</sup>=0.98, Fig 7G), indicating predictive correspondence rather than absolute coincidence. While population-level 𝑉<sub>𝑖</sub> estimation cannot resolve single-cell stochasticity, the consistent 𝑉*:𝑉<sub>𝑖</sub> relationship across 20 conditions suggests that DnaA activity thresholds predict initiation timing within physiological error margins". Future work will monitor DnaA activity and replication forks simultaneously using microfluidic single-cell tracking.

      (5) The delays observed by the authors (in both directions) between the peaks of DnaA-activity conditional averages with respect to volume and the average estimated initiation volumes are not incompatible with those observed dynamically by Iuliani and coworkers. The direct experiment to prove the authors' point would be to use a direct proxy of replication initiation, such as SeqA or DnaN, and monitor initiations and quantify DnaA activity peaks jointly, with dynamic measurements.

      We acknowledge the observed temporal deviations between DnaA activity peaks (𝑉*) and population-derived volumes at initiation (𝑉<sub>𝑖</sub>) in certain conditions, in line with the findings of Iuliani et al. These deviations might be mechanistically consistent with the time required for orisome assembly or oriC sequestration. They do not contradict our core finding that initiation occurs at a defined DnaA activity threshold (slope=1.0, R<sup>2</sup>=0.98 in the 𝑉*:𝑉<sub>𝑖</sub> correlation).

      (6) While not being an expert, I had some doubt that the fact that the reporters are on plasmid (despite a normalization control that seems very sensible) might affect the measurements. Also, I did not understand how the authors validated the assumptions that the reporters are sensitive to DnaA-ATP specifically. It seems this assumption is validated by previous studies only.

      We employed a plasmid-based reporter system to circumvent the significant confounding effects of chromosomal position on promoter activity, as extensively documented by Pountain et al., where local genomic context (e.g., nucleoid occlusion, supercoiling gradients, and neighboring operons) introduces uncontrolled variability. By housing the P<sub>syn66</sub> test promoter and P<sub>con</sub> normalization control in identical low-copy pSC101 vectors (<8 copies/cell, Peterson & Phillips, Plasmid 2008), we ensured they experience equivalent physical and biochemical environments. This ratiometric design, from which DnaA activity is calculated, corrects for global fluctuations in RNA polymerase availability, nucleotide pools, and plasmid copy number. Critically, P<sub>syn66</sub>’s architecture emulates natural DnaA-responsive elements: its strong DnaA-boxes report free DnaA concentration, while its weak box is preferentially bound by DnaA-ATP (Speck et al., EMBO Journal 1999), mirroring the nucleotide-state sensitivity of oriC and the native dnaA promoter. This system was indispensable for our central finding, as it uniquely enabled the decoupling of DnaA activity oscillations from transcriptional feedback (Fig. 4F-H), an experiment that is not possible with chromosomally integrated reporters because of autoregulatory interference.
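
      The logic of ratiometric normalization can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the authors' analysis: each reporter signal is taken to be an intrinsic promoter strength multiplied by a shared cellular factor (standing in for plasmid copy number or RNA polymerase availability), and the function names and numerical values are hypothetical. The exact definition of the DnaA activity metric is not reproduced in this response.

```python
def reporter_signal(promoter_strength: float, global_factor: float) -> float:
    """Toy model: signal = intrinsic promoter strength x shared cellular factor."""
    return promoter_strength * global_factor


def ratiometric_readout(test_signal: float, control_signal: float) -> float:
    """Ratio of test to control reporter; the shared factor cancels out."""
    return test_signal / control_signal


# Hypothetical promoter strengths for the test (Psyn66) and control (Pcon) reporters.
p_syn66, p_con = 0.4, 2.0

# Hypothetical values of the shared factor (e.g. copy number, RNAP availability).
for g in (0.5, 1.0, 3.0):
    ratio = ratiometric_readout(reporter_signal(p_syn66, g),
                                reporter_signal(p_con, g))
    print(f"global factor {g}: ratio = {ratio:.3f}")
```

      In this toy model the ratio depends only on the two promoter strengths, which is the property the ratiometric design relies on. Any factor that affected the two reporters differently would not cancel.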

      Overall Appraisal:

      In summary, this appears as a very interesting study, providing valuable data and a novel hypothesis, the extrusion model, open to future explorations. However, given several limitations, some of the claims appear overstated. Finally, the text contains some self-evaluations, such as "our findings redefine the paradigm for replication control", etc., that appear exaggerated.

      We thank the reviewer for highlighting the need for precise language in framing our conclusions. We have implemented the following substantive revisions throughout the manuscript to ensure claims align strictly with empirical evidence:

      (1) Changed "redefine the paradigm for replication control" into "advance the paradigm for replication control" (Introduction)

      (2) Changed "redefine bacterial cell cycle control" into "refine bacterial cell cycle control as a dynamic interplay..." (Discussion)

      (3) Removed the term "spatial" from the Discussion's description of DnaA-chromosome interactions (Discussion, first paragraph).

      (4) Changed "provides a blueprint" into "provides a valuable tool for dissecting spatial regulation..." (Discussion, final paragraph)

      (5) Scrutinized all superlatives (e.g., "critical feat" into "important capability"; "fundamental principle of cellular organization" into "potential organizational strategy")

      (6) Replaced the instances of "robust" with evidence-backed descriptors (e.g., "sensitive," "consistent")

      (7) We agree that the extrusion model requires further validation and have emphasized this in Discussion: "While H-NS perturbation supports extrusion mechanism, future work should identify the full extruder interactome and elucidate how metabolic signals modulate their activity" (final paragraph)

      This calibrated language more accurately represents our study as a conceptual advance with testable mechanisms, not a complete paradigm shift.

      Reviewer #2 (Public review):

      Summary:

      The authors show that in E. coli, the initiator protein DnaA oscillates post-translationally: its activity rises and peaks exactly when DNA replication begins, even if dnaA transcription is held constant. To explain this, they propose an "extrusion" mechanism in which nucleoid-associated proteins such as H-NS, whose amount grows with cell volume, dislodge DnaA from chromosomal binding sites; modelling and H-NS perturbations reproduce the observed drop in initiation mass and extra initiations seen after dnaA shut-down. Together, the data and model link biomass growth to replication timing through chromosome-driven, post-translational control of DnaA, filling gaps left by classic titration and ATP/ADP-switch models.

      Strengths:

      (1) Introduces an "extrusion" model that adds a new post-translational layer to replication control and explains data unexplained by classic titration or ATP/ADP-switch frameworks.

      (2) A major asset of the study is that it bridges the longstanding gap between DnaA oscillations and DNA-replication initiation, providing direct single-cell evidence that pulses of DnaA activity peak exactly at the moment of initiation across multiple growth conditions and genetic perturbations.

      (3) A tunable dnaA strain and targeted H-NS manipulations shift initiation mass exactly as the model predicts, giving model-driven validation across growth conditions.

      (4) A purpose-built Psyn66 reporter combined with mRNA-FISH captures DnaA-activity pulses with cell-cycle resolution, providing direct, compelling data.

      We thank the reviewer for acknowledging the strengths of our study.

      Weaknesses:

      (1) What happens to the (C+D) period and initiation time as the dnaA mRNA level changes? This is not discussed in the text or figure and should be addressed.

      We thank the reviewer for this important observation. Our data demonstrate that increased dnaA mRNA levels induce two compensatory changes in cell cycle progression:

      (1) Earlier replication initiation, manifested as a reduced initiation mass: the initiation mass decreased from 5.6 to 2.6 (OD<sub>600</sub>·ml per 10<sup>10</sup> cells) as the relative dnaA mRNA level increased from 0.2 to 7.2 (normalized to the wild-type level) (Fig. 2F, red).

      (2) Prolonged C+D period: Increased by approximately 60% (from 1.05 to 1.66 hours, Fig. 2F blue).

      The complete quantitative relationship is now explicitly described in the Results section: “Concurrently, the initiation mass was reduced by 50%, and the period from initiation to division (C+D) was increased by ~60% (Fig. 2F)”
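
      As a quick arithmetic check on the values quoted in this response, a minimal Python sketch is given below. The helper name `percent_change` is ours and only the four endpoint values come from the text above.

```python
def percent_change(old: float, new: float) -> float:
    """Signed percent change going from `old` to `new`."""
    return (new - old) / old * 100.0


# Endpoint values quoted in the response (Fig. 2F).
initiation_mass_change = percent_change(5.6, 2.6)   # OD600·ml per 10^10 cells
cd_period_change = percent_change(1.05, 1.66)       # C+D period, hours

print(f"Initiation mass change: {initiation_mass_change:.1f}%")
print(f"C+D period change: {cd_period_change:.1f}%")
```

      The two changes evaluate to roughly -54% and +58%, close to the rounded "reduced by 50%" and "~60%" figures quoted from the Results.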

      (2) It is unclear what is meant by "relative dnaA mRNA level." Relative to what? Wild-type expression? Maximum expression? This should be explicitly defined.

      The relative dnaA mRNA level was obtained by normalizing to that in wild-type MG1655 cells grown in the same medium. To clarify this point, we have now marked the wild-type level in Fig. 1B, and a clear description of this has also been included in the figure caption.

      (3) It would be helpful to provide some intuition for why an increase in dnaA mRNA level leads to a decrease in initiation mass per ori and an increase in oriC copy number.

      Thank you for your valuable suggestion. Increased dnaA mRNA accelerates DnaA accumulation, causing cells to reach the initiation threshold at a smaller cell size (reducing initiation mass, Fig. 2F red). This earlier initiation increases oriC copies per cell at the population level (Fig. 2E). This mechanistic interpretation now appears in the Results: “As the DnaA expression level increases, DnaA activity reaches the initiation threshold earlier. Given that cell mass remained nearly unchanged, this earlier initiation led to an increase in population-averaged cellular oriC numbers (Fig. 2E).”

      (4) The titration and switch models do not explicitly include dnaA mRNA in the dynamics of DnaA protein. Yet, in Figure 2G, initiation mass is shown to decrease linearly with dnaA mRNA level in these models. How was dnaA mRNA level represented or approximated in these simulations?

      All models presented in this article omit explicit modeling of dnaA mRNA dynamics for simplicity. However, at steady state, the relative level of dnaA mRNA can be approximated by the relative expression rate of DnaA protein, as both reflect the expression level of DnaA. This detail is now clarified in the caption of Figure 2G.

      (5) Is Schaechter's law (i.e., exponential scaling of average cell size with growth rate) still valid under the different dnaA mRNA expression conditions tested?

      Schaechter's law describes the exponential scaling of average cell size with growth rate in bacteria. In our prior work (Zheng et al., Nature Microbiology 2020), we demonstrated that Schaechter's law fails in slow-growth regimes. In the current study, however, growth rate remained constant across different dnaA expression levels (Fig. 2C), and cell mass showed no significant change (Fig. 2D). Since Schaechter's law specifically addresses how cell size scales with growth rate, it does not apply here: our perturbations selectively alter replication initiation dynamics, not growth rate or size scaling.

      (6) The manuscript should explain more explicitly how the extrusion model implements posttranslational control of DnaA and, in particular, how this yields the nonlinear drop in relative initiation mass versus dnaA mRNA seen in Figure 6E. Please provide the governing equation that links total DnaA, the volume-dependent "extruder" pool, and the threshold of free DnaA at initiation, and show - briefly but quantitatively - how this equation produces the observed concave curve.

      The governing equations linking initiation mass and DnaA expression level are now provided in Supplementary Note S1 for both the titration and the extrusion models. In general, the dependence of initiation mass (𝑉<sub>𝐼</sub>) on dnaA expression level (𝛼<sub>𝐴</sub>) takes an inverse-proportionality form: 𝑉<sub>𝐼</sub> ∝ 1/𝛼<sub>𝐴</sub>. In the extrusion model, the incorporated extruder protein is assumed to have synthesis dynamics similar to those of DnaA and can release DnaA from DnaA-boxes. Denoting the synthesis rate of the extruder as 𝛼<sub>𝐻</sub>, the combined effect of DnaA and the extruder on replication initiation is described by the expression in Supplementary Note S1, in which 𝛼<sub>𝐻</sub> contributes additively alongside 𝛼<sub>𝐴</sub>. This additive contribution of 𝛼<sub>𝐻</sub> dampens the sensitivity of initiation mass to changes in 𝛼<sub>𝐴</sub>, resulting in a significantly flattened curve. As a result, the predicted 𝑉<sub>𝐼</sub> − 𝛼<sub>𝐴</sub> relationship has a concave shape in the semi-log plots.
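
      The dampening effect of an additive term can be illustrated with a minimal Python sketch. It assumes, purely for illustration, the simplified forms 𝑉<sub>𝐼</sub> = k/𝛼<sub>𝐴</sub> (titration-like) and 𝑉<sub>𝐼</sub> = k/(𝛼<sub>𝐴</sub> + 𝛼<sub>𝐻</sub>) (extrusion-like), with hypothetical values k = 1 and 𝛼<sub>𝐻</sub> = 1. These are assumptions, not the exact expressions of Supplementary Note S1, and all function names are ours.

```python
import math
from typing import Callable


def titration_like(alpha_a: float, k: float = 1.0) -> float:
    """Assumed form: initiation mass inversely proportional to alpha_A."""
    return k / alpha_a


def extrusion_like(alpha_a: float, alpha_h: float = 1.0, k: float = 1.0) -> float:
    """Assumed form: alpha_H adds to alpha_A in the denominator."""
    return k / (alpha_a + alpha_h)


def log_log_slope(f: Callable[[float], float], alpha_a: float,
                  eps: float = 1e-6) -> float:
    """Numerical d ln(V_I) / d ln(alpha_A) by central difference."""
    lo, hi = alpha_a * (1.0 - eps), alpha_a * (1.0 + eps)
    return (math.log(f(hi)) - math.log(f(lo))) / (math.log(hi) - math.log(lo))


# Relative expression levels spanning the range quoted in the response.
for a in (0.2, 1.0, 7.2):
    s_tit = log_log_slope(titration_like, a)
    s_ext = log_log_slope(extrusion_like, a)
    print(f"alpha_A = {a}: titration-like slope {s_tit:.3f}, "
          f"extrusion-like slope {s_ext:.3f}")
```

      Under these assumed forms the titration-like sensitivity is always -1, whereas the extrusion-like sensitivity is -𝛼<sub>𝐴</sub>/(𝛼<sub>𝐴</sub> + 𝛼<sub>𝐻</sub>), whose magnitude stays below 1 and is smallest at low 𝛼<sub>𝐴</sub>.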

      (7) Does this extrusion model give the well-known adder per origin, i.e., is initiation-to-initiation an adder?

      Yes, the extrusion model reproduces the initiation-to-initiation adder; this is shown in Fig. S3C.

      (8) DnaA protein or activity is never measured; mRNA is treated as a linear proxy. Yet the authors' own narrative stresses post-translational (not transcriptional) control of DnaA. Without parallel immunoblots or activity readouts, it is impossible to know whether a sixfold mRNA increase truly yields a proportional rise in active DnaA.

      We acknowledge the reviewer's valid concern regarding the indirect nature of our DnaA activity measurements. While mRNA levels alone cannot resolve active DnaA dynamics, our approach integrates functional replication outcomes with a validated synthetic reporter to infer activity. Crucially, elevated dnaA mRNA causes demonstrable biological effects: earlier replication initiation (Fig. 2F) and increased oriC copies (Fig. 2E), directly confirming enhanced functional DnaA activity at the oriC locus. The P<sub>syn66</sub> reporter, engineered with DnaA-boxes mirroring oriC's architecture, provides orthogonal validation, showing progressive repression in response to dnaA induction (Fig. 3C). Our operational metric, based on P<sub>syn66</sub>, responds sensitively to DnaA-chromosome interactions within its characterized 8-fold dynamic range (Fig. 3C). Immunoblots would be inadequate here, as they cannot distinguish functionally critical pools: free versus chromosome-bound DnaA, or DnaA-ATP versus DnaA-ADP, precisely the post-translational states our study implicates in regulation. We therefore prioritize functional readouts (initiation timing) and the P<sub>syn66</sub> reporter, which probes the biologically active fraction relevant to replication control.

      (9) Figure 2 infers both initiation mass and oriC copy number from bulk measurements (OD<sub>600</sub> per cell and rifampicin-cephalexin run-out) instead of measuring them directly in single cells. Any DnaA-dependent changes in cell size, shape, or antibiotic permeability could skew these bulk proxies, so the plotted relationships may not accurately reflect true initiation events.

      We acknowledge the reviewer's valid methodological concern and clarify that while bulk measurements carry inherent limitations, our approach is grounded in established techniques with demonstrated reliability. Cell mass was inferred from OD<sub>600</sub>/cell, which correlates strongly with direct dry weight measurements and microscopic cell volumes across diverse growth conditions, as validated in our prior work (Zheng et al., Nature Microbiology 2020). Crucially, cell mass remained invariant across dnaA expression levels (Fig. 2D).

      Regarding oriC quantification, the rifampicin-cephalexin run-out assay is widely applied in replication initiation studies. Our data show the expected 2<sup>n</sup> oriC distributions without abnormal ploidy (as shown below). While single-cell methods offer superior resolution, our bulk approach provides accurate population-level trends.

      Author response image 1.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The reviewers felt that the mathematical modeling was not adequately explained in the paper, and that this affected the readability of the manuscript. The authors are encouraged to elaborate on this aspect of the paper (in addition to strengthening other claims, if possible, per the reviewers' comments).

      We thank the editor and reviewers for their constructive feedback. We have comprehensively strengthened the mathematical modeling framework to enhance clarity and rigor.

      Reviewer #1 (Recommendations for the authors):

      The only revision I would do is a recalibration of the claims and a major effort to clarify the modeling part (including a detailed SI appendix), without necessarily performing additional work.

      To enhance the transparency of the mathematical modeling, we have completed the model description in the Methods section and added a parameter table with literature-sourced values in Supplementary Information Table S6. Moreover, analytical derivations of initiation mass dependencies are presented in Supplementary Information Note S1.

      Of course, there are extra experiments (mentioned in the public review) that would help support some of the big claims, but that can be considered a different project.

      Thank you for your suggestion. This will be addressed in our future work.

      Minor suggestion: please put signposts or plot jointly to compare the maxima/minima in Figures 4D, E, G, and H.

      We added dashed lines in Figures 4D and 4E to align DnaA activity peaks and transcriptional minima across panels, facilitating direct comparison.

      Reviewer #2 (Recommendations for the authors):

      (1) Should define what DnaA activity is.

      We have explicitly defined DnaA activity in the Introduction as “the capacity to initiate replication…” and noted that it is “governed by free DnaA concentration, DnaA-ATP/-ADP ratio, and orisome assembly competence”.

      (2) Word repetition - “...grown in in Luria-Bertani (LB) medium...”.

      Corrected.

      (3) Typographical error - “FISH ... was preformed” should be “performed”.

      Corrected.

      (4) The manuscript alternates between “ng ml<sup>-1</sup>” and “ng·ml<sup>-1</sup>”; choose one style and apply it uniformly.

      Standardized the units to ng·ml<sup>-1</sup> throughout.

      (5) Reference duplicates - Some citations appear twice in the bibliography (e.g., "Bintu et al., 2005a/b" and "Bintu et al., 2005b" listed again later).

      The studies by Bintu et al. (2005a, 2005b) represent separate works: 2005a details applications, and 2005b develops models.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      We thank the reviewers for their positive assessments overall and for many helpful suggestions for clarification to make the manuscript more accessible to a broader audience. We made minor text changes and added more labels to the figures to address these comments.

      Referee #1

      Summary: In this study, the authors show a genetic interaction of the lipid receptors Lpr-1, Lpr-3 and Scav-2 in C. elegans. They show that Lpr-1 loss-of-function specifically affects aECM localization of Lpr-3 and attribute the lethality of Lpr-1 mutants to this phenotype. The authors performed a mutagenesis screen and identified a third lipid receptor, Scav-2, as a modulating factor: loss of scav-2 partially rescues the Lpr-1 phenotype. The authors created a variety of tools for this study, notably Crispr-Cas9-mediated knock-ins for endogenous tagging of the receptors.

      Major comments:

      1. While the authors provide a nice diagram showing the potential roles and interplay of lpr-1, lpr-3 and scav-2, it remains unclear what their respective cargo is. The nature of the interaction between the proteins remains unclear from the data.

      Response

      • We agree that identifying the relevant cargo(s) will be key to understanding the detailed mechanisms involved and that the lack of such information is a limitation of our study. However, the impact of our study is to show that these lipid transporters functionally interact to affect aECM organization, a role that could be relevant to many systems, including humans.

      As an optional (since time-consuming) experiment I would suggest trying more tissue-specific lipidomics.

      Response

      • This would be an interesting future experiment but is outside our current technical capabilities.

      The lipidomics data should be presented in the figures, even if there were no significant changes. Importantly, show the lipid abundance at least of total lipids, better of individual classes, normalized to the material input (e.g. number of embryos, protein).

      Response

      • The reviewer is right to point out that lipid variations could occur at different levels, and that we should exercise caution. However, the unsupervised lipidomics analysis would have detected not only individual lipid variations, but also variations in the total or subgroup lipid content. Indeed, the eggs were weighed prior to extraction and each sample was extracted with the same precise volume of solvent before analysis. Furthermore, the LC-MS/MS injection sequence included blanks and quality control (QC) samples. The blanks were the extraction solvent, which allowed us to control for features unrelated to the biological samples. The QC sample was a mixture of all the samples included in the injection sequence, reflecting the central values of the model. If a subclass of samples, such as the lpr-1 mutant, had been characterized by a decrease in one lipid, a subgroup of lipids, or all lipids, it would have clustered separately. Instead, our PCA showed that the variation between samples of the same genotype (wild type, lpr-1 mutant, or lpr-1; scav-2) was similar to the variation between samples from two different genotypes. This means that we did not detect modifications to lipid quantity specifically or in total. A figure illustrating the lipid contents would show no difference between groups.

      Figure 1g: I do not understand what the lpr3:gfp signal is: the puncta in the overview image? And where are they in the zoom image showing annuli and alae? Also, how were the annuli and alae structures labeled? Please provide more information.

      Response

      • All of the fluorescent signal shown in this figure panel corresponds to the indicated LPR fusion - no other labelling method was used. SfGFP::LPR-3 labels the matrix structures (alae and annuli) as well as some puncta – the ratio of matrix to puncta changes over developmental stages. We edited the figure legend to make this more clear.

      One point that is not sufficiently addressed is that the authors deduce from the inability of the scav-2 gfp knock-in to suppress lpr1 lethality that scav2 function is not impaired. This is quite indirect. Can the authors provide more convincing evidence that the scav-2 knock-in has normal function?

      Response

      • Suppression of lpr-1 (or other aECM mutant) lethality is the only known phenotype caused by loss of scav-2. Therefore, this is the only phenotype for which we can do a rescue experiment to test functionality of the knock-in. The data presented do indicate that the knock-in fusion retains significant function.

      In general, the data is clearly presented and the statistical analyses look sound.

      Response

      • Thank you

      Minor comments:

      Please provide page and line numbers!

      Response:

      • done

      Avoid contractions like "don't" in both text and figure legends

      Response:

      • changed one instance of “don’t” to “do not”

      Page 12: I do not understand the meaning of the sentence "This transgene also caused more modest lethality in a wild-type background"

      Response:

      • Wording changed to “This transgene caused very little lethality in a wild-type background (Fig. 6C), indicating it is not generally toxic.”

      Figure 7: what is meant with "Dodt"?

      Response:

      • Dodt gradient contrast imaging is a method for transmitted light imaging similar to DIC and is used on some confocal microscopes. It is now explained in the Methods section. We removed the Dodt label from Figure 7 since it seems to be confusing and it is not really important whether the brightfield image is DIC or Dodt.

        Reviewer #1 (Significance (Required)):

        The study is experimentally sound and uses numerous novel tools, such as endogenously tagged lipid receptors. It is an interesting study for researchers in basic research studying lipid receptors and ECM biology. It provides insights on the genetic interaction of lipid receptors. My expertise is in lipid biochemistry, inter-organ lipid trafficking and imaging. I am not very familiar with C. elegans genetics.

      Referee #2

      1. The manuscript is very well written; the documentation is fine, but some more details are needed so that readers not familiar with nematode anatomy can follow the subject.

      For instance, while alae are somehow explained, annuli are not - structures that look abnormal in lpr1 and lpr1-scav2 mutants (Fig. 5B).

      Response

      • Apologies for this oversight. We added annuli labels to Figure 1 and Figure 5 panels and added descriptions of annuli to the Figure 1 legend and the Results text.

      Moreover, the authors show in Fig. 1 the puncta etc. in the epidermis, whereas in Fig. 2 they show Lpr3 accumulation or not in the duct and the pore (lpr1). How do they localize in the cells of these structures at high magnification? It is also important to see the Lpr3 localisation in lpr1 mutants shown in Fig. 2A with the quality of the images shown in Fig. 1F. This applies also to Figs. 4 and 5.

      Responses:

      • The embryonic duct and pore cells are very small and we have not reliably seen puncta within them. In Figs 2 and 5, we supplemented the duct and pore images with those from the epidermis, which is a much larger tissue, allowing us to resolve puncta and matrix structures with better resolution.
      • The laser settings in Figs 2,4,5 (as opposed to Fig. 1) were chosen to avoid saturation of the matrix signal so that we could do accurate quantifications as shown. The images are unmodified with respect to brightness and therefore appear relatively dim – but we think they convey the observations very accurately.

      I would like to see puncta in lpr1-scav2 doubles.

      Response:

      • Puncta in this genotype are shown for the epidermis in Figure 5. It has not been possible to see puncta specifically within the embryonic duct and pore.

      Regarding the central mechanism, one possibility is - what the authors describe - that Lpr1 is needed for Lpr3 accumulation in ducts and tubes. Alternatively, Lpr1 is needed for duct and tube expansion, in lack of which Lpr3 is unable to reach its destination that is the lumina. Scav2, in this scenario, might be antagonist of tube and duct expansion, and thereby rescue the Lpr1 mutant phenotype independently. Admittedly, the non-accumulation of Lpr3 in scav2 mutants argues against a lpr1-independent function of scav2.

      Responses:

      • LPR-1 is indeed needed to maintain duct and pore tube integrity as the tubes grow, but in mutants the tubes appear to collapse at a later stage than we imaged here (Stone et al 2009). The ~normal accumulation of LET-4 and LET-653 further argues that the duct and pore tubes are still intact at the 1.5-to-2-fold stages. Therefore, we conclude that the defect in LPR-3 accumulation precedes duct and pore collapse.
      • The changes we document in the epidermis also show that the lpr-1 mutant affects LPR-3 accumulation in another (non-tube) tissue.

      In any case, to underline the aspect of Lpr1-Scav2 dosage relationship, the authors may also have a look at Lpr3 distribution in lpr1 heterozygous, and lpr1-scav2 double heterozygous worms. In this spirit, it would be interesting to see the semi-dominant effects of scav2 on Lpr3 localisation in lpr1 mutants by microscopy.

      Response:

      • Because of the hermaphroditism of C. elegans, it would be technically challenging to confidently identify heterozygous (vs. homozygous) embryos for confocal imaging. We do not think that the results would be informative enough to warrant the effort, given that we’ve already shown that scav-2 heterozygosity can partly suppress lpr-1. The expectation is that LPR-3 levels would be partially restored in the scav-2 het, but it might take a very large sample size to confidently assess that partial effect.

      One word to the overexpression studies: it is surprising that the amounts of Scav2 delivered by the expression through the grl-2 promoter in the lpr1, scav2 background are almost matching those by the opposite effect of scav2 mutations on lpr1 dysfunction.

      Response:

      • The reviewer refers to the transgenic rescue experiment with the grl-2pro::SCAV-2 transgene. Because the scav-2 mutant phenotype being tested is suppression of lpr-1 lethality, the expected result from scav-2 rescue is to restore the lpr-1 lethal phenotype to the strain. This is exactly the result we see. We have revised the text to more clearly explain the logic.

      One issue concerns the localization of scav2-gfp "rarely" in vesicles: what are these vesicles?

      Response

      • Only a handful of vesicles were seen across all the images we collected, and we have not yet identified them. They could be associated with either SCAV-2 delivery or removal from the plasma membrane, as now stated in the text. SCAV-2 trafficking would be an interesting area for further study but is beyond the scope of this paper.

      One comment to the Let653 transgenes/knock-ins: the localization of transgenic Let653-gfp may be normal in lpr1 mutants because there are wild-type copies in the background.

      Response

      • There are wild type copies of LET-653 in the background, but no wild type copies of LPR-1. Even if the untagged LET-653 would be recruiting the tagged LET-653 as the reviewer suggests, we can still conclude that lpr-1 loss does not prevent the untagged LET-653 (and thus also the tagged LET-653) from accumulating in the duct lumen matrix.

      One thought to the model: if Scav2 has a function in a lpr1 background, this means that yet another transporter X delivers the substrate for Scav2, isn't it?

      Response

      • Yes, we completely agree with this interpretation and have revised the discussion and Figure 8 legend to more explicitly make this point.

      A word on the term haploinsufficient that is used in this study: scav2 mutants would be haploinsufficient if the heterozygous worms died in an otherwise wild-type background.

      Response

      • We disagree with this comment. The term “haploinsufficient” simply means that heterozygosity for a deletion or other loss of function allele can cause a mutant phenotype – the term is not restricted to lethal phenotypes.

        Reviewer #2 (Significance (Required)):

        Alexandra C. Belfi and colleagues wrote the manuscript entitled "Opposing roles for lipocalins and a CD36 family scavenger receptor in apical extracellular matrix-dependent protection of narrow tube integrity" in which they report on their findings on the genetic and cell-biological interaction between the lipid transporters Lpr1 and scav2 in the nematode C. elegans. In principle, these two proteins are involved in shaping the apical extracellular matrix (aECM) of ducts by regulating the amounts of Lpr3 in the extracellular space. While scav2 seems to act cell autonomously, Lpr1 has a non-cell autonomous effect on Lpr3.


      __Referee #3 __ Summary: Using a powerful combination of genetic and quantitative imaging approaches, Belfi et al., describe novel findings on the roles of several lipocalins-secreted lipid carrier proteins-in the production and organization of the apical extracellular matrix (aECM) required for small diameter tube formation in C. elegans. The work comprises a substantial extension of previous studies carried out by the Sundaram lab, which has pioneered studies into the roles of aECM and accessory proteins in creating the duct-pore excretion tube and which also plays a role in patterning of the epidermal cuticle. One core finding is that the lipocalin LPR-1 does not stably associate with the aECM but is instead required for the incorporation of another lipocalin, LPR-3. A second major finding is that reduction of function in SCAV-2, a SCARB family membrane lipid transporter, suppresses lpr-1 mutant lethality along with associated duct-pore defects and mislocalization of LPR-3. Likewise loss of scav-2 partially suppresses defects in two other aECM proteins and restores defects in LPR-3 localization in one of them (let-653). Additional genetic and protein localization studies lead to the model that LPR-1 and SCAV-2 may antagonistically regulate one or more lipid or lipoprotein factors necessary for LPR-3 localization and duct-pore formation. A role for LPR-1 and LPR-3 at lysosomes is clearly implicated based on co-localization studies, although a specific role for lysosomes (or related organelles) is not defined. Finally, MS data suggests that neither LPR-1 or SCAV-2 grossly affect lipid composition in embryos, consistent with dietary interventions failing to affect mutant phenotypes. Ultimately, a plausible schematic model is presented to explain for much of the data.

      __*Major comments:*__

      1. The studies are very thorough, convincing, and generally well described. Conclusions are logical and well grounded. Additional experiments are not required to support the authors' major conclusions, and the data and methods are described in sufficient detail to allow replication. As such, my comments are minor and should be addressable at the authors' discretion in writing.

      Response

      • Thank you for these positive comments

        __Minor comments:__ 2) In the abstract, "tissue-specific suppression" made me think that there was going to be a tissue-specific knockdown experiment, which was not the case. Rather, scav-2 suppression is specific to the duct-pore, which corresponds to where scav-2 is expressed. Consider rewording this.

      Response

      • Wording was changed to “duct/pore-specific suppression”

        3) Page 5. Suggest wording change to, "Whereas LPR-3 incorporates stably into the precuticle, suggesting a structural role in matrix organization, LPR-1..."

      Response

      • Done

        4) LIMP-2 versus LIMP2. Both are used. Uniprot lists LIMP2, but some papers use LIMP-2. Choose one and be consistent.

      Response

      • Everything changed to LIMP2.

        5) Some of the data for S6 Fig wasn't referred to directly in the text. Namely results regarding pcyt-1 and pld-1. I'd suggest incorporating this into the results section possibly using, "As a control for our lipid supplementation experiments..."

      Response

      • These experiments are now described on page 11.

        6) Page 12 bottom. I understand the use of "oppose", but another way to put it is that SCAV-2 and LPR-1 (antagonistically or collectively) modulate aECM composition. Other terms that might confuse some readers are upstream and downstream, although I am OK with their use in the context of this work.

      Response

      • The genetics indicate that lpr-1 and scav-2 have opposite effects on tube shaping and LPR-3 localization, so they do function antagonistically rather than collectively/cooperatively; we decided to keep this terminology.

        7) Page 16. I understand the logic that SCAV-2 is unlikely to directly modulate LPR-3 given its presumed molecular function. But is it possible that LPR-3 levels are already maxed out in the aECM so that loss of SCAV-2 doesn't lead to any increase? Conversely, one could argue that even if acting indirectly, SCAV-2 could have led to increased LPR-3 levels, unless they were already maxed.

      Response

      • This is a good point, and the possibility is now mentioned in the Results (page 9). We also changed our wording in the Abstract and Discussion to acknowledge the possibility that LPR-3 could be the SCAV-2 cargo, though we still don’t favor this model.

        8) Figure legend 1. I did not see an asterisk in figure 1B.

      Response

      • Thanks for catching this error; the text was removed.

        9) Figure 1C. Might want to define the "degree" term in the legend for people outside the field.

      Response

      • We added an explanation to the figure legend.

        10) Fig 1 G. I was just wondering if cuticle autofluorescence was an issue for taking these images.

      Response

      • Cuticle autofluorescence is generally quite dim in L4s with our settings, and it was not an issue at this mid/late L4 stage, which corresponds to when both LPR fusions are at their brightest. Note that both large panels are MAX projections and yet you can’t see any cuticle autofluorescence in the LPR-1 panel.

        11) Fig 2 and others. Please define error bars.

      Response

      • These correspond to the standard deviation; this information is now added to the Methods.

        12) Fig 5. From the images, it looks like lpr-1; scav-2 doubles might have a worse (pre)cuticle defect in LPR-3 localization than lpr-1 singles. If so that would be interesting and would suggest that their relationship with respect to the modulation of LPR-3 is context dependent. Admittedly, the lack of obvious scav-2 expression in the epidermis would not be consistent with an effect (positive or negative).

      Response

      • The lpr-1 scav-2 strain is certainly not improved over lpr-1 but we have not noted any consistent worsening of the phenotype either.

        13) Consider defining Dodt in the first figure legend where it appears.

      Response

      • Dodt gradient contrast imaging is a method of transmitted light imaging similar to DIC and is used on some confocal microscopes. It is now explained in the Methods section. We removed the term from Figure 7 since it seems to be confusing.

        14) For Mander's, is there a reason to report just one of the two findings (M1 or M2) versus both?

      Response

      • We now include the second Manders value in the figure legend and note that this value is much lower (0.25) because much of the red signal is lysosomes (where green would be quenched by acidity).

        15) Consider referring to specific panels (A, B...) within references to the supplemental files.

      Response

      • Done

        16) Fig S6E. Neither "increasing nor increasing" to "increasing nor decreasing".

      Response

      • Fixed

        **Referees cross-commenting**

        I thought that Reviewers 1 and 2 brought up some good points. My sense is that Belfi and colleagues can address most of these in writing, but are of course welcome to add new data as they see fit. I get that it's not a "perfect" paper where everything is explained fully or comes together, but I don't see that as a flaw that needs to be fixed. I think that the manuscript represents a good deal of work (as it is) and provides a sufficient advance while also suggesting an interesting link to disease. It will be up to individual journals to decide if the findings meet their criteria.

        Reviewer #3 (Significance (Required)):

        Significance: The work carried out in this paper, and more generally by the Sundaram lab, always has a ground-breaking element because very few labs in the field have studied in detail the developmental roles and regulation of the aECM, in large part because it can be challenging to dissect. The core findings in this study are rather novel and unexpected, namely the opposing roles of the paralogous LPR-1 and LPR-3 lipocalins and their functional interactions with SCAV-2. The study does stop short of finding specific molecules (lipid or lipoprotein) that would mediate the effects they report, and it wasn't yet clear how the lysosomal co-loc plays a role, but this is not a criticism of the work presented or the forward progress. I was particularly intrigued by the idea, presented in the discussion, that disruption of vascular aECM could potentially account for some of the (complex) observations regarding the role of lipocalins and SCARB proteins in human disease. This would represent a new avenue for researchers to consider and underscores the power of using non-biased approaches in model systems.

        As for all my reviews, this is signed by David Fay.


    1. Overall thoughts: This is an interesting history piece regarding peer review and the development of review over time. Given the author’s conflict of interest and association with the Centre developing MetaROR, I think that this paper might be a better fit for an information page or introduction to the journal and rationale for the creation of MetaROR, rather than being billed as an independent article. Alternatively, more thorough information about advantages to pre-publication review or more downsides/challenges to post-publication review might make the article seem less affiliated. I appreciate seeing the history and current efforts to change peer review, though I am not comfortable broadly encouraging use of these new approaches based on this article alone.

      Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. 

      Consider discussing the benefits of the traditional model of peer review.

      Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.

      3.2: Consider mentioning your conflict of interest here where MetaROR is mentioned.

      With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?

      There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.

      Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.

    2. Response to the Editors and the Reviewers

      I am sincerely grateful to the editors and peer reviewers at MetaROR for their detailed feedback and valuable comments and suggestions. I have addressed each point below.

      Handling editor

      1. “However, the article’s progression and arguments, along with what it seeks to contribute to the literature need refinement and clarification. The argument for PRC is under-developed due to a lack of clarity about what the article means by scientific communication. Clarity here might make the endorsement of PRC seem like less of a foregone conclusion.”

      The structure of the paper (and discussion) has changed significantly to address the feedback.

      2. “I strongly endorse the main theme of most of the reviews, which is that the progression and underlying justifications for this article’s arguments needs a great deal of work. In my view, this article’s main contribution seems to be the evaluation of the three peer review models against the functions of scientific communication. I say ‘seems to be’ because the article is not very clear on that and I hope you will consider clarifying what your manuscript seeks to add to the existing work in this field. In any case, if that assessment of the three models is your main contribution, that part is somewhat underdeveloped. Moreover, I never got the sense that there is clear agreement in the literature about what the tenets of scientific communication are. Note that scientific communication is a field in its own right.”

      I have implemented a more rigorous approach to argumentation in response. “Scientific communication” was replaced by “scholarly communication.”

      3. “I also agree that paper is too strongly worded at times, with limitations and assumptions in the analysis minimised or not stated. For example, all of the typologies and categories drawn could easily be reorganised and there is a high degree of subjectivity in this entire exercise. Subjective choices should be highlighted and made salient for the reader. Note that greater clarity, rigour, and humility may also help with any alleged or actual bias.”

      I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.

      4. “I agree with Reviewer 3 that the ‘we’ perspective is distracting.”

      This has been fixed.

      5. “The paragraph starting with ‘Nevertheless’ on page 2 is very long.”

      The text was restructured.

      6. “There are many points where language could be shortened for readability, for example:

      Page 3: ‘decision on publication’ could be ‘publication decision’.

      Page 5: ‘efficiency of its utilization’ could be ‘its efficiency’.

      Page 7: ‘It should be noted…’ could be ‘Note that…’.”

      I have proofread the text.

      7. “Page 7: ‘It should be noted that..’ – this needs a reference.”

      This statement has been moved to the Discussion section and paraphrased, and a reference has been added:

      “It should be also noted that peer review innovations pull in opposing directions, with some aiming to increase efficiency and reduce costs, while others aim to promote rigor and increase costs (Kaltenbrunner et al., 2022).”

      8. “I’m not sure that registered reports reflect a hypothetico-deductive approach (page 6). For instance, systematic reviews (even non-quantitative ones) are often published as registered reports and Cochrane has required this even before the move towards registered reports in quantitative psychology.”

      I have added this clarification.

      9. “I agree that modular publishing sits uneasily as its own chapter.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      10. “Page 14: ‘The "Publish-Review-Curate" model is universal that we expect to be the future of scientific publishing. The transition will not happen today or tomorrow, but in the next 5-10 years, the number of projects such as eLife, F1000Research, Peer Community in, or MetaROR will rapidly increase’. This seems overly strong (an example of my larger critique and that of the reviewers).”

      This part of the text has been rewritten.

      Reviewer 1

      11. “For example, although Model 3 is less chance to insert bias to the readers, it also weakens the filtering function of the review system. Let’s just think about the dangers of machine-generated articles, paper-mills, p-hacked research reports and so on. Although the editors do some pre-screening for the submissions, in a world with only Model 3 peer review the literature could easily get loaded with even more ‘garbage’ than in a model where additional peers help the screening.”

      I think that generated text is better detected by software tools. At the same time, I have tried to describe the pros and cons of the different models in a more balanced way in the concluding section.

      12. “Compared to registered reports other aspects can come to focus that Model 3 cannot cover. It’s the efficiency of researchers’ work. In the care of registered reports, Stage 1 review can still help researchers to modify or improve their research design or data collection method. Empirical work can be costly and time-consuming and post-publication review can only say that ‘you should have done it differently then it would make sense’.”

      Thank you very much for this valuable contribution. I have added this statement on p. 11.

      13. “Finally, the author puts openness as a strength of Model 3. In my eyes, openness is a separate question. All models can work very openly and transparently in the right circumstances. This dimension is not an inherent part of the models.”

      I think that a model that provides peer reviews for all submissions ensures maximum transparency. However, I have made an effort to make the wording more balanced and to distinguish my personal perspective from the literature.

      14. “In conclusion, I would not make verdict over the models, instead emphasize the different functions they can play in scientific communication.”

      This idea has been reflected now in the concluding section.

      15. “A minor comment: I found that a number of statements lack references in the Introduction. I would have found them useful for statements such as ‘There is a point of view that peer review is included in the implicit contract of the researcher.’”

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      Reviewer 2

      16. “The primary weakness of this article is that it presents itself as an 'analysis' from which they 'conclude' certain results such as their typology, when this appears clearly to be an opinion piece. In my view, this results in a false claim of objectivity which detracts from what would otherwise be an interesting and informative, albeit subjective, discussion, and thus fails to discuss the limitations of this approach.”

      I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.

      17. “A secondary weakness is that the discussion is not well structured and there are some imprecisions of expression that have the potential to confuse, at least at first.”

      The structure of the paper (and discussion) has changed significantly.

      18. “The evidence and reasoning for claims made is patchy or absent. One instance of the former is the discussion of bias in peer review. There are a multitude of studies of such bias and indeed quite a few meta-analyses of these studies. A systematic search could have been done here but there is no attempt to discuss the totality of this literature. Instead, only a few specific studies are cited. Why are these ones chosen? We have no idea. To this extent I am not convinced that the references used here are the most appropriate.”

      I have reviewed the existing references and incorporated additional sources. However, the study does not claim to conduct a systematic literature review; rather, it adopts an interpretative approach to literature analysis.

      19. “Instances of the latter are the claim that ‘The most well-known initiatives at the moment are ResearchEquals and Octopus’ for which no evidence is provided, the claim that ‘we believe that journal-independent peer review is a special case of Model 3’ for which no further argument is provided, and the claim that ‘the function of being the "supreme judge" in deciding what is "good" and "bad" science is taken on by peer review’ for which neither is provided.”

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      20. “A particular example of this weakness, which is perhaps of marginal importance to the overall paper but of strong interest to this reviewer is the rather odd engagement with history within the paper. It is titled "Evolution of Peer Review" but is really focussed on the contemporary state-of-play. Section 2 starts with a short history of peer review in scientific publishing, but that seems intended only to establish what is described as the 'traditional' model of peer review. Given that that short history had just shown how peer review had been continually changing in character over centuries - and indeed Kochetkov goes on to describe further changes - it is a little difficult to work out what 'traditional' might mean here; what was 'traditional' in 2010 was not the same as what was 'traditional' in 1970. It is not clear how seriously this history is being taken. Kochetkov has earlier written that "as early as the beginning of the 21st century, it was argued that the system of peer review is 'broken'" but of course criticisms - including fundamental criticisms - of peer review are much older than this. Overall, this use of history seems designed to privilege the experience of a particular moment in time, that coincides with the start of the metascience reform movement.”

      While the paper addresses some aspects of peer review history, it does not provide a comprehensive examination of this topic. A clarifying statement to this effect has been included in the methodology section.

      “… this section incorporates elements of historical analysis, it does not fully qualify as such because primary sources were not directly utilized. Instead, it functions as an interpretative literature review, and one that is intentionally concise, as a comprehensive history of peer review falls outside the scope of this research”.

      21. “Section 2 also demonstrates some of the second weakness described, a rather loose structure. Having moved from a discussion of the history of peer review to detail the first model, 'traditional' peer review, it then also goes on to describe the problems of this model. This part of the paper is one of the best - and best - evidenced. Given the importance of it to the main thrust of the discussion it should probably have been given more space as a Section all on its own.”

      This section (now Section 4) has been extended, see also previous comment.

      22. “Another example is Section 4 on Modular Publishing, in which Kochetkov notes "Strictly speaking, modular publishing is primarily an innovative approach for the publishing workflow in general rather than specifically for peer review." Kochetkov says "This is why we have placed this innovation in a separate category" but if it is not an innovation in peer review, the bigger question is 'Why was it included in this article at all?'.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      23. “One example of the imprecisions of language is as follows. The author also shifts between the terms 'scientific communication' and 'science communication' but, at least in many contexts familiar to this reviewer, these are not the same things, the former denoting science-internal dissemination of results through publication (which the author considers), conferences and the like (which the author specifically excludes) while the latter denotes the science-external public dissemination of scientific findings to non-technical audiences, which is entirely out of scope for this article.”

      Thank you for your remark. As a non-native speaker, I initially did not grasp the distinction between the terms. However, I believe the phrase ‘scholarly communication’ is the most universally applicable term. This adjustment has now been incorporated into the text.

      24. “A final note is that Section 3, while an interesting discussion, seems largely derivative from a typology of Waltman, with the addition of a consideration of whether a reform is 'radical' or 'incremental', based on how 'disruptive' the reform is. Given that this is inherently a subjective decision, I wonder if it might not have been more informative to consider 'disruptiveness' on a scale and plot it accordingly. This would allow for some range to be imagined for each reform as well; surely reforms might be more or less disruptive depending on how they are implemented. Given that each reform is considered against each model, it is somewhat surprising that this is not presented in a tabular or graphical form.”

      Ultimately, I excluded this metric due to its current reliance on purely subjective judgment. Measuring 'disruptiveness', e.g., through surveys or interviews, remains a task for future research.

      25. “Reconceptualize this as an opinion piece. Where systematic evidence can be drawn upon to make points, use that, but don't be afraid to just present a discussion from what is clearly a well-informed author.”

      I cannot definitively classify this work as an opinion piece. In fact, this manuscript synthesizes elements of a literature review, research article, and opinion essay. My idea was to integrate the strengths of all three genres.

      26. “Reconsider the focus on history and 'evolution' if the point is about the current state of play and evaluation of reforms (much as I would always want to see more studies on the history and evolution of peer review).”

      I have revised the title to better reflect the study’s scope and explicitly emphasize its focus on contemporary developments in the field.

      “Peer Review at the Crossroads”

      27. “Consider ways in which the typology might be expanded, even if at subordinate level.”

      I have updated the typology and introduced the third tier, where it is applicable (see Fig.2).

      Reviewer 3

      28. “In my view, the biggest issue with the current peer review system is the low quality of reviews, but the manuscript only mentions this fleetingly. The current system facilitates publication bias, confirmation bias, and is generally very inconsistent. I think this is partly due to reviewers’ lack of accountability in such a closed peer review system, but I would be curious to hear the author’s ideas about this, more elaborately than they provide them as part of issue 2.”

      I have elaborated on this issue in the footnote.

      29. “I’m missing a section in the introduction on what the goals of peer review are or should be. You mention issues with peer review, and these are mostly fair, but their importance is only made salient if you link them to the goals of peer review. The author does mention some functions of peer review later in the paper, but I think it would be good to expand that discussion and move it to a place earlier in the manuscript.”

      The functions of peer review are summarized in the first paragraph of Introduction.

      30. “Table 1 is intuitive but some background on how the author arrived at these categorizations would be welcome. When is something incremental and when is something radical? Why are some innovations included but not others (e.g., collaborative peer review, see https://content.prereview.org/how-collaborative-peer-review-can-transform-scientific-research/)?”

      Collaborative peer review, namely, Prereview was mentioned in the context of Model 3 (Publish-Review-Curate). However, I have extended this part of the paper.

      31. “‘Training of reviewers through seminars and online courses is part of the strategies of many publishers. At the same time, we have not been able to find statistical data or research to assess the effectiveness of such training.’ (p. 5) There is some literature on this, although not recent. See work by Sara Schroter, for example (Schroter et al., 2004; Schroter et al., 2008).”

      Thank you very much, I have added these studies and a few more recent ones.

      32. “‘It should be noted that most initiatives aimed at improving the quality of peer review simultaneously increase the costs.’ (p. 7) This claim needs some support. Please explicate why this typically is the case and how it should impact our evaluations of these initiatives.”

      I have moved this part to the Discussion section.

      33. “I would rephrase “Idea of the study” in Figure 2 since the other models start with a tangible output (the manuscript). This is the same for registered reports where they submit a tangible report including hypotheses, study design, and analysis plan. In the same vein, I think study design in the rest of the figure might also not be the best phrasing. Maybe the author could use the terminology used by COS (Stage 1 manuscript, and Stage 2 manuscript, see Details & Workflow tab of https://www.cos.io/initiatives/registered-reports). Relatedly, “Author submits the first version of the manuscript” in the first box after the ‘Manuscript (report)’ node maybe a confusing phrase because I think many researchers see the first version of the manuscript as the stage 1 report sent out for stage 1 review.”

      Thank you very much. Stage 1 and Stage 2 manuscripts look like a suitable labelling solution.

      34. “One pathway that is not included in Figure 2 is that authors can decide to not conduct the study when improvements are required. Relatedly, in the publish-review-curate model, is revising the manuscripts based on the reviews not optional as well? Especially in the case of 3a, authors can hardly be forced to make changes even though the reviews are posted on the platform.”

      All four models imply a certain level of generalization; thus, I tried to avoid redundant details. However, I have added this choice to the PRC model (now Model 4).

      35. “I think the author should discuss the importance of ‘open identities’ more. This factor is now not explicitly included in any of the models, while it has been found to be one of the main characteristics of peer review systems (Ross-Hellauer, 2017).”

      This part has been extended.

      36. “More generally, I was wondering why the author chose these three models and not others. What were the inclusion criteria for inclusion in the manuscript? Some information on the underlying process would be welcome, especially when claims like ‘However, we believe that journal-independent peer review is a special case of Model 3 (‘Publish-Review-Curate’).’ are made without substantiation.”

      The study included four generalized models of peer review that involved some level of abstraction.

      37. “Maybe it helps to outline the goals of the paper a bit more clearly in the introduction. This helps the reader to know what to expect.”

      The Introduction has been revised to include the goal and objectives.

      38. “The Modular Publishing section is not inherently related to peer review models, as you mention in the first sentence of that paragraph. As such, I think it would be best to omit this section entirely to maintain the flow of the paper. Alternatively, you could shortly discuss it in the discussion section but a separate paragraph seems too much from my point of view.”

      Modular publishing has been combined with registered reports into the fragmented publishing group of models, now in Section 5.

      39. “Labeling model 3 as post-publication review might be confusing to some readers. I believe many researchers see post-publication review as researchers making comments on preprints, or submitting commentaries to journals. Those activities are substantially different from the publish-review-curate model so I think it is important to distinguish between these types.”

      The label was changed to Publish-Review-Curate model.

      40. “I do not think the conclusions drawn below Table 3 logically follow from the earlier text. For example, why are “all functions of scientific communication implemented most quickly and transparently in Model 3”? It could be that the entire process takes longer in Model 3 (e.g. because reviewers need more time), so that Model 1 and Model 2 lead to outputs quicker. The same holds for the following claim: ‘The additional costs arising from the independent assessment of information based on open reviews are more than compensated by the emerging opportunities for scientific pluralism.’ What is the empirical evidence for this? While I personally do think that Model 3 improves on Model 1, emphatic statements like this require empirical evidence. Maybe the author could provide some suggestions on how we can attain this evidence. Model 2 does have some empirical evidence underpinning its validity (see Scheel, Schijen, Lakens, 2021; Soderberg et al., 2021; Sarafoglou et al. 2022) but more meta-research inquiries into the effectiveness and cost-benefits ratio of registered reports would still be welcome in general.”

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap. I am grateful for the suggested literature on RRs, which I have now integrated into the relevant subsection.

      41. “What is the underlaying source for the claim that openness requires three conditions?”

      I have made an effort to clarify within the text that this reflects my personal stance.

      42. “‘If we do not change our approach, science will either stagnate or transition into other forms of communication.’ (p. 2) I don’t think this claim is supported sufficiently strongly. While I agree there are important problems in peer review, I think would need to be a more in-depth and evidence-based analysis before claims like this can be made.”

      The sentence has been rephrased.

      43. “On some occasions, the author uses ‘we’ while the study is single authored.”

      This has been fixed.

      44. “Figure 1: The top-left arrow from revision to (re-)submission is hidden”

      I have updated Figure 1.

      45. “‘The low level of peer review also contributes to the crisis of reproducibility in scientific research (Stoddart, 2016).’ (p. 4) I assume the author means the low quality of peer review.”

      This has been fixed.

      46. “‘Although this crisis is due to a multitude of factors, the peer review system bears a significant responsibility for it.’ (p. 4) This is also a big claim that is not substantiated”

      I have paraphrased this sentence as “While multiple factors drive this crisis, deficiencies in the peer review process remain a significant contributor.” and added a footnote.

      47. “‘Software for automatic evaluation of scientific papers based on artificial intelligence (AI) has emerged relatively recently” (p. 5) The author could add RegCheck (https://regcheck.app/) here, even though it is still in development. This tool is especially salient in light of the finding that preregistration-paper checks are rarely done as part of reviews (see Syed, 2023)”

      Thank you very much, I have added this information.

      48. “There is a typo in last box of Figure 1 (‘decicion’ instead of ‘decision’). I also found typos in the second box of Figure 2, where ‘screns’ should be ‘screens’, and the author decision box where ‘desicion’ should be ‘decision’”

      This has been fixed.

      49. “Maybe it would be good to mention results blinded review in the first paragraph of 3.2. This is a form of peer review where the study is already carried out but reviewers are blinded to the results. See work by Locascio (2017), Grand et al. (2018), and Woznyj et al. (2018).”

      Thanks, I have added this (now section 5.2)

      50. “Is ‘Not considered for peer review’ in figure 3b not the same as rejected? I feel that it is rejected in the sense that neither the manuscript not the reviews will be posted on the platform.”

      Changed into “Rejected”

      51. “‘In addition to the projects mentioned, there are other platforms, for example, PREreview12, which departs even more radically from the traditional review format due to the decentralized structure of work.’ (p. 11) For completeness, I think it would be helpful to add some more information here, for example why exactly decentralization is a radical departure from the traditional model.”

      I have extended this passage.

      52. “‘However, anonymity is very conditional - there are still many “keys” left in the manuscript, by which one can determine, if not the identity of the author, then his country, research group, or affiliated organization.’ (p.11) I would opt for the neutral ‘their’ here instead of ‘his’, especially given that this is a paragraph about equity and inclusion.”

      This has been fixed.

      53. “‘Thus, “closeness” is not a good way to address biases.’ (p. 11) This might be a straw man argument because I don’t believe researchers have argued that it is a good method to combat biases. If they did, it would be good to cite them here. Alternatively, the sentence could be omitted entirely.”

      I have omitted the sentence.

      54. “I would start the Modular Publishing section with the definition as that allows readers to interpret the other statements better.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now in Section 5, general definition added.

      55. “It would be helpful if the Models were labeled (instead of using Model 1, Model 2, and Model 3) so that readers don’t have to think back what each model involved.”

      All the models represent a kind of generalization, which is why non-detailed labels are used. The text labels may vary depending on the context.

      56. “Table 2: ‘Decision making’ for the editor’s role is quite broad, I recommend to specify and include what kind of decisions need to be made.”

      Changed into “Making accept/reject decisions”

      57. “Table 2: ‘Aim of review’ – I believe the aim of peer review differs also within these models (see the ‘schools of thought’ the author mentions earlier), so maybe a statement on what the review entails would be a better way to phrase this.”

      Changed into “What does peer review entail?”

      58. “Table 2: One could argue that the ‘object of the review’ in Registered Reports is also the manuscript as a whole, just in different stages. As such, I would phrase this differently.”

      Current wording fits your remark: “Manuscript in terms of study design and execution”

      Reviewer 4

      59. “Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. Consider discussing the benefits of the traditional model of peer review.”

      This section has been extended.

      60. “Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.”

      Table 1 has been replaced by Figure 2. I have also extended text descriptions, added definitions.

      61. “With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?”

      Some of these platforms, e.g., F1000, Lifecycle Journal, replace conventional journal publishing. Modular publishing allows for step-by-step feedback from peers. An important advantage of RRs over other peer review models lies in their capacity to enhance research efficiency. By conducting peer review at Stage 1, researchers gain the opportunity to refine their study design or data collection protocols before empirical work begins. Other models of review can offer critiques such as "the study should have been conducted differently" without actionable opportunity for improvement. The key motivation for having my paper reviewed in MetaROR is the quality of peer review – I have never received so many comments, frankly! Moreover, platforms such as MetaROR usually have partnering journals.

      62. “There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.”

      I have omitted these conditions and employed Moore’s Technology Adoption Life Cycle instead. Thank you very much for your comment!

      63. Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Zhu and colleagues used high-density Neuropixel probes to perform laminar recordings in V1 while presenting either small stimuli that stimulated the classical receptive field (CRF) or large stimuli whose border straddled the RF to provide nonclassical RF (nCRF) stimulation. Their main question was to understand the relative contribution of feedforward (FF), feedback (FB), and horizontal circuits to border ownership (Bown), which they addressed by measuring cross-correlation across layers. They found differences in cross-correlation between feedback/horizontal (FH) and input layers during CRF and nCRF stimulation.

      Although the data looks high quality and analyses look mostly fine, I had a lot of difficulty understanding the logic in many places. Examples of my concerns are written below. 

      (1) What is the main question? The authors refer to nCRF stimulation emerging from either feedback from higher areas or horizontal connections from within the same area (e.g. lines 136 to 138 and again lines 223-232). I initially thought that the study would aim to distinguish between the two. However, the way the authors have clubbed the layers in 3D, the main question seems to be whether Bown is FF or FH (i.e., feedback and horizontal are clubbed). Is this correct? If so, I don't see the logic, since I can't imagine Bown to be purely FF. Thus, just showing differences between CRF stimulation (which is mainly expected to be FF) and nCRF stimulation is not surprising to me. 

      We thank the reviewer for their thoughtful comments. As explained in the discussion, we grouped cortical layers to reduce uncertainty in precisely assigning laminar boundaries and to increase statistical power. Consequently, this limits our ability to distinguish the relative contributions of feedback inputs, primarily targeting layers 1 and 6, and horizontal connections, mainly within layers 2/3 and 5. Nevertheless, previous findings, especially regarding the rapid emergence of B<sub>own</sub> signals, suggest that feedback is more biologically plausible than horizontal-based mechanisms.

      Importantly, the emergence of B<sub>own</sub> signals in the primate brain should not be taken for granted. Direct physiological evidence that distinguishes feedforward from feedback/horizontal mechanisms has been lacking. While we agree it is unlikely that B<sub>own</sub> is mediated solely by feedforward processing, we felt it was necessary to test this empirically, particularly using high-resolution laminar recordings.

      As discussed, feedforward models of B<sub>own</sub> have been proposed (e.g., Super, Romeo, and Keil, 2010; Sakai and Nishimura, 2006). These could, in theory, be supported by more general nCRF modulations arising through early feedforward inhibitions, such as those observed in the retinogeniculate pathway (e.g., Webb, Tinsley, Vincent and Derrington, 2005; Blitz and Regehr, 2005; Alitto and Usrey, 2008). However, most B<sub>own</sub> models rely heavily on response latency, yet very few studies have recorded across layers or areas simultaneously to address this directly. Notably, recent findings in area V4 show that B<sub>own</sub> signals emerge earlier in deep layers than in granular (input) layers, suggesting a non-feedforward origin (Franken and Reynolds, 2021).

      Furthermore, although previous studies have shown that the nCRF can modulate firing rates and the timing of neuronal firing across layers, our findings go beyond these effects. We provide clear evidence that nCRF modulation also alters precise spike timing relationships and interlaminar coordination, and that the magnitude of nCRF modulation depends on these interlaminar interactions. This supports the idea that B<sub>own</sub>, or more general nCRF modulation, involves more than local rate changes, reflecting layer-specific network dynamics consistent with feedback or lateral integration.

      (2) Choice of layers for cross-correlation analysis: In the Introduction, and also in Figure 3C, it is mentioned that FF inputs arrive in 4C and 6, while FB/Horizontal inputs arrive at "superficial" and "deep", which I take as layer 2/3 and 5. So it is not clear to me why (i) layer 4A/B is chosen for analysis for Figure 3D (I would have thought layer 6 should have been chosen instead) and (ii) why Layers 5 and 6 are clubbed. 

      We thank the reviewer for raising this important point. The confusion likely stems from our use of the terms “superficial” and “deep” layers when describing the targets of feedback/horizontal inputs. To clarify, by “superficial” and “deep,” we specifically refer to layers 1–3 and layers 5–6, respectively, as illustrated in Figure 3C. Feedback and horizontal inputs relatively avoid all of layer 4, including both 4C and 4A/B.

      We also emphasize that the classification of layers as feedforward or feedback/horizontal recipients is relative rather than absolute. For example, although layer 6 receives both feedforward and feedback/horizontal inputs, it contains a higher proportion of feedback/horizontal inputs compared to layers 4C and 4A/B. 

      We had addressed this rationale in the Discussion, but recognize it may not have been sufficiently emphasized. We have revised the main text accordingly to clarify this point for readers in the final manuscript version.

      (3) Addressing the main question using cross-correlation analysis: I think the nice peaks observed in Figure 3B for some pairs show how spiking in one neuron affects the spiking in another one, with the delay in cross-correlation function arising from the conduction delay. This is shown nicely during CRF stimulation in Figure 3D between 4C -> 2/3, for example. However, the delay (positive or negative) is constrained by anatomical connectivity. For example, unless there are projections from 2/3 back to 4C which causes firing in a 2/3 layer neuron to cause a spike in a layer 4 neuron, we cannot expect to get a negative delay no matter what kind of stimulation (CRF versus nCRF) is used. 

      We thank the reviewer for the insightful comment. The observation that neurons within FH<sub>i</sub> laminar compartments (layers 2/3, 5/6) can lead those in layer 4 (4C, 4A/B) during nCRF stimulation may indeed seem unexpected. However, several anatomical pathways could mediate the propagation of B<sub>own</sub> signals from FH<sub>i</sub> compartments to layer 4. We have revised the Discussion section in the final version of the manuscript to address this point explicitly.

      In macaque V1, projections from layers 2/3 to 4A/B have been documented (Blasdel et al., 1985; Callaway and Wiser, 1996), and neurons in 4A/B often extend apical dendrites into layers 2/3 (Lund, 1988; Yoshioka et al., 1994). Although direct projections from layers 2/3 to 4C are generally sparse (Callaway, 1998), a subset of neurons in the lower part of layer 3 can give off collateral axons to 4C (Lund and Yoshioka, 1991). Additionally, some 4C neurons extend dendrites into 4B, enabling potential dendritic integration of inputs from more superficial layers (Somogyi and Cowey, 1981; Mates and Lund, 1983; Yabuta and Callaway, 1998). Sparse connections from 2/3 to layer 4 have also been reported in cat V1 (Binzegger, Douglas and Martin, 2004). Moreover, layers 2/3 may influence 4C neurons disynaptically, without requiring dense monosynaptic connections.

      Importantly, while CCGs can suggest possible circuit arrangements, functional connectivity may arise through mechanisms not fully captured by traditional anatomical tracing. Indeed, the apparent discrepancy between anatomical and functional data is not uncommon. For example, although 4B is known to receive anatomical input primarily from 4Cα, but not 4Cβ, photostimulation experiments have shown that 4B neurons can also be functionally driven by 4Cβ (Sawatari and Callaway, 1996). Our observation of functional inputs from layers 2/3 to layer 4 is also consistent with prior findings in rodent V1, where CCG analysis (e.g., Figure 7 in Senzai, Fernandez-Ruiz and Buzsaki, 2019) or photostimulation (Xu et al., 2016) revealed similar pathways. 

      Layers 5/6 provide dense projections to layers 4A/B (Lund, 1988; Callaway, 1998). In particular, layer 6 pyramidal neurons, especially the subset classified as Type 1 cells, project substantially to layer 4C (Wiser and Callaway, 1996; Fitzpatrick et al., 1985). 

      Reviewer #2 (Public review): 

      Summary: 

      The authors present a study of how modulatory activity from outside the classical receptive field (cRF) differs from cRF stimulation. They study neural activity across the different layers of V1 in two anesthetized monkeys using Neuropixels probes. The monkeys are presented with drifting gratings and border-ownership tuning stimuli. They find that border-ownership tuning is organized into columns within V1, which is unexpected and exciting, and that the flow of activity from cell-to-cell (as judged by cross-correlograms between single units) is influenced by the type of visual stimulus: border-ownership tuning stimuli vs. drifting-grating stimuli.

      Strengths: 

      The questions addressed by the study are of high interest, and the use of Neuropixels probes yields extremely high numbers of single-units and cross-correlation histograms (CCHs) which makes the results robust. The study is well-described. 

      Weaknesses: 

      The weaknesses of the study are (a) the use of anesthetized animals, which raises questions about the nature of the modulatory signal being measured and the underlying logic of why a change in visual stimulus would produce a reversal in information flow through the cortical microcircuit and (b) the choice of visual stimuli, which do not uniquely isolate feedforward from feedback influences. 

      (1) The modulation latency seems quite short in Figure 2C. Have the authors measured the latency of the effect in the manuscript and how it compares to the onset of the visually driven response? It would be surprising if the latency was much shorter than 70ms given previous measurements of BO and figure-ground modulation latency in V2 and V1. On the same note, it might be revealing to make laminar profiles of the modulation (i.e. preferred - non-preferred border orientation) as it develops over time. Does the modulation start in feedback recipient layers? 

      (2) Can the authors show the average time course of the response elicited by preferred and non-preferred border ownership stimuli across all significant neurons?

      We thank the reviewer for the insightful comment—this is indeed an important and often overlooked point. As noted in the Discussion, B<sub>own</sub> modulation differs from other forms of figure-ground modulation (e.g., Lamme et al., 1998) in that it can emerge very rapidly in early visual cortex—within ~10–35 ms after response onset (Zhou et al., 2000; Sugihara et al., 2011). This rapid emergence has been interpreted as evidence for the involvement of fast feedback inputs, which can propagate up to ten times faster than horizontal connections (Girard et al., 2001). Moreover, interlaminar interactions via monosynaptic or disynaptic connections can occur on very short timescales (a few milliseconds), further complicating efforts to disentangle feedback influences based solely on latency.

      Thus, while the early onset of modulation in our data may appear surprising, it is consistent with prior B<sub>own</sub> findings, and likely reflects a combination of fast feedback and rapid interlaminar processing. This makes it challenging to use conventional latency measurements to resolve laminar differences in B<sub>own</sub> modulation. Latency comparisons are well known to be susceptible to confounds such as variability in response onset, luminance, contrast, stimulus size, and other sensory parameters. 

      Although we did not explicitly quantify the latency of B<sub>own</sub> modulation in this manuscript, our cross-correlation analysis provides a more sensitive and temporally resolved measure of interlaminar information flow. We therefore focused on this approach rather than laminar modulation profiles, as it more directly addresses our primary research question.
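
      For readers less familiar with the measure, the idea behind a pairwise cross-correlogram can be sketched in a few lines. This is only an illustrative sketch, not the analysis code used in the manuscript: the function name `cross_correlogram`, the ±10 ms lag range, and the 1 ms bin width are all assumptions chosen for the example. The sign convention assumed here is that a peak at a positive lag means the target neuron tends to fire after the reference neuron.

```python
import numpy as np

def cross_correlogram(ref_spikes, target_spikes, max_lag=0.010, bin_width=0.001):
    """Histogram of (target - reference) spike-time differences, in seconds.

    Hypothetical helper for illustration; lag range and bin width are
    assumed values, not parameters taken from the manuscript.
    """
    ref = np.asarray(ref_spikes, dtype=float)
    tgt = np.sort(np.asarray(target_spikes, dtype=float))
    n_bins = int(round(2 * max_lag / bin_width))
    edges = np.linspace(-max_lag, max_lag, n_bins + 1)
    counts = np.zeros(n_bins, dtype=int)
    for t in ref:
        # Restrict to target spikes within +/- max_lag of this reference spike.
        lo = np.searchsorted(tgt, t - max_lag, side="left")
        hi = np.searchsorted(tgt, t + max_lag, side="right")
        counts += np.histogram(tgt[lo:hi] - t, bins=edges)[0]
    centers = (edges[:-1] + edges[1:]) / 2  # lag at the middle of each bin
    return centers, counts
```

      Under these assumptions, `centers[np.argmax(counts)]` gives the lag of the largest peak; in the interlaminar setting discussed above, the sign of that lag is what would indicate which laminar compartment tends to lead.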

      (3) The logic of assuming that cRF stimulation should produce the opposite signal flow to border-ownership tuning stimuli is worth discussing. I suspect the key difference between stimuli is that they used drifting gratings as the cRF stimulus, the movement of the stimulus continually refreshes the retinal image, leading to continuous feedforward dominance of the signals in V1. Had they used a static grating, the spiking during the sustained portion of the response might also show more influence of feedback/horizontal connections. Do the initial spikes fired in response to the border-ownership tuning stimuli show the feedforward pattern of responses? The authors state that they did not look at cross-correlations during the initial response, but if they do, do they see the feedforward-dominated pattern? The jitter CCH analysis might suffice in correcting for the response transient.

      We thank the reviewer for the insightful comment. As noted in the final Results section, our CRF and nCRF stimulation paradigms differ in respects beyond the presence or absence of nonclassical modulation, including stimulus properties within the CRF.

      We agree with the reviewer’s speculation that drifting gratings may continually refresh the retinal image, promoting sustained feedforward dominance in V1, whereas static gratings might allow greater influence from feedback/horizontal inputs during the sustained response. Likewise, the initial response to the B<sub>own</sub> stimulus could be dominated by feedforward activity before feedback/horizontal influences arrive. 

      This contrast was a central motivation for our experimental design: we deliberately used two stimulus conditions — drifting gratings to emphasize feedforward processing, and B<sub>own</sub> stimuli, which are known to engage feedback modulation — to test whether these two conditions yield different patterns of interlaminar information flow. Our results confirm that they do. While we did not separately analyze the very initial spike period, our focus is on interlaminar information flow during the sustained response, which serves as the primary measure of feedback/horizontal engagement in this study.

      Finally, beyond this direct comparison, we show in Figure 5 that under nCRF stimulation alone, the direction and strength of interlaminar information flow correlate with the magnitude of B<sub>own</sub> modulation, further supporting the idea that our cross-correlation approach reveals functionally meaningful differences in cortical processing.

      (4) The term "nCRF stimulation" is not appropriate because the CRF is stimulated by the light/dark edge. 

      We thank the reviewer for the comment. As noted in the Introduction, nCRF effects described in the literature invariably involve stimulation both inside and outside the CRF. Our use of the term “nCRF stimulation” refers to this experimental paradigm, rather than suggesting that the CRF itself is unstimulated. We hope this clarifies our use of the term.

      Reviewer #3 (Public review): 

      Summary: 

      The paper by Zhu et al is on an important topic in visual neuroscience, the emergence in the visual cortex of signals about figures and ground. This topic also goes by the name border ownership. The paper utilizes modern recording techniques very skillfully to extend what is known about border ownership. It offers new evidence about the prevalence of border ownership signals across different cortical layers in V1 cortex. Also, it uses pairwise cross-correlation to study signal flow under different conditions of visual stimulation that include the border ownership paradigm. 

      Strengths: 

      The paper's strengths are its use of multi-electrode probes to study border ownership in many neurons simultaneously across the cortical layers in V1, and its innovation of using cross-correlation between cortical neurons -- when they are viewing border-ownership patterns or instead are viewing grating patterns restricted to the classical receptive field (CRF).

      Weaknesses: 

      The paper's weaknesses are its largely incremental approach to the study of border ownership and the lack of a critical analysis of the cross-correlation data. The paper as it is now does not advance our understanding of border ownership; it mainly confirms prior work, and it does not challenge or revise consensus beliefs about mechanisms. However, it is possible that, in the rich dataset the authors have obtained, they do possess data that could be added to the paper to make it much stronger. 

      Critique: 

      The border ownership data on V1 offered in the paper replicates experimental results obtained by Zhou and von der Heydt (2000) and confirms the earlier results using the same analysis methods as Zhou. The incremental addition is that the authors found border ownership in all cortical layers extending Zhou's results that were only about layer 2/3. 

      The cross-correlation results show that the pattern of the cross-correlogram (CCG) is influenced by the visual pattern being presented. However, the results are not analyzed mechanistically, and the interpretation is unclear. For instance, the authors show in Figure 3 (and in Figure S2) that the peak of the CCG can indicate layer 2/3 excites layer 4C when the visual stimulus is the border ownership test pattern, a large square 8 deg on a side. But how can layer 2/3 excite layer 4C? The authors do not raise or offer an answer to this question. Similar questions arise when considering the CCG of layer 4A/B with layer 2/3. What is the proposed pathway for layer 2/3 to excite 4A/B? Other similar questions arise for all the interlaminar CCG data that are presented. What known functional connections would account for the measured CCGs? 

      We thank the reviewer for raising this important point. As noted in our response to a previous comment, several anatomical pathways could mediate apparent functional inputs from layers 2/3 to 4C and 4A/B. In macaque V1, projections from layers 2/3 to 4A/B have been documented (Blasdel et al., 1985; Callaway and Wiser, 1996), and neurons in 4A/B often extend apical dendrites into layers 2/3 (Lund, 1988; Yoshioka et al., 1994). Although direct projections from layers 2/3 to 4C are generally sparse (Callaway, 1998), a subset of lower layer 3 neurons can give off collateral axons to 4C (Lund and Yoshioka, 1991). Some 4C neurons also extend dendrites into 4B, potentially allowing dendritic integration of inputs from more superficial layers (Somogyi and Cowey, 1981; Mates and Lund, 1983; Yabuta and Callaway, 1998). Sparse connections from 2/3 to layer 4 have also been reported in cat V1 (Binzegger et al., 2004).

      Moreover, layers 2/3 may influence 4C neurons disynaptically, without requiring dense monosynaptic connections. While CCGs suggest possible circuit arrangements, functional connectivity may arise through mechanisms not fully captured by anatomical tracing, and apparent discrepancies between anatomical and functional data are not uncommon. For example, although 4B is known to receive anatomical input primarily from 4Cα, 4B neurons can also be functionally driven by 4Cβ using photostimulation (Sawatari and Callaway, 1996). Our observation of functional inputs from layers 2/3 to layer 4 is also consistent with prior findings in rodent V1, where CCG analysis (e.g., Figure 7 in Senzai, Fernandez-Ruiz and Buzsaki, 2019) or photostimulation (Xu et al., 2016) revealed similar pathways. 

      Layers 5/6 also provide dense projections to layers 4A/B (Lund, 1988; Callaway, 1998). In particular, layer 6 pyramidal neurons, especially the subset classified as Type 1 cells, project substantially to layer 4C (Wiser and Callaway, 1996; Fitzpatrick et al., 1985). 

      We have revised the Discussion section to explicitly address these points and clarify the potential anatomical and functional pathways underlying the measured interlaminar CCGs, highlighting how inputs from layers 2/3 and 5/6 to layer 4 can be mediated via both direct and indirect connections.

      The problems in understanding the CCG data are indirectly caused by the lack of a critical analysis of what is happening in the responses that reveal the border ownership signals, as in Figure 2. Let's put it bluntly - are border ownership signals excitatory or inhibitory? The reason I raise this question is that the present authors insightfully place border ownership as examples of the action of the non-classical receptive field (nCRF) of cortical cells. Most previous work on the nCRF (many papers cited by the authors) reveal the nCRF to be inhibitory or suppressive. In order to know whether nCRF signals are excitatory or inhibitory, one needs a baseline response from the CRF, so that when you introduce nCRF signals you can tell whether the change with respect to the CRF is up or down. As far as I know, prior work on border ownership has not addressed this question, and the present paper doesn't either. This is where the rich dataset that the present authors possess might be used to establish a fundamental property of border ownership. 

      Then we must go back to consider what the consequences of knowing the sign of the border ownership signal would mean for interpreting the CCG data. If the border ownership signals from extrastriate feedback or, alternatively, from horizontal intrinsic connections, are excitatory, they might provide a shared excitatory input to pairs of cells that would show up in the CCG as a peak at 0 delay. However, if the border ownership signals are inhibitory, they might work by exciting only inhibitory neurons in V1. This could have complicated consequences for the CCG. The interpretation of the CCG data in the present version of the manuscript is unclear (see above). Perhaps a clearer interpretation could be developed once the authors know better what the border ownership signals are.

      We thank the reviewer for raising this fundamental and thought-provoking question. As noted, B<sub>own</sub> signals arise from nCRF, which has often been associated with suppressive effects. However, Zhang and von der Heydt (2010) provided important insight into this issue by systematically varying the placement of figure fragments outside the CRF while keeping an edge centered within the CRF. They found that contextual fragments on the preferred side of B<sub>own</sub> produce facilitation, while those on the non-preferred side produce suppression. Thus, the nCRF contribution to B<sub>own</sub> reflects both excitatory and inhibitory modulation, depending on the spatial configuration of the figure.

      These effects were well explained by their model in which feedback from grouping cells in higher areas selectively enhances or suppresses V1/V2 neuron responses, depending on their B<sub>own</sub> preference. In this framework, the B<sub>own</sub> signal itself is not inherently excitatory or inhibitory; rather, it results from the net effect of feedback, which can be either facilitative or suppressive. Importantly, it is the input that is modulated — not that the receiving neurons are necessarily inhibitory themselves.

      In the current study, our analysis focused on CCGs showing excessive coincident spiking, i.e., positive peaks, which are typically interpreted as evidence for shared excitatory input or excitatory connections. Due to the limited number of connections, we did not analyze inhibitory interactions, such as anti-correlations or delayed suppression in the CCGs, which would be expected if the reference neuron were inhibitory. Therefore, the CCGs we report here likely reflect the excitatory component of the B<sub>own</sub> signal, and possibly its upstream drive via feedback. While a full separation of excitatory and inhibitory components remains an important goal for future work, our data suggest that B<sub>own</sub> modulation is at least partially mediated through excitatory feedback input.

      My critique of the CCG analysis applies to Figure 5 also. I cannot comprehend the point of showing a very weak correlation of CCG asymmetry with Border Ownership Index, especially when what CCG asymmetry means is unclear mechanistically. Figure 5 does not make the paper stronger in my opinion. 

      We thank the reviewer for this comment. As described in the Results section for Figure 5, the observation that interlaminar information flow correlates with B<sub>own</sub> modulation is important because it demonstrates that these flow patterns are specifically related to the magnitude of B<sub>own</sub> signals, independent of the comparisons between CRF and nCRF stimulation. 

      In Figure 3, the authors show two CCGs that involve 4C--4C pairs. It would be nice to know more about such pairs. If there are any 6--6 pairs, what they look like also would be interesting. The authors also in Figure 3 show CCG's of two 4C--4A/B pairs and it would be quite interesting to know how such CCGs behave when CRF and nCRF stimuli are compared. In other words, the authors have shown us they have many data but have chosen not to analyze them further or to explain why they chose not to analyze them. It might help the paper if the authors would present all the CCG types they have. This suggestion would be helpful when the authors know more about the sign of border ownership signals, as discussed at length above. 

      We thank the reviewer for the insightful comment. The rationale for selecting specific laminar pairs is described in the Results section after Figure 3C and further discussed in the Discussion. In brief, we focused on CCGs computed from pairs in which one neuron resided in laminar compartments receiving feedback/horizontal inputs (layers 2/3 and 5/6) and the other within compartments relatively devoid of these inputs (layers 4C and 4A/B).

      To mitigate uncertainty in defining exact laminar boundaries and to maximize statistical power, we combined some anatomical layers into distinct laminar compartments. This approach allowed us to compare the relative spike timing between neuronal pairs during CRF and nCRF stimulation. If feedback/horizontal inputs contribute more during nCRF than CRF stimulation, we expect this to be reflected in the lead-lag relationships of the CCGs. While other pairs (e.g., 5/6–5/6 or 4C– 4A/B) could in principle be analyzed, the hypothesized patterns for these pairs are less clear, and thus they were not the focus of our study. Nonetheless, these additional pairs represent interesting directions for future work.
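
      The lead-lag reasoning above can be illustrated with a minimal Python sketch of a raw cross-correlogram. This is not the authors' analysis pipeline: the function names, the 0/1 binned spike trains, and the asymmetry index (positive-lag counts minus negative-lag counts, divided by their sum) are assumptions made for illustration, and the sketch omits the normalization and correction steps a real CCG analysis would typically apply.

```python
def cross_correlogram(ref_spikes, target_spikes, max_lag):
    """Count coincidences between two binned (0/1) spike trains.

    A positive lag means the target neuron fires after the reference neuron.
    Returns a dict mapping lag (in bins) to coincidence count.
    """
    n = len(ref_spikes)
    if len(target_spikes) != n:
        raise ValueError("spike trains must have the same number of bins")
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must be in [0, number of bins)")
    ccg = {}
    for lag in range(-max_lag, max_lag + 1):
        lo = max(0, -lag)   # keep t + lag >= 0
        hi = min(n, n - lag)  # keep t + lag < n
        ccg[lag] = sum(ref_spikes[t] * target_spikes[t + lag] for t in range(lo, hi))
    return ccg


def asymmetry_index(ccg):
    """Hypothetical lead-lag index in [-1, 1]; positive if the reference leads."""
    pos = sum(count for lag, count in ccg.items() if lag > 0)
    neg = sum(count for lag, count in ccg.items() if lag < 0)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


# Toy data: the target neuron fires 2 bins after each reference spike.
n_bins = 100
ref = [0] * n_bins
tgt = [0] * n_bins
for t in (10, 50, 90):
    ref[t] = 1
    tgt[t + 2] = 1

ccg = cross_correlogram(ref, tgt, max_lag=5)
peak_lag = max(ccg, key=ccg.get)
print(peak_lag, asymmetry_index(ccg))
```

      In this toy case all coincidences fall at a positive lag, so the index takes its maximum value; a shift in such an index between CRF and nCRF stimulation is the kind of lead-lag change the comparison looks for.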

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank all the reviewers for their constructive comments. We have carefully considered your feedback and revised the manuscript accordingly. The major concern raised was the applicability of SegPore to the RNA004 dataset. To address this, we compared SegPore with f5c and Uncalled4 on RNA004, and found that SegPore demonstrated improved performance, as shown in Table 2 of the revised manuscript.

      Following the reviewers’ recommendations, we updated Figures 3 and 4. Additionally, we added one table and three supplementary figures to the revised manuscript:

      · Table 2: Segmentation benchmark on RNA004 data

      · Supplementary Figure S4: RNA translocation hypothesis illustrated on RNA004 data

      · Supplementary Figure S5: Illustration of Nanopolish raw signal segmentation with eventalign results

      · Supplementary Figure S6: Running time of SegPore on datasets of varying sizes

      Below, we provide a point-by-point response to your comments.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors describe a new computational method (SegPore), which segments the raw signal from nanopore-direct RNA-Seq data to improve the identification of RNA modifications. In addition to signal segmentation, SegPore includes a Gaussian Mixture Model approach to differentiate modified and unmodified bases. SegPore uses Nanopolish to define a first segmentation, which is then refined into base and transition blocks. SegPore also includes a modification prediction model that is included in the output. The authors evaluate the segmentation in comparison to Nanopolish and Tombo, and they evaluate the impact on m6A RNA modification detection using data with known m6A sites. In comparison to existing methods, SegPore appears to improve the ability to detect m6A, suggesting that this approach could be used to improve the analysis of direct RNA-Seq data.

      Strengths:

      SegPore addresses an important problem (signal data segmentation). By refining the signal into transition and base blocks, noise appears to be reduced, leading to improved m6A identification at the site level as well as for single-read predictions. The authors provide a fully documented implementation, including a GPU version that reduces run time. The authors provide a detailed methods description, and the approach to refine segments appears to be new.

      Weaknesses:

      In addition to Nanopolish and Tombo, f5c and Uncalled4 can also be used for segmentation; however, the comparison to these methods is not shown.

      The method was only applied to data from the RNA002 direct RNA-Sequencing version, which is no longer available; it currently remains unclear whether the method still works on RNA004.

      Thank you for your comments.

      To clarify the background, there are two kits for Nanopore direct RNA sequencing: RNA002 (the older version) and RNA004 (the newer version). Oxford Nanopore Technologies (ONT) introduced the RNA004 kit in early 2024 and has since discontinued RNA002. Consequently, most public datasets are based on RNA002, with relatively few available for RNA004 (as of 30 June 2025).

      Nanopolish and Tombo were developed for raw signal segmentation and alignment using RNA002 data, whereas f5c and Uncalled4 are the only two software tools supporting RNA004 data. Since the development of SegPore began in January 2022, we initially focused on RNA002 due to its data availability. Accordingly, our original comparisons were made against Nanopolish and Tombo using RNA002 data.

      We have now updated SegPore to support RNA004 and compared its performance against f5c and Uncalled4 on three public RNA004 datasets.

      As shown in Table 2 of the revised manuscript, SegPore outperforms both f5c and Uncalled4 in raw signal segmentation. Moreover, the jiggling translocation hypothesis underlying SegPore is further supported, as shown in Supplementary Figure S4.

      The overall improvement in accuracy appears to be relatively small.

      Thank you for the comment.

      We understand that the improvements shown in Tables 1 and 2 may appear modest at first glance due to the small differences in the reported standard deviation (std) values. However, even small absolute changes in std can correspond to substantial relative reductions in noise, especially when the total variance is low.

      To better quantify the improvement, we assume that approximately 20% of the std for Nanopolish, Tombo, f5c, and Uncalled4 arises from noise. Using this assumption, we calculate the relative noise reduction rate of SegPore as follows:

      Noise reduction rate = (baseline std − SegPore std) / (0.2 × baseline std)

      Based on this formula, the average noise reduction rates across all datasets are:

      - SegPore vs Nanopolish: 49.52%

      - SegPore vs Tombo: 167.80%

      - SegPore vs f5c: 9.44%

      - SegPore vs Uncalled4: 136.70%

      These results demonstrate that SegPore can reduce the noise level by at least 9% given a noise level of 20%, which we consider a meaningful improvement for downstream tasks, such as base modification detection and signal interpretation. The high noise reduction rates observed in Tombo and Uncalled4 (over 100%) suggest that their actual noise proportion may be higher than our 20% assumption.

      We acknowledge that this 20% noise level assumption is an approximation. Our intention is to illustrate that SegPore provides measurable improvements in relative terms, even when absolute differences appear small.
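
      As a worked illustration of the noise reduction rate defined above, the following minimal Python sketch evaluates the formula. The function name and the std values are hypothetical and are not taken from Tables 1 or 2; the 20% noise fraction is the assumption stated in the response.

```python
def noise_reduction_rate(baseline_std, segpore_std, noise_fraction=0.2):
    """Relative noise reduction, assuming `noise_fraction` of the baseline std is noise."""
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    if not 0 < noise_fraction <= 1:
        raise ValueError("noise_fraction must be in (0, 1]")
    return (baseline_std - segpore_std) / (noise_fraction * baseline_std)


# Hypothetical std values, chosen only to illustrate the arithmetic.
modest = noise_reduction_rate(baseline_std=3.0, segpore_std=2.7)
large = noise_reduction_rate(baseline_std=3.0, segpore_std=2.2)

print(f"{modest:.1%}")  # a 10% drop in std is half of the assumed noise share
print(f"{large:.1%}")   # exceeds 100% once the drop is larger than the assumed noise share
```

      The second case shows why rates above 100% indicate that the assumed noise fraction is too low for that baseline: the std reduction is larger than the noise the assumption allows for.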

      The run time and resources that are required to run SegPore are not shown, however, it appears that the GPU version is essential, which could limit the application of this method in practice.

      Thank you for your comment.

      Detailed instructions for running SegPore are provided on GitHub (https://github.com/guangzhaocs/SegPore). Regarding computational resources, SegPore currently requires one CPU core and one Nvidia GPU to perform the segmentation task efficiently.

      We present SegPore’s runtime for typical datasets in Supplementary Figure S6 in the revised manuscript. For a typical 1 GB fast5 file, the segmentation takes approximately 9.4 hours using a single NVIDIA DGX‑1 V100 GPU and one CPU core.

      Currently, GPU acceleration is essential to achieve practical runtimes with SegPore. We acknowledge that this requirement may limit accessibility in some environments. To address this, we are actively working on a full C++ implementation of SegPore that will support CPU-only execution. While development is ongoing, we aim to release this version in a future update.

      Reviewer #2 (Public review):

      Summary:

      The work seeks to improve the detection of RNA m6A modifications using Nanopore sequencing through improvements in raw data analysis. These improvements are said to be in the segmentation of the raw data, although the work appears to position the alignment of raw data to the reference sequence and some further processing as part of the segmentation, and result statistics are mostly shown on the 'data-assigned-to-kmer' level.

      As such, the title, abstract, and introduction stating the improvement of just the 'segmentation' does not seem to match the work the manuscript actually presents, as the wording seems a bit too limited for the work involved.

      The work itself shows minor improvements in m6Anet when replacing Nanopolish eventalign with this new approach, but clear improvements in the distributions of data assigned per kmer. However, these assignments were improved well enough to enable m6A calling from them directly, both at site-level and at read-level.

      Strengths:

      A large part of the improvements shown appear to stem from the addition of extra, non-base/kmer specific, states in the segmentation/assignment of the raw data, removing a significant portion of what can be considered technical noise for further analysis. Previous methods enforced the assignment of all raw data, forcing a technically optimal alignment that may lead to suboptimal results in downstream processing as data points could be assigned to neighbouring kmers instead, while random noise that is assigned to the correct kmer may also lead to errors in modification detection.

      For an optimal alignment between the raw signal and the reference sequence, this approach may yield improvements for downstream processing using other tools. Additionally, the GMM used for calling the m6A modifications provides a useful, simple, and understandable logic to explain the reason a modification was called, as opposed to the black-box models that are nowadays often employed for these types of tasks.

      Weaknesses:

      The work seems limited in applicability largely due to the focus on the R9's 5mer models. The R9 flow cells are phased out and not available to buy anymore. Instead, the R10 flow cells with larger kmer models are the new standard, and the applicability of this tool on such data is not shown. We may expect similar behaviour from the raw sequencing data where the noise and transition states are still helpful, but the increased kmer size introduces a large amount of extra computing required to process data and without knowledge of how SegPore scales, it is difficult to tell how useful it will really be. The discussion suggests possible accuracy improvements moving to 7mers or 9mers, but no reason why this was not attempted.

      Thank you for pointing out this important limitation. Please refer to our response to Point 1 of Reviewer 1 for SegPore’s performance on RNA004 data. Notably, the jiggling behavior is also observed in RNA004 data, and SegPore achieves better performance than both f5c and Uncalled4.

      The increased k-mer size in RNA004 affects only the training phase of SegPore (refer to Supplementary Note 1, Figure 5 for details on the training and testing phases). Once the baseline means and standard deviations for each k-mer are established, applying SegPore to RNA004 data proceeds similarly to RNA002. This is because each k-mer in the reference sequence has, at most, two states (modified and unmodified). While the larger k-mer size increases the size of the parameter table, it does not increase the computational complexity during segmentation. Although estimating the initial k-mer parameter table requires significant time and effort on our part, it does not affect the runtime for end users applying SegPore to RNA004 data.
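
      The scaling argument above can be illustrated with a small Python sketch. The table contents are placeholders rather than ONT or SegPore parameters, the function names are hypothetical, and k = 9 is used only as an example of a larger k-mer size. The sketch shows that the parameter table grows as 4^k, while the number of lookups during segmentation depends only on the length of the reference.

```python
from itertools import product


def build_kmer_table(k, alphabet="ACGT"):
    """Return a table with placeholder (mean, std) pairs for both states of every k-mer."""
    placeholder = (0.0, 1.0)  # hypothetical values, not real model parameters
    return {
        "".join(letters): {"unmodified": placeholder, "modified": placeholder}
        for letters in product(alphabet, repeat=k)
    }


def reference_kmers(reference, k):
    """List the overlapping k-mers of a reference sequence, in order."""
    return [reference[i:i + k] for i in range(len(reference) - k + 1)]


table5 = build_kmer_table(5)
kmers = reference_kmers("AACTGGTTTC", 5)  # short example reference

# The table size grows exponentially with k ...
size_k5 = len(table5)
size_k9 = 4 ** 9  # computed rather than built, to keep the example light

# ... but segmentation needs one dictionary lookup per reference k-mer,
# each with at most two states, regardless of how large the table is.
params = [table5[kmer] for kmer in kmers]
print(size_k5, size_k9, len(params))
```

      With k = 5 the table has 4^5 = 1,024 entries and with k = 9 it would have 4^9 = 262,144, yet a ten-base reference still yields only six 5-mer lookups.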

      Extending SegPore from 5-mers to 7-mers or 9-mers for RNA002 data would require substantial effort to retrain the model and generate sufficient training data. Additionally, such an extension would make SegPore’s output incompatible with widely used upstream and downstream tools such as Nanopolish and m6Anet, complicating integration and comparison. For these reasons, we leave this extension for future work.

      The manuscript suggests the eventalign results are improved compared to Nanopolish. While this is believably shown to be true (Table 1), the effect on the use case presented, downstream differentiation between modified and unmodified status on a base/kmer, is likely limited as during actual modification calling the noisy distributions are usually 'good enough', and not skewed significantly in one direction to really affect the results too terribly.

      Thank you for your comment. While current state-of-the-art (SOTA) methods perform well on benchmark datasets, there remains significant room for improvement. Most SOTA evaluations are based on limited datasets, primarily covering DRACH motifs in human and mouse transcriptomes. However, m6A modifications can also occur in non-DRACH motifs, where current models may underperform. Additionally, other RNA modifications—such as pseudouridine, inosine, and m5C—are less studied, and their detection may benefit from improved signal modeling.

      We would also like to emphasize that raw signal segmentation and RNA modification detection are distinct tasks. SegPore focuses on the former, providing a cleaner, more interpretable signal that can serve as a foundation for downstream tasks. Improved segmentation may facilitate the development of more accurate RNA modification detection algorithms by the community.

      Scientific progress often builds incrementally through targeted improvements to foundational components. We believe that enhancing signal segmentation, as SegPore does, contributes meaningfully to the broader field—the full impact will become clearer as the tool is adopted into more complex workflows.

      Furthermore, looking at alternative approaches where this kind of segmentation could be applied, Nanopolish uses the main segmentation+alignment for a first alignment and follows up with a form of targeted local realignment/HMM test for modification calling (and for training too), decreasing the need for the near-perfect segmentation+alignment this work attempts to provide. Any tool applying a similar strategy probably largely negates the problems this manuscript aims to improve upon.

      We thank the reviewer for this insightful comment.

      To clarify, Nanopolish provides three independent commands: polya, eventalign, and call-methylation.

      - The polya command identifies the adapter, poly(A) tail, and transcript region in the raw signal.

      - The eventalign command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference.

      - The call-methylation command detects methylated bases from DNA sequencing data.

      The eventalign command corresponds to “the main segmentation+alignment for a first alignment,” while call-methylation corresponds to “a form of targeted local realignment/HMM test for modification calling,” as mentioned in the reviewer’s comment. SegPore’s segmentation is similar in purpose to Nanopolish’s eventalign, while its RNA modification estimation component is similar in concept to Nanopolish’s call-methylation.

      We agree the general idea may appear similar, but the implementations are entirely different. Importantly, Nanopolish’s call-methylation is designed for DNA sequencing data, and its models are not trained to recognize RNA modifications. This means they address distinct research questions and cannot be directly compared on the same RNA modification estimation task. However, it is valid to compare them on the segmentation task, where SegPore exhibits better performance (Table 1).

      We infer the reviewer may suggest that because m6Anet is a deep neural network capable of learning from noisy input, the benefit of more accurate segmentation (such as that provided by SegPore) might be limited. This concern may arise from the limited improvement of SegPore+m6Anet over Nanopolish+m6Anet in bulk analysis (Figure 3). Several factors may contribute to this observation:

      (i) For reads aligned to the same gene in the in vivo data, alignment may be inaccurate due to pseudogenes or transcript isoforms.

      (ii) The in vivo benchmark data are inherently more complex than in vitro datasets and may contain additional modifications (e.g., m5C, m7G), which can confound m6A calling by altering the signal baselines of k-mers.

      (iii) m6Anet is trained on events produced by Nanopolish and may not be optimal for SegPore-derived events.

      (iv) The benchmark dataset lacks a modification-free (IVT) control sample, making it difficult to establish a true baseline for each k-mer.

      In the IVT data (Figure 4), SegPore shows a clear improvement in single-molecule m6A identification, with a 3~4% gain in both ROC-AUC and PR-AUC. This demonstrates SegPore’s practical benefit for applications requiring higher sensitivity at the molecule level.

      As noted earlier, SegPore’s contribution lies in denoising and improving the accuracy of raw signal segmentation, which is a foundational step in many downstream analyses. While it may not yet lead to a dramatic improvement in all applications, it already provides valuable insights into the sequencing process (e.g., cleaner signal profiles in Figure 4) and enables measurable gains in modification detection at the single-read level. We believe SegPore lays the groundwork for developing more accurate and generalizable RNA modification detection tools beyond m6A.

      We have also added the following sentence in the discussion to highlight SegPore’s limited performance in bulk analysis:

      “The limited improvement of SegPore combined with m6Anet over Nanopolish+m6Anet in bulk in vivo analysis (Figure 3) may be explained by several factors: potential alignment inaccuracies due to pseudogenes or transcript isoforms, the complexity of in vivo datasets containing additional RNA modifications (e.g., m5C, m7G) affecting signal baselines, and the fact that m6Anet is specifically trained on events produced by Nanopolish rather than SegPore. Additionally, the lack of a modification-free control (in vitro transcribed) sample in the benchmark dataset makes it difficult to establish true baselines for each k-mer. Despite these limitations, SegPore demonstrates clear improvement in single-molecule m6A identification in IVT data (Figure 4), suggesting it is particularly well suited for in vitro transcription data analysis.”

      Finally, in the segmentation/alignment comparison to Nanopolish, the latter was not fitted(/trained) on the same data but appears to use the pre-trained model it comes with. For the sake of comparing segmentation/alignment quality directly, fitting Nanopolish on the same data used for SegPore could remove the influences of using different training datasets and focus on differences stemming from the algorithm itself.

      In the segmentation benchmark (Table 1), SegPore uses the fixed 5-mer parameter table provided by ONT. The hyperparameters of the HHMM are also fixed and not estimated from the raw signal data being segmented. Only in the m6A modification task does SegPore re-estimate the baselines for the modified and unmodified states of k-mers. Therefore, the comparison with Nanopolish is fair, as both tools rely on pre-defined models during segmentation.

      Appraisal:

      The authors have shown their method's ability to identify noise in the raw signal and remove their values from the segmentation and alignment, reducing its influences for further analyses. Figures directly comparing the values per kmer do show a visibly improved assignment of raw data per kmer. As a replacement for Nanopolish eventalign it seems to have a rather limited, but improved effect, on m6Anet results. For modification calling at the single-read level, this work does appear to improve upon CHEUI.

      Impact:

      With the current developments for Nanopore-based modification largely focusing on Artificial Intelligence, Neural Networks, and the like, improvements made in interpretable approaches provide an important alternative that enables a deeper understanding of the data rather than providing a tool that plainly answers the question of whether a base is modified or not, without further explanation. The work presented is best viewed in the context of a workflow where one aims to get an optimal alignment between raw signal data and the reference base sequence for further processing. For example, as presented, as a possible replacement for Nanopolish eventalign. Here it might enable data exploration and downstream modification calling without the need for local realignments or other approaches that re-consider the distribution of raw data around the target motif, such as a 'local' Hidden Markov Model or Neural Networks. These possibilities are useful for a deeper understanding of the data and further tool development for modification detection works beyond m6A calling.

      Reviewer #3 (Public review):

      Summary:

      Nucleotide modifications are important regulators of biological function; however, until recently, their study has been limited by the availability of appropriate analytical methods. Oxford Nanopore direct RNA sequencing preserves nucleotide modifications, permitting their study; however, many different nucleotide modifications lack an available base-caller to accurately identify them. Furthermore, existing tools are computationally intensive, and their results can be difficult to interpret.

      Cheng et al. present SegPore, a method designed to improve the segmentation of direct RNA sequencing data and boost the accuracy of modified base detection.

      Strengths:

      This method is well-described and has been benchmarked against a range of publicly available base callers that have been designed to detect modified nucleotides.

      Weaknesses:

      However, the manuscript has a significant drawback in its current version. The most recent nanopore RNA base callers can distinguish between different ribonucleotide modifications; however, SegPore has not been benchmarked against these models.

      I recommend re-submission of the manuscript with benchmarking against the rna004_130bps_hac@v5.1.0 and rna004_130bps_sup@v5.1.0 dorado models, which are reported to detect m5C, m6A_DRACH, inosine_m6A and PseU. A clear demonstration that SegPore also outperforms the newer RNA base caller models will confirm the utility of this method.

      Thank you for highlighting this important limitation. While Dorado, the new ONT basecaller, is publicly available and supports modification-aware basecalling, suitable public datasets for benchmarking m5C, inosine, m6A, and PseU detection on RNA004 are currently lacking. Dorado’s modification-aware models are trained on ONT’s internal data, which is not publicly released. Therefore, it is not currently feasible to evaluate or directly compare SegPore’s performance against Dorado for m5C, inosine, m6A, and PseU detection.

      We would also like to emphasize that SegPore’s main contribution lies in raw signal segmentation, which is an upstream task in the RNA modification detection pipeline. To assess its performance in this context, we benchmarked SegPore against f5c and Uncalled4 on public RNA004 datasets for segmentation quality. Please refer to our response to Point 1 of Reviewer 1 for details.

      Our results show that the characteristic “jiggling” behavior is also observed in RNA004 data (Supplementary Figure S4), and SegPore achieves better segmentation performance than both f5c and Uncalled4 (Table 2).

      Recommendations for the authors:

      Reviewing Editor:

      Please note that we also received the following comments on the submission, which we encourage you to take into account:

      I took a look at the work and, from what I saw, it only mentions/uses RNA002 chemistry, which is deprecated, effectively making this software unusable by anyone any more, as RNA002 is not commercially available. While the results seem promising, the authors need to show that it would work for RNA004. Notably, there is an alternative software for resquiggling for RNA004: not Tombo or Nanopolish, but the GPU-accelerated version of Nanopolish (f5c), which does support RNA004. Therefore, they need to show that SegPore works for RNA004, because otherwise it is pointless to see that this method works better than others if it does not support current sequencing chemistries and only works for deprecated chemistries, and people will keep using f5c because it's the only one that currently works for RNA004. Alternatively, if there were biological insights won from the method, one could justify not implementing it in RNA004, but in this case, RNA002 has been deprecated since March 2024, and the paper is purely methodological.

      Thank you for the comment. We agree that support for current sequencing chemistries is essential for practical utility. While SegPore was initially developed and benchmarked on RNA002 due to the availability of public data, we have now extended SegPore to support RNA004 chemistry.

      To address this concern, we performed a benchmark comparison using public RNA004 datasets against tools specifically designed for RNA004, including f5c and Uncalled4. Please refer to our response to Point 1 of Reviewer 1 for details. The results show that SegPore consistently outperforms f5c and Uncalled4 in segmentation accuracy on RNA004 data.

      Reviewer #2 (Recommendations for the authors):

      Various statements are made throughout the text that require further explanation, which might actually be defined in more detail elsewhere sometimes but are simply hard to find in the current form.

      (1) Page 2, “In this technique, five nucleotides (5mers) reside in the nanopore at a time, and each 5mer generates a characteristic current signal based on its unique sequence and chemical properties (16).”

      5mer? Still on R9 or just ignoring longer range influences, relevant? It is indeed a R9.4 model from ONT.

      Thank you for the observation. We apologize for the confusion and have clarified the relevant paragraph to indicate that the method is developed for RNA002 data by default. Specifically, we have added the following sentence:

      “Two versions of the direct RNA sequencing (DRS) kits are available: RNA002 and RNA004. Unless otherwise specified, this study focuses on RNA002 data.”

      (2) Page 3, “Employ models like Hidden Markov Models (HMM) to segment the signal, but they are prone to noise and inaccuracies.”

      That's the alignment/calling part, not the segmentation?

      Thank you for the comment. We apologize for the confusion. To clarify the distinction between segmentation and alignment, we added a new paragraph before the one in question to explain the general workflow of Nanopore DRS data analysis and to clearly define the task of segmentation. The added text reads:

      “The general workflow of Nanopore direct RNA sequencing (DRS) data analysis is as follows. First, the raw electrical signal from a read is basecalled using tools such as Guppy or Dorado, which produce the nucleotide sequence of the RNA molecule. However, these basecalled sequences do not include the precise start and end positions of each ribonucleotide (or k-mer) in the signal. Because basecalling errors are common, the sequences are typically mapped to a reference genome or transcriptome using minimap2 to recover the correct reference sequence. Next, tools such as Nanopolish and Tombo align the raw signal to the reference sequence to determine which portion of the signal corresponds to each k-mer. We define this process as the segmentation task, referred to as "eventalign" in Nanopolish. Based on this alignment, Nanopolish extracts various features—such as the start and end positions, mean, and standard deviation of the signal segment corresponding to a k-mer. This signal segment or its derived features is referred to as an "event" in Nanopolish.”

      We also revised the following paragraph describing SegPore to more clearly contrast its approach:

      “In SegPore, we first segment the raw signal into small fragments using a Hierarchical Hidden Markov Model (HHMM), where each fragment corresponds to a sub-state of a k-mer. Unlike Nanopolish and Tombo, which directly align the raw signal to the reference sequence, SegPore aligns the mean values of these small fragments to the reference. After alignment, we concatenate all fragments that map to the same k-mer into a larger segment, analogous to the "eventalign" output in Nanopolish. For RNA modification estimation, we use only the mean signal value of each reconstructed event.”

      We hope this revision clarifies the difference between segmentation and alignment in the context of our method and resolves the reviewer’s concern.

      (3) Page 4, Figure 1, “These segments are then aligned with the 5mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5mer parameter table. For example, 𝐴𝑗 denotes the base "A" at the j-th position on the reference.”

      I think I do understand the meaning, but I do not understand the relevance of the Aj bit in the last sentence. What is it used for?

      When aligning the segments (output from Step 2) to the reference sequence in Step 3, it is possible for multiple segments to align to the same k-mer. This can occur particularly when the reference contains consecutive identical bases, such as multiple adenines (A). For example, as shown in Fig. 1A, Step 3, the first two segments (μ₁ and μ₂) are aligned to the first 'A' in the reference sequence, while the third segment is aligned to the second 'A'. In this case, the reference sequence is AACTGGTTTC...GTC, which contains exactly two consecutive 'A's at the start. The 𝐴𝑗 notation identifies which of the repeated bases a segment is aligned to, and so helps to disambiguate segment alignment in regions with repeated bases.

      Additionally, this figure and its subscript include mapping with Guppy and Minimap2 but do not mention Nanopolish at all, while that seems an equally important step in the preprocessing (pg5). As such it is difficult to understand the role Nanopolish exactly plays. It's also not mentioned explicitly in the SegPore Workflow on pg15, perhaps it's part of step 1 there?

      We thank the reviewer for pointing this out. We apologize for the confusion. As mentioned in the public response to point 3 of Reviewer 2, SegPore uses Nanopolish to identify the poly(A) tail and transcript regions from the raw signal. SegPore then performs segmentation and alignment on the transcript portion only. This step is indeed part of Step 1 in the preprocessing workflow, as described in Supplementary Note 1, Section 3.

      To clarify this in the main text, we have updated the preprocessing paragraph on page 6 to explicitly describe the role of Nanopolish:

      “We begin by performing basecalling on the input fast5 file using Guppy, which converts the raw signal data into ribonucleotide sequences. Next, we align the basecalled sequences to the reference genome using Minimap2, generating a mapping between the reads and the reference sequences. Nanopolish provides two independent commands: "polya" and "eventalign". The "polya" command identifies the adapter, poly(A) tail, and transcript region in the raw signal, which we refer to as the poly(A) detection results. The raw signal segment corresponding to the poly(A) tail is used to standardize the raw signal for each read. The "eventalign" command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference. It also computes summary statistics (e.g., mean, standard deviation) from the signal segment for each k-mer. Each k-mer together with its corresponding signal features is termed an event. These event features are then passed into downstream tools such as m6Anet and CHEUI for RNA modification detection. For full transcriptome analysis (Figure 3), we extract the aligned raw signal segment and reference sequence segment from Nanopolish's events for each read by using the first and last events as start and end points. For in vitro transcription (IVT) data with a known reference sequence (Figure 4), we extract the raw signal segment corresponding to the transcript region for each input read based on Nanopolish’s poly(A) detection results.”

      Additionally, we revised the legend of Figure 1A to explicitly include Nanopolish in step 1 as follows:

      “The raw current signal fragments are paired with the corresponding reference RNA sequence fragments using Nanopolish.”

      (4) Page 5, “The output of Step 3 is the "eventalign," which is analogous to the output generated by the Nanopolish "eventalign" command.”

      Naming the function of Nanopolish, the output file, and later on (pg9) the alignment of the newly introduced methods the exact same "eventalign" is very confusing.

      Thank you for the helpful comment. We acknowledge the potential confusion caused by using the term “eventalign” in multiple contexts. To improve clarity, we now consistently use the term “events” to refer to the output of both Nanopolish and SegPore, rather than using "eventalign" as a noun. We also added the following sentence to Step 3 (page 6) to clearly define what an “event” refers to in our manuscript:

      “An "event" refers to a segment of the raw signal that is aligned to a specific k-mer on a read, along with its associated features such as start and end positions, mean current, standard deviation, and other relevant statistics.”

      We have revised the text throughout the manuscript accordingly to reduce ambiguity and ensure consistent terminology.

      (5) Page 5, “Once aligned, we use Nanopolish's eventalign to obtain paired raw current signal segments and the corresponding fragments of the reference sequence, providing a precise association between the raw signals and the nucleotide sequence.”

      I thought the new method's HHMM was supposed to output an 'eventalign' formatted file. As this is not clearly mentioned elsewhere, is this a mistake in writing? Is this workflow dependent on Nanopolish 'eventalign' function and output or not?

      We apologize for the confusion. To clarify, SegPore is not dependent on Nanopolish’s eventalign function for generating the final segmentation results. As described in our response to your comment point 2 and elaborated in the revised text on page 4, SegPore uses its own HHMM-based segmentation model to divide the raw signal into small fragments, each corresponding to a sub-state of a k-mer. These fragments are then aligned to the reference sequence based on their mean current values.

      As explained in the revised manuscript:

      “In SegPore, we first segment the raw signal into small fragments using a Hierarchical Hidden Markov Model (HHMM), where each fragment corresponds to a sub-state of a k-mer. Unlike Nanopolish and Tombo, which directly align the raw signal to the reference sequence, SegPore aligns the mean values of these small fragments to the reference. After alignment, we concatenate all fragments that map to the same k-mer into a larger segment, analogous to the "eventalign" output in Nanopolish. For RNA modification estimation, we use only the mean signal value of each reconstructed event.”

      To avoid ambiguity, we have also revised the sentence on page 5 to more clearly distinguish the roles of Nanopolish and SegPore in the workflow. The updated sentence now reads:

      “Nanopolish provides two independent commands: "polya" and "eventalign". The "polya" command identifies the adapter, poly(A) tail, and transcript region in the raw signal, which we refer to as the poly(A) detection results. The raw signal segment corresponding to the poly(A) tail is used to standardize the raw signal for each read. The "eventalign" command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference. It also computes summary statistics (e.g., mean, standard deviation) from the signal segment for each k-mer. Each k-mer together with its corresponding signal features is termed an event. These event features are then passed into downstream tools such as m6Anet and CHEUI for RNA modification detection. For full transcriptome analysis (Figure 3), we extract the aligned raw signal segment and reference sequence segment from Nanopolish's events for each read by using the first and last events as start and end points. For in vitro transcription (IVT) data with a known reference sequence (Figure 4), we extract the raw signal segment corresponding to the transcript region for each input read based on Nanopolish’s poly(A) detection results.”

      (6) Page 5, “Since the polyA tail provides a stable reference, we normalize the raw current signals across reads, ensuring that the mean and standard deviation of the polyA tail are consistent across all reads.”

      Perhaps I misread this statement: I interpret it as using the PolyA tail to do the normalization, rather than using the rest of the signal to do the normalization, and that results in consistent PolyA tails across all reads.

      If it's the latter, this should be clarified, and a little detail on how the normalization is done should be added, but if my first interpretation is correct:

      I'm not sure if its standard deviation is consistent across reads. The (true) value spread in this section of a read should be fairly limited compared to the rest of the signal in the read, so the noise would influence the scale quite quickly, and such noise might be introduced to pores wearing down and other technical influences. Is this really better than using the non-PolyA tail part of the reads signal, using Median Absolute Deviation to scale for a first alignment round, then re-fitting the signal scaling using Theil Sen on the resulting alignments (assigned read signal vs reference expected signal), as Tombo/Nanopolish (can) do?

      Additionally, this kind of normalization should have been part of the Nanopolish eventalign already, can this not be re-used? If it's done differently it may result in different distributions than the ONT kmer table obtained for the next step.

      Thank you for this detailed and thoughtful comment. We apologize for the confusion. The poly(A) tail–based normalization is indeed explained in Supplementary Note 1, Section 3, but we agree that the motivation needed to be clarified in the main text.

      We have now added the following text in the revised manuscript (before the original statement on page 5) to provide clearer context:

      “Due to inherent variability between nanopores in the sequencing device, the baseline levels and standard deviations of k-mer signals can differ across reads, even for the same transcript. To standardize the signal for downstream analyses, we extract the raw current signal segments corresponding to the poly(A) tail of each read. Since the poly(A) tail provides a stable reference, we normalize the raw current signals across reads, ensuring that the mean and standard deviation of the poly(A) tail are consistent across all reads. This step is crucial for reducing…..”

      We chose to use the poly(A) tail for normalization because it is sequence-invariant—i.e., all poly(A) tails consist of identical k-mers, unlike transcript sequences which vary in composition. In contrast, using the transcript region for normalization can introduce biases: for instance, reads with more diverse k-mers (having inherently broader signal distributions) would be forced to match the variance of reads with more uniform k-mers, potentially distorting the baseline across k-mers.
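      As an illustration only, the poly(A)-based standardization described above can be sketched as a per-read linear rescaling. This is a minimal sketch, not SegPore's implementation: the function name, the target mean and standard deviation, and the poly(A) boundary arguments are all hypothetical, and the actual procedure is the one in Supplementary Note 1, Section 3.

```python
from statistics import fmean, pstdev

# Hypothetical target values; the constants used by SegPore are not stated here.
TARGET_MEAN = 100.0
TARGET_STD = 3.0


def normalize_by_polya(signal, polya_start, polya_end,
                       target_mean=TARGET_MEAN, target_std=TARGET_STD):
    """Linearly rescale one read so its poly(A) segment has the target mean and std.

    signal      -- list of raw current values for the whole read
    polya_start -- index of the first poly(A) signal point (inclusive)
    polya_end   -- index one past the last poly(A) signal point (exclusive)
    """
    polya = signal[polya_start:polya_end]
    if len(polya) < 2:
        raise ValueError("poly(A) segment too short to estimate mean and std")
    mu = fmean(polya)
    sigma = pstdev(polya)
    if sigma == 0:
        raise ValueError("poly(A) segment has zero variance")
    scale = target_std / sigma
    # The same shift and scale are applied to every point of the read,
    # so the transcript region is expressed relative to the poly(A) tail.
    return [(x - mu) * scale + target_mean for x in signal]
```

      Under this sketch, two reads that differ only by a pore-specific offset and gain map to the same normalized signal, which is the property the paragraph above relies on.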

      In our newly added RNA004 benchmark experiment, we used the default normalization provided by f5c, which does not include poly(A) tail normalization. Despite this, SegPore was still able to mask out noise and outperform both f5c and Uncalled4, demonstrating that our segmentation method is robust to different normalization strategies.

      (7) Page 7, “The initialization of the 5mer parameter table is a critical step in SegPore's workflow. By leveraging ONT's established kmer models, we ensure that the initial estimates for unmodified 5mers are grounded in empirical data.”

      It looks like the method uses Nanopolish for a first alignment, then improves the segmentation matching the reference sequence/expected 5mer values. I thought the Nanopolish model/tables are based on the same data, or similarly obtained. If they are different, then why the switch of kmer model? Now the original alignment may have been based on other values, and thus the alignment may seem off with the expected kmer values of this table.

      Thank you for this insightful question. To clarify, SegPore uses Nanopolish only to identify the poly(A) tail and transcript regions from the raw signal. In the bulk in vivo data analysis, we use Nanopolish’s first event as the start and the last event as the end to extract the aligned raw signal chunk and its corresponding reference sequence. Since SegPore relies on Nanopolish solely to delineate the transcript region for each read, it independently aligns the raw signals to the reference sequence without refining or adjusting Nanopolish’s segmentation results.

      While SegPore's 5-mer parameter table is initially seeded using ONT’s published unmodified k-mer models, we acknowledge that empirical signal values may deviate from these reference models due to run-specific technical variation and the presence of RNA modifications. For this reason, SegPore includes a parameter re-estimation step to refine the mean and standard deviation values of each k-mer based on the current dataset.

      The re-estimation process consists of two layers. In the outer layer, we select a set of 5mers that exhibit both modified and unmodified states based on the GMM results (Section 6 of Supplementary Note 1), while the remaining 5mers are assumed to have only unmodified states. In the inner layer, we align the raw signals to the reference sequences using the 5mer parameter table estimated in the outer layer (Section 5 of Supplementary Note 1). Based on the alignment results, we update the 5mer parameter table in the outer layer. This two-layer process is generally repeated for 3 to 5 iterations until the 5mer parameter table converges. This re-estimation ensures that:

      (1) The adjusted 5mer signal baselines remain close to the ONT reference (for consistency);

      (2) The alignment score between the observed signal and the reference sequence is optimized (as detailed in Equation 11, Section 5 of Supplementary Note 1);

      (3) Only 5mers that show a clear difference between the modified and unmodified components in the GMM are considered subject to modification.

      By doing so, SegPore achieves more accurate signal alignment independent of Nanopolish’s models, and the alignment is directly tuned to the data under analysis.

      (8) Page 9, “The output of the alignment algorithm is an eventalign, which pairs the base blocks with the 5mers from the reference sequence for each read (Fig. 1C).”

      “Modification prediction

      After obtaining the eventalign results, we estimate the modification state of each motif using the 5mer parameter table.”

      This wording seems to have been introduced on page 5 but (also there) reads a bit confusingly as the name of the output format, file, and function are now named the exact same "eventalign". I assume the obtained eventalign results now refer to the output of your HHMM, and not the original Nanopolish eventalign results, based on context only, but I'd rather have a clear naming that enables more differentiation.

      We apologize for the confusion. We have revised the sentence as follows for clarity:

      “A detailed description of both alignment algorithms is provided in Supplementary Note 1. The output of the alignment algorithm is an alignment that pairs the base blocks with the 5mers from the reference sequence for each read (Fig. 1C). Base blocks aligned to the same 5-mer are concatenated into a single raw signal segment (referred to as an “event”), from which various features—such as start and end positions, mean current, and standard deviation—are extracted. Detailed derivation of the mean and standard deviation is provided in Section 5.3 in Supplementary Note 1. In the remainder of this paper, we refer to these resulting events as the output of eventalign analysis or the segmentation task. ”
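      To make the notion of an "event" concrete, the merging step can be sketched as follows. This is a simplified illustration under stated assumptions: the `BaseBlock` and `Event` types and the `blocks_to_events` function are hypothetical names, blocks are assumed to be sorted by position, and the mean and standard deviation are computed naively over the whole merged span rather than by the derivation in Section 5.3 of Supplementary Note 1.

```python
from dataclasses import dataclass
from itertools import groupby
from statistics import fmean, pstdev


@dataclass
class BaseBlock:
    start: int       # index of the first signal point (inclusive)
    end: int         # index one past the last signal point (exclusive)
    kmer_index: int  # position of the aligned 5-mer on the reference


@dataclass
class Event:
    kmer_index: int
    start: int
    end: int
    mean: float
    std: float


def blocks_to_events(signal, blocks):
    """Merge consecutive base blocks aligned to the same 5-mer into one event."""
    events = []
    for kmer_index, group in groupby(blocks, key=lambda b: b.kmer_index):
        group = list(group)
        start, end = group[0].start, group[-1].end
        # Assumption: the event spans everything between the first and last
        # block, including any short gaps between them.
        points = signal[start:end]
        events.append(Event(kmer_index, start, end, fmean(points), pstdev(points)))
    return events
```

      For example, two base blocks aligned to the first 5-mer followed by one block aligned to the second yield two events, one per 5-mer.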

      (9) Page 9, “Since a single 5mer can be aligned with multiple base blocks, we merge all aligned base blocks by calculating a weighted mean. This weighted mean represents the single base block mean aligned with the given 5mer, allowing us to estimate the modification state for each site of a read.”

      I assume the weights depend on the length of the segment but I don't think it is explicitly stated while it should be.

      Thank you for the helpful observation. To improve clarity, we have moved this explanation to the last paragraph of the previous section (see response to point 8), where we describe the segmentation process in more detail.

      Additionally, a complete explanation of how the weighted mean is computed is provided in Section 5.3 of Supplementary Note 1. It is derived from signal points that are assigned to a given 5mer.
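      As a rough illustration of the reviewer's reading, a length-weighted mean over base blocks could look like the sketch below. The choice of weights (the number of signal points in each block) is an assumption made for this example, and the function name is hypothetical; the definition actually used is the one in Section 5.3 of Supplementary Note 1.

```python
def weighted_block_mean(block_means, block_lengths):
    """Combine per-block means into one mean, weighting each block by its length.

    Assumption: the weight of a block is its number of signal points, so the
    result equals the plain mean of all points pooled across the blocks.
    """
    if not block_means or len(block_means) != len(block_lengths):
        raise ValueError("need one length per block mean, and at least one block")
    total = sum(block_lengths)
    if total <= 0:
        raise ValueError("block lengths must sum to a positive number")
    return sum(m * n for m, n in zip(block_means, block_lengths)) / total
```

      With a 30-point block at 100 pA and a 10-point block at 104 pA, this gives 101 pA, closer to the longer block.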

      (10) Page 10, “Afterward, we manually adjust the 5mer parameter table using heuristics to ensure that the modified 5mer distribution is significantly distinct from the unmodified distribution.”

      Using what heuristics? If this is explained in the supplementary notes then please refer to the exact section.

      Thank you for pointing this out. The heuristics used to manually adjust the 5mer parameter table are indeed explained in detail in Section 7 of Supplementary Note 1.

      To clarify this in the manuscript, we have revised the sentence as follows:

      “Afterward, we manually adjust the 5mer parameter table using heuristics to ensure that the modified 5mer distribution is significantly distinct from the unmodified distribution (see details in Section 7 of Supplementary Note 1).”

      (11) Page 10, “Once the table is fixed, it is used for RNA modification estimation in the test data without further updates.”

      By what tool/algorithm? Perhaps it is your own implementation, but with the next section going into segmentation benchmarking and using Nanopolish before this seems undefined.

      Thank you for pointing this out. We use our own implementation. See Algorithm 3 in Section 6 of Supplementary Note 1.

      We have revised the sentence for clarity:

      “Once a stabilized 5mer parameter table is estimated from the training data, it is used for RNA modification estimation in the test data without further updates. A more detailed description of the GMM re-estimation process is provided in Section 6 of Supplementary Note 1.”

      (12) Page 11, “A 5mer was considered significantly modified if its read coverage exceeded 1,500 and the distance between the means of the two Gaussian components in the GMM was greater than 5.”

      Considering the scaling done before also not being very detailed in what range to expect, this cutoff doesn't provide any useful information. Is this a pA value?

      Thank you for the observation. Yes, the value refers to the current difference measured in picoamperes (pA). To clarify this, we have revised the sentence in the manuscript to include the unit explicitly:

      “A 5mer was considered significantly modified if its read coverage exceeded 1,500 and the distance between the means of the two Gaussian components in the GMM was greater than 5 picoamperes (pA).”
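      The criterion in the revised sentence amounts to two strict thresholds, which can be written as a small check. The function and constant names are hypothetical; the thresholds are the ones stated above, and the two component means are assumed to come from a GMM that has already been fitted.

```python
MIN_COVERAGE = 1500         # read coverage must exceed this value
MIN_MEAN_DISTANCE_PA = 5.0  # GMM component means must differ by more than this (pA)


def is_significantly_modified(read_coverage, mean_unmodified_pa, mean_modified_pa,
                              min_coverage=MIN_COVERAGE,
                              min_distance_pa=MIN_MEAN_DISTANCE_PA):
    """Return True if a 5mer passes both the coverage and the mean-distance cutoff."""
    distance = abs(mean_unmodified_pa - mean_modified_pa)
    return read_coverage > min_coverage and distance > min_distance_pa
```

      Both comparisons are strict, matching the wording "exceeded 1,500" and "greater than 5".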

      (13) Page 13, “The raw current signals, as shown in Figure 1B.”

      Wrong figure? Figure 2B seems logical.

      Thank you for catching this. You are correct—the reference should be to Figure 2B, not Figure 1B. We have corrected this in the revised manuscript.

      (14) Page 14, Figure 2A, these figures supposedly support the jiggle hypothesis but the examples seem to match only half the explanation. Any of these jiggles seem to be followed shortly by another in the opposite direction, and the amplitude seems to match better within each such pair than the next or previous segments. Perhaps there is a better explanation still, and this behaviour can be modelled as such instead.

      Thank you for your comment. We acknowledge that the observed signal patterns may appear ambiguous and could potentially suggest alternative explanations. However, as shown in Figure 2A, the red dots tend to align closely with the baseline of the previous state, while the blue dots align more closely with the baseline of the next state. We interpret this as evidence for the "jiggling" hypothesis, where a k-mer temporarily oscillates between adjacent states during translocation.

      That said, we agree that more sophisticated models could be explored to better capture this behavior, and we welcome suggestions or references to alternative models. We will consider this direction in future work.

      (15) Page 15, “This occurs because subtle transitions within a base block may be mistaken for transitions between blocks, leading to inflated transition counts.”

      Is it really a "subtle transition" if it happens within a base block? It seems this is not a transition and thus shouldn't be named as such.

      Thank you for pointing this out. We agree that the term “subtle transition” may be misleading in this context. We revised the sentence to clarify the potential underlying cause of the inflated transition counts:

      “This may be due to a base block actually corresponding to a sub-state of a single 5mer, rather than each base block corresponding to a full 5mer, leading to inflated transition counts. To address this issue, SegPore’s alignment algorithm was refined to merge multiple base blocks (which may represent sub-states of the same 5mer) into a single 5mer, thereby facilitating further analysis.”

      (16) Page 15, “The SegPore "eventalign" output is similar to Nanopolish's "eventalign" command.”

      To the output of that command, I presume, not to the command itself.

      Thank you for pointing out the ambiguity. We have revised the sentence for clarity:

      “The final outputs of SegPore are the events and modification state predictions. SegPore’s events are similar to the outputs of Nanopolish’s "eventalign" command, in that they pair raw current signal segments with the corresponding RNA reference 5-mers. Each 5-mer is associated with various features — such as start and end positions, mean current, and standard deviation — derived from the paired signal segment.”

      (17) Page 15, “For selected 5mers, SegPore also provides the modification rate for each site and the modification state of that site on individual reads.”

      What selection? Just all kmers with a possible modified base or a more specific subset?

      We revised the sentence to clarify the selection criteria:

      “For selected 5mers that exhibit both a clearly unmodified and a clearly modified signal component, SegPore reports the modification rate at each site, as well as the modification state of that site on individual reads.”

      (18) Page 16, “A key component of SegPore is the 5mer parameter table, which specifies the mean and standard deviation for each 5mer in both modified and unmodified states (Figure 2A).”

      Wrong figure?

      Thank you for pointing this out. You are correct—it should be Figure 1A, not Figure 2A. We intended to visually illustrate the structure of the 5mer parameter table in Figure 1A, and we have corrected this reference in the revised manuscript.

      (19) Page 16, Table 1, I can't quite tell but I assume this is based on all kmers in the table, not just a m6A modified subset. A short added statement to make this clearer would help.

      Yes, you are right—it is averaged over all 5mers. We have revised the sentence for clarity as follows:

      “As shown in Table 1, SegPore consistently achieved the best performance averaged on all 5mers across all datasets…”

      (20) Page 16, “Since the peaks (representing modified and unmodified states) are separable for only a subset of 5mers, SegPore can provide modification parameters for these specific 5mers. For other 5mers, modification state predictions are unavailable.”

      Can this be improved using some heuristics rather than the 'distance of 5' cutoff as described before? How small or big is this subset, compared to how many there should be to cover all cases?

      We agree that more sophisticated strategies could potentially improve performance. In this study, we adopted a relatively conservative approach to minimize false positives by using a heuristic cutoff of 5 picoamperes. This value was selected empirically and we did not explore alternative cutoffs. Future work could investigate more refined or data-driven thresholding strategies.

      (21) Page 16, “Tombo used the "resquiggle" method to segment the raw signals, and we standardized the segments using the polyA tail to ensure a fair comparison.”

      I don't know what or how something is "standardized" here.

      ‘Standardized’ refers to the poly(A) tail–based signal normalization described in our response to point 6. We applied this normalization to Tombo’s output to ensure a fair comparison across methods. Without this standardization, Tombo’s performance was notably worse. We revised the sentence as follows:

      “Tombo used the "resquiggle" method to segment the raw signals, and we standardized the segments using the poly(A) tail to ensure a fair comparison (See preprocessing section in Materials and Methods).”

      (22) Page 16, “To benchmark segmentation performance, we used two key metrics: (1) the log-likelihood of the segment mean, which measures how closely the segment matches ONT's 5mer parameter table (used as ground truth), and (2) the standard deviation (std) of the segment, where a lower std indicates reduced noise and better segmentation quality. If the raw signal segment aligns correctly with the corresponding 5mer, its mean should closely match ONT's reference, yielding a high log-likelihood. A lower std of the segment reflects less noise and better performance overall.”

      Here the segmentation part becomes a bit odd:

      A: Low std can be/is achieved by dropping any noisy bits, making segments really small (partly what happens here with the transition segments). This may be 'true' here, in the sense that the transition is not really part of the segment, but the comparison table is a bit meaningless as the other tools forcibly assign all data to kmers, instead of ignoring parts as transition states. In other words, it is a benchmark that is easy to cheat by assigning more data to noise/transition states.

      B: The values shown are influenced by the alignment made between the read and expected reference signal. Especially Tombo tends to forcibly assign data to whatever looks the most similar nearby rather than providing the correct alignment. So the "benchmark of the segmentation performance" is more of an "overall benchmark of the raw signal alignment". Which is still a good, useful thing, but the text seems to suggest something else.

      Thank you for raising these important concerns regarding the segmentation benchmarking.

      Regarding point A, the base blocks aligned to the same 5mer are concatenated into a single segment, including the short transition blocks between them. These transition blocks are typically very short (4~10 signal points, average 6 points), while a typical 5mer segment contains around 20~60 signal points. To assess whether SegPore’s performance is inflated by excluding transition segments, we conducted an additional comparison: we removed 6 boundary signal points (3 from the start and 3 from the end) from each 5mer segment in Nanopolish and Tombo’s results to reduce potential noise. The new comparison table is shown in the following:

      SegPore consistently demonstrates superior performance. Its key contribution lies in its ability to recognize structured noise in the raw signal and to derive more accurate mean and standard deviation values that more faithfully represent the true state of the k-mer in the pore. The improved mean estimates are evidenced by the clearly separated peaks of modified and unmodified 5mers in Figures 3A and 4B, while the improved standard deviation is reflected in the segmentation benchmark experiments.
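      For readers who want to reproduce the flavor of this comparison, the two benchmark metrics and the boundary-trimming control can be sketched as below. This is an illustrative sketch, not the benchmark code: it assumes the log-likelihood is that of the segment mean under a Gaussian with the reference mean and standard deviation of the 5mer, and the function names and the `trim` parameter are hypothetical.

```python
import math
from statistics import fmean, pstdev


def gaussian_log_likelihood(x, mean, std):
    """Log density of x under a normal distribution N(mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def segment_metrics(segment, ref_mean, ref_std, trim=0):
    """Return (log-likelihood of the segment mean, std of the segment).

    segment  -- signal points assigned to one 5mer
    ref_mean -- reference mean of that 5mer (e.g. from the ONT k-mer table)
    ref_std  -- reference standard deviation of that 5mer
    trim     -- number of points dropped from EACH end before scoring
    """
    core = segment[trim:len(segment) - trim]
    if not core:
        raise ValueError("segment is empty after trimming")
    seg_mean = fmean(core)
    return gaussian_log_likelihood(seg_mean, ref_mean, ref_std), pstdev(core)
```

      Calling `segment_metrics(segment, ref_mean, ref_std, trim=3)` corresponds to removing 3 points from the start and 3 from the end of each segment, as in the control described above.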

      Regarding point B, we apologize for the confusion. We have added a new paragraph to the introduction to clarify that the segmentation task indeed includes the alignment step.

      “The general workflow of Nanopore direct RNA sequencing (DRS) data analysis is as follows. First, the raw electrical signal from a read is basecalled using tools such as Guppy or Dorado, which produce the nucleotide sequence of the RNA molecule. However, these basecalled sequences do not include the precise start and end positions of each ribonucleotide (or k-mer) in the signal. Because basecalling errors are common, the sequences are typically mapped to a reference genome or transcriptome using minimap2 to recover the correct reference sequence. Next, tools such as Nanopolish and Tombo align the raw signal to the reference sequence to determine which portion of the signal corresponds to each k-mer. We define this process as the segmentation task, referred to as "eventalign" in Nanopolish. Based on this alignment, Nanopolish extracts various features—such as the start and end positions, mean, and standard deviation of the signal segment corresponding to a k-mer. This signal segment or its derived features is referred to as an "event" in Nanopolish. The resulting events serve as input for downstream RNA modification detection tools such as m6Anet and CHEUI.”

      (23) Page 17 “Given the comparable methods and input data requirements, we benchmarked SegPore against several baseline tools, including Tombo, MINES (26), Nanom6A (27), m6Anet, Epinano (28), and CHEUI (29).”

      It seems m6Anet is actually Nanopolish+m6Anet in Figure 3C, this needs a minor clarification here.

      m6Anet uses Nanopolish’s estimated events as input by default.

      (24) Page 18, Figure 3, A and B are figures without any indication of what is on the axis and from the text I believe the position next to each other on the x-axis rather than overlapping is meaningless, while their spread is relevant, as we're looking at the distribution of raw values for this 5mer. The figure as is is rather confusing.

      Thanks for pointing out the confusion. We have added concrete values to the axes in Figures 3A and 3B and revised the figure legend as follows in the manuscript:

      “(A) Histogram of the estimated mean from current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA).

      (B) Histogram of the estimated mean from current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA).”

      (25) Page 18 “SegPore's results show a more pronounced bimodal distribution in the raw signal segment mean, indicating clearer separation of modified and unmodified signals.”

      Without knowing the correct values around the target kmer (like Figure 4B), just the more defined bimodal distribution could also indicate the (wrongful) assignment of neighbouring kmer values to this kmer instead, hence this statement lacks some needed support, this is just one interpretation of the possible reasons.

      Thank you for the comment. We have added concrete values to Figures 3A and 3B to support this point. Both peaks fall within a reasonable range: the unmodified peak (125 pA) is approximately 1.17 pA away from its reference value of 123.83 pA, and the modified peak (118 pA) is around 7 pA away from the unmodified peak. This shift is consistent with expected signal changes due to RNA modifications (usually less than 10 pA), and the magnitude of the difference suggests that the observed bimodality is more likely caused by true modification events rather than misalignment.

      (26) Page 18 “Furthermore, when pooling all reads mapped to m6A-modified locations at the GGACT motif, SegPore showed prominent peaks (Fig. 3B), suggesting reduced noise and improved modification detection.”

      I don't think the prominent peaks directly suggest improved detection, this statement is a tad overreaching.

      We revised the sentence to the following:

      “SegPore exhibited more distinct peaks (Fig. 3B), indicating reduced noise and potentially enabling more reliable modification detection”.

      (27) Page 18 “(2) direct m6A predictions from SegPore's Gaussian Mixture Model (GMM), which is limited to the six selected 5mers.”

      The 'six selected' refers to what exactly? Also, 'why' this is limited to them is also unclear as it is, and it probably would become clearer if it is clearly defined what this refers to.

      This is explained on page 16 of the original manuscript, in the description of SegPore’s workflow, as follows:

      “A key component of SegPore is the 5mer parameter table, which specifies the mean and standard deviation for each 5mer in both modified and unmodified states (Fig. 2A1A). Since the peaks (representing modified and unmodified states) are separable for only a subset of 5mers, SegPore can provide modification parameters for these specific 5mers. For other 5mers, modification state predictions are unavailable.”

      We selected a small set of 5mers that show clearly separated peaks (modified and unmodified) in the GMM during the m6A site-level data analysis. These 5mers are listed in Supplementary Fig. S2C, as explained in the section “m6A site level benchmark” of the Materials and Methods (page 12 in the original manuscript).

      “…transcript locations into genomic coordinates. It is important to note that the 5mer parameter table was not re-estimated for the test data. Instead, modification states for each read were directly estimated using the fixed 5mer parameter table. Due to the differences between human (Supplementary Fig. S2A) and mouse (Supplementary Fig. S2B), only six 5mers were found to have m6A annotations in the test data’s ground truth (Supplementary Fig. S2C). For a genomic location to be identified as a true m6A modification site, it had to correspond to one of these six common 5mers and have a read coverage of greater than 20. SegPore derived the ROC and PR curves for benchmarking based on the modification rate at each genomic location….”

      We have updated the sentence as follows to increase clarity:

      “which is limited to the six selected 5mers that exhibit clearly separable modified and unmodified components in the GMM (see Materials and Methods for details).”
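
      The site-level rule quoted above (a location must carry one of the selected 5mers and have read coverage greater than 20, after which a modification rate is computed per location) can be sketched as follows. This is a minimal illustration, not SegPore's implementation: the function name, the input layout, and the placeholder 5mer set are assumptions, and the actual six 5mers are those given in Supplementary Fig. S2C.

```python
from collections import defaultdict

# Placeholder set (assumption): the real six 5mers are listed in
# Supplementary Fig. S2C and are not reproduced here.
SELECTED_5MERS = {"GGACT"}
MIN_COVERAGE = 20  # a site needs strictly more than 20 reads


def site_modification_rates(read_calls, selected=SELECTED_5MERS,
                            min_coverage=MIN_COVERAGE):
    """Compute per-location modification rates from per-read calls.

    read_calls: iterable of (location, fivemer, is_modified) tuples,
    one tuple per read covering that location (hypothetical layout).
    Returns {location: fraction of reads called modified} for locations
    whose 5mer is selected and whose coverage exceeds min_coverage.
    """
    counts = defaultdict(lambda: [0, 0])  # location -> [modified, total]
    for location, fivemer, is_modified in read_calls:
        if fivemer not in selected:
            continue  # no modification parameters for this 5mer
        counts[location][1] += 1
        if is_modified:
            counts[location][0] += 1
    return {
        loc: modified / total
        for loc, (modified, total) in counts.items()
        if total > min_coverage
    }
```

      Thresholding these per-location rates at different cut-offs is what yields ROC and PR curves of the kind described in the quoted Methods text.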

      (28) Page 19, Figure 4C, the blue 'Unmapped' needs further explanation. If this means the segmentation+alignment resulted in simply not assigning any segment to a kmer, this would indicate issues in the resulting mapping between raw data and kmers as the data that probably belonged to this kmer is likely mapped to a neighbouring kmer, possibly introducing a bimodal distribution there.

      This is due to a deletion event in the full alignment algorithm. See page 8 of Supplementary Note 1:

      During the traceback step of the dynamic programming matrix, not every 5mer in the reference sequence is assigned a corresponding raw signal fragment—particularly when the signal’s mean deviates substantially from the expected mean of that 5mer. In such cases, the algorithm considers the segment to be generated by an unknown 5mer, and the corresponding reference 5mer is marked as unmapped.

      (29) Page 19, “For six selected m6A motifs, SegPore achieved an ROC AUC of 82.7% and a PR AUC of 38.7%, earning the third-best performance compared with deep leaning methods m6Anet and CHEUI (Fig. 3D).”

      How was this selection of motifs made, are these related to the six 5mers in the middle of Supplementary Figure S2? Are these the same six as on page 18? This is not clear to me.

      It is the same, see the response to point 27.

      (30) Page 21 “Biclustering reveals that modifications at the 6th, 7th, and 8th genomic locations are specific to certain clusters of reads (clusters 4, 5, and 6), while the first five genomic locations show similar modification patterns across all reads.”

      This reads rather confusingly. Both the '6th, 7th, and 8th genomic locations' and 'clusters 4,5,6' should be referred to in clearer terms. Either mark them in the figure as such or name them in the text by something that directly matches the text in the figure.

      We have added labels to the clusters and genomic locations in Figure 4C, and revised the sentence as follows:

      “Biclustering reveals that modifications at g6 are specific to cluster C4, g7 to cluster C5, and g8 to cluster C6, while the first five genomic locations (g1 to g5) show similar modification patterns across all reads.”

      (31) Page 21, “We developed a segmentation algorithm that leverages the jiggling property in the physical process of DRS, resulting in cleaner current signals for m6A identification at both the site and single-molecule levels.”

      Leverages, or just 'takes into account'?

      We designed our HHMM specifically based on the jiggling hypothesis, so we believe that using the term “leverage” is appropriate.

      (32) Page 21, “Our results show that m6Anet achieves superior performance, driven by SegPore's enhanced segmentation.”

      Superior in what way? It barely improves over Nanopolish in Figure 3C and is outperformed by other methods in Figure 3D. The segmentation may have improved but this statement says something is 'superior' driven by that 'enhanced segmentation', so that cannot refer to the segmentation itself.

      We revised it as follows in the revised manuscript:

      “Our results demonstrate that SegPore’s segmentation enables clear differentiation between m6A-modified and unmodified adenosines.”

      (33) Page 21, “In SegPore, we assume a drastic change between two consecutive 5mers, which may hold for 5mers with large difference in their current baselines but may not hold for those with small difference.”

      The implications of this assumption don't seem highlighted enough in the work itself and may be cause for falsely discovering bi-modal distributions. What happens if such a 5mer isn't properly split, is there no recovery algorithm later on to resolve these cases?

      We agree that there is a risk of misalignment, which can result in a falsely observed bimodal distribution. This is a known and largely unavoidable issue across all methods, including deep neural network–based methods. For example, many of these models rely on a CTC (Connectionist Temporal Classification) layer, which implicitly performs alignment and may also suffer from similar issues.

      Misalignment is more likely when the current baselines of neighboring k-mers are close. In such cases, the model may struggle to confidently distinguish between adjacent k-mers, increasing the chance that signals from neighboring k-mers are incorrectly assigned. Accurate baseline estimation for each k-mer is therefore critical—when baselines are accurate, the correct alignment typically corresponds to the maximum likelihood.

      We have added the following sentence to the discussion to acknowledge this limitation:

      “As with other RNA modification estimation methods, SegPore can be affected by misalignment errors, particularly when the baseline signals of adjacent k-mers are similar. These cases may lead to spurious bimodal signal distributions and require careful interpretation.”

      (34) Page 21, “Currently, SegPore models only the modification state of the central nucleotide within the 5mer. However, modifications at other positions may also affect the signal, as shown in Figure 4B. Therefore, introducing multiple states to the 5mer could help to improve the performance of the model.”

      The meaning of this statement is unclear to me. Is SegPore unable to combine the information of overlapping kmers around a possibly modified base (central nucleotide), or is this referring to having multiple possible modifications in a single kmer (multiple states)?

      We mean there can be modifications at multiple positions of a single 5mer, e.g. C m5C m6A m7G T. We have revised the sentence to:

      “Therefore, introducing multiple states for a 5mer to account for modifications at multiple positions within the same 5mer could help to improve the performance of the model.”

      (35) Page 22, “This causes a problem when apply DNN-based methods to new dataset without short read sequencing-based ground truth. Human could not confidently judge if a predicted m6A modification is a real m6A modification.”

      Grammatical errors in both these sentences. For the 'Human could not' part, is this referring to a single person's attempt or more extensively tested?

      Thanks for the comment. We have revised the sentence as follows:

      “This poses a challenge when applying DNN-based methods to new datasets without short-read sequencing-based ground truth. In such cases, it is difficult for researchers to confidently determine whether a predicted m6A modification is genuine (see Supplementary Figure S5).”

      (36) Page 22, “…which is easier for human to interpret if a predicted m6A site is real.”

      "a" human, but also this probably meant to say 'whether' instead of 'if', or 'makes it easier'.

      Thanks for the advice. We have revised the sentence as follows:

      “One can generally observe a clear difference in the intensity levels between 5mers with an m6A and those with a normal adenosine, which makes it easier for a researcher to interpret whether a predicted m6A site is genuine.”

      (37) Page 22, “…and noise reduction through its GMM-based approach…”

      Is the GMM providing noise reduction or segmentation?

      Yes, we agree that it is not relevant. We have removed the sentence in the revised manuscript as follows:

      “Although SegPore provides clear interpretability and noise reduction through its GMM-based approach, there is potential to explore DNN-based models that can directly leverage SegPore's segmentation results.”

      (38) Page 23, “SegPore effectively reduces noise in the raw signal, leading to improved m6A identification at both site and single-molecule levels…”

      Without further explanation in what sense this is meant, 'reduces noise' seems to overreach the abilities, and looks more like 'masking out'.

      Following the reviewer’s suggestion, we changed it to “mask out” in the revised manuscript:

      “SegPore effectively masks out noise in the raw signal, leading to improved m6A identification at both site and single-molecule levels.”

      Reviewer #3 (Recommendations for the authors):

      I recommend the publication of this manuscript, provided that the following comments (and the comments above) are addressed.

      In general, the authors state that SegPore represents an improvement on existing software. These statements are largely unquantified, which erodes their credibility. I have specified several of these in the Minor comments section.

      Page 5, Preprocessing: The authors comment that the poly(A) tail provides a stable reference that is crucial for the normalisation of all reads. How would this step handle reads that have variable poly(A) tail lengths? Or have interrupted poly(A) tails (e.g. in the case of mRNA vaccines that employ a linker sequence)?

      We apologize for the confusion. The poly(A) tail–based normalization is explained in Supplementary Note 1, Section 3.

      As shown in Author response image 1 below, the poly(A) tail produces a characteristic signal pattern—a relatively flat, squiggly horizontal line. Due to variability between nanopores, raw current signals often exhibit baseline shifts and scaling of standard deviations. This means that the signal may be shifted up or down along the y-axis and stretched or compressed in scale.

      Author response image 1.

      The normalization remains robust with variable poly(A) tail lengths, as long as the poly(A) region is sufficiently long. The linker sequence will be assigned to the adapter part rather than the poly(A) part.

      To improve clarity in the revised manuscript, we have added the following explanation:

      “Due to inherent variability between nanopores in the sequencing device, the baseline levels and standard deviations of k-mer signals can differ across reads, even for the same transcript. To standardize the signal for downstream analyses, we extract the raw current signal segments corresponding to the poly(A) tail of each read. Since the poly(A) tail provides a stable reference, we normalize the raw current signals across reads, ensuring that the mean and standard deviation of the poly(A) tail are consistent across all reads. This step is crucial for reducing…..”

      We chose to use the poly(A) tail for normalization because it is sequence-invariant—i.e., all poly(A) tails consist of identical k-mers, unlike transcript sequences which vary in composition. In contrast, using the transcript region for normalization can introduce biases: for instance, reads with more diverse k-mers (having inherently broader signal distributions) would be forced to match the variance of reads with more uniform k-mers, potentially distorting the baseline across k-mers.
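
      The normalization described above can be sketched as a per-read shift and rescale that brings the poly(A) segment to a common mean and standard deviation. This is a minimal sketch, not SegPore's code: the function name, the index-based poly(A) boundaries, and the target values of 0 and 1 are assumptions chosen for illustration.

```python
from statistics import fmean, pstdev


def normalize_by_polya(signal, polya_start, polya_end,
                       target_mean=0.0, target_std=1.0):
    """Shift and scale one read so its poly(A) segment matches the targets.

    signal: list of raw current values for a single read.
    polya_start, polya_end: slice bounds of the poly(A) segment
    (assumed to be known from an upstream step).
    target_mean, target_std: common reference values (assumptions).
    """
    polya = signal[polya_start:polya_end]
    if len(polya) < 2:
        raise ValueError("poly(A) region is too short to estimate statistics")
    mu = fmean(polya)
    sd = pstdev(polya)
    if sd == 0:
        raise ValueError("poly(A) region has zero variance")
    # The same affine transform is applied to the whole read, so the
    # transcript region inherits the correction estimated on the poly(A) tail.
    return [(x - mu) / sd * target_std + target_mean for x in signal]
```

      Because the transform is estimated only on the poly(A) segment, two reads that differ by a nanopore-specific offset and scale map onto the same normalized signal, regardless of the k-mer composition of the transcript region.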

      Page 7, 5mer parameter table: r9.4_180mv_70bps_5mer_RNA is an older kmer model (>2 years). How does your method perform with the newer RNA kmer models that do permit the detection of multiple ribonucleotide modifications? Addressing this comment is crucial because it is feasible that SegPore will underperform in comparison to the newer RNA base caller models (requiring the use of RNA004 datasets).

      Thank you for highlighting this important point. For RNA004, we have updated SegPore to ensure compatibility with the latest kit. In our revised manuscript, we demonstrate that the translocation-based segmentation hypothesis remains valid for RNA004, as supported by new analyses presented in Supplementary Figure S4.

      Additionally, we performed a new benchmark against f5c and Uncalled4 on RNA004 data in the revised manuscript (Table 2), where SegPore exhibits better performance than both f5c and Uncalled4.

      We agree that benchmarking against the latest Dorado models—specifically rna004_130bps_hac@v5.1.0 and rna004_130bps_sup@v5.1.0, which include built-in modification detection capabilities—would provide valuable context for evaluating the utility of SegPore. However, generating a comprehensive k-mer parameter table for RNA004 requires a large, well-characterized dataset. At present, such data are limited in the public domain. Additionally, Dorado is developed by ONT and its internal training data have not been released, making direct comparisons difficult.

      Our current focus is on improving raw signal segmentation quality, an upstream task critical to many downstream analyses, including RNA modification detection. Future work may include benchmarking SegPore against models like Dorado once appropriate data become available.

      The Methods and Results sections contain redundant information - please streamline the information in these sections and reduce the redundancy. For example, the benchmarking section may be better situated in the Results section.

      Following your advice, we have removed the redundant text about the segmentation benchmark from Materials and Methods in the revised manuscript.

      Minor comments

      (1) Introduction

      Page 3: "By incorporating these dynamics into its segmentation algorithm...". Please provide an example of how motor protein dynamics can impact RNA translocation. In particular, please elaborate on why motor protein dynamics would impact the translocation of modified ribonucleotides differently to canonical ribonucleotides. This is provided in the results, but please also include details in the Introduction.

      Following your advice, we added a passage to the revised manuscript explaining how the motor protein affects the translocation of the DNA/RNA molecule:

      “This observation is also supported by previous reports, in which the helicase (the motor protein) translocates the DNA strand through the nanopore in a back-and-forth manner. Depending on ATP or ADP binding, the motor protein may translocate the DNA/RNA forward or backward by 0.5-1 nucleotides.”

      As far as we understand, this translocation mechanism is not specific to modified or unmodified nucleotides. For further details, we refer the reviewer to the original studies cited.

      Page 3: "This lack of interpretability can be problematic when applying these methods to new datasets, as researchers may struggle to trust the predictions without a clear understanding of how the results were generated." Please provide details and citations as to why researchers would struggle to trust the predictions of m6Anet. Is it due to a lack of understanding of how the method works, or an empirically demonstrated lack of reliability?

      Thank you for pointing this out. The lack of interpretability in deep learning models such as m6Anet stems primarily from their “black-box” nature—they provide binary predictions (modified or unmodified) without offering clear reasoning or evidence for each call.

      When we examined the corresponding raw signals, we found it difficult to visually distinguish whether a signal segment originated from a modified or unmodified ribonucleotide. The difference is often too subtle to be judged reliably by a human observer. This is illustrated in the newly added Supplementary Figure S5, which shows Nanopolish-aligned raw signals for the central 5mer GGACT in Figure 4B, displayed both uncolored and colored by modification state (according to the ground truth).

      Although deep neural networks can learn subtle, high-dimensional patterns in the signal that may not be readily interpretable, this opacity makes it difficult for researchers to trust the predictions—especially in new datasets where no ground truth is available. The issue is not necessarily an empirically demonstrated lack of reliability, but rather a lack of transparency and interpretability.

      We have updated the manuscript accordingly and included Supplementary Figure S5 to illustrate the difficulty in interpreting signal differences between modified and unmodified states.

      Page 3: "Instead of relying on complex, opaque features...". Please provide evidence that the research community finds the figures generated by m6Anet to be difficult to interpret, or delete the sections relating to its perceived lack of usability.

      See the figure provided in the response to the previous point. We added a reference to this figure in the revised manuscript.

      “Instead of relying on complex, opaque features (see Supplementary Figure S5), SegPore leverages baseline current levels to distinguish between…..”

      (2) Materials and Methods

      Page 5, Preprocessing: "We begin by performing basecalling on the input fast5 file using Guppy, which converts the raw signal data into base sequences.". Please change "base" to ribonucleotide.

      Revised as requested.

      Page 5 and throughout, please refer to poly(A) tail, rather than polyA tail throughout.

      Revised as requested.

      Page 5, Signal segmentation via hierarchical Hidden Markov model: "...providing more precise estimates of the mean and variance for each base block, which are crucial for downstream analyses such as RNA modification prediction." Please specify which method your HHMM method improves upon.

      Thank you for the suggestion. Since this section does not include a direct comparison, we revised the sentence to avoid unsupported claims. The updated sentence now reads:

      "...providing more precise estimates of the mean and variance for each base block, which are crucial for downstream analyses such as RNA modification prediction."

      Page 10, GMM for 5mer parameter table re-estimation: "Typically, the process is repeated three to five times until the 5mer parameter table stabilizes." How is the stabilisation of the 5mer parameter table quantified? What is a reasonable cut-off that would demonstrate adequate stabilisation of the 5mer parameter table?

      Thank you for the comment. We assess the stabilization of the 5mer parameter table by monitoring the change in baseline values across iterations. If the absolute change in baseline values for all 5mers is less than 1e-5 between two consecutive iterations, we consider the estimation to have stabilized.
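
      The stopping rule stated above (every 5mer baseline changes by less than 1e-5 in absolute value between two consecutive iterations) can be sketched as follows. This is a minimal sketch under assumptions: the dictionary layout, the function names, the `update` callback standing in for one re-estimation round, and the `max_iter` safeguard are all hypothetical and not taken from SegPore.

```python
def has_stabilized(previous, current, tol=1e-5):
    """Return True if every 5mer baseline moved by less than tol.

    previous, current: dicts mapping 5mer -> baseline mean (assumed layout).
    """
    return all(abs(current[kmer] - previous[kmer]) < tol for kmer in current)


def iterate_until_stable(table, update, tol=1e-5, max_iter=50):
    """Repeat a re-estimation step until the baselines stabilize.

    update: callable taking a table and returning the re-estimated table
    (a stand-in for one segmentation-plus-GMM round).
    Returns (final_table, number_of_iterations_run).
    """
    for iteration in range(1, max_iter + 1):
        new_table = update(table)
        if has_stabilized(table, new_table, tol):
            return new_table, iteration
        table = new_table
    return table, max_iter
```

      Checking the maximum change across all 5mers, rather than an average, means a single drifting 5mer is enough to trigger another iteration.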

      Page 11, M6A site level benchmark: why were these datasets selected? Specifically, why compare human and mouse ribonuclotide modification profiles? Please provide a justification and a brief description of the experiments that these data were derived from, and why they are appropriate for benchmarking SegPore.

      Thank you for the comment. The datasets used were taken from a previous benchmark study on m6A estimation using RNA002 data (https://doi.org/10.1038/s41467-023-37596-5). These datasets include human and mouse transcriptomes and have been widely used to evaluate the performance of RNA modification detection tools. We selected them because (i) they are based on RNA002 chemistry, which matches the primary focus of our study, and (ii) they provide a well-characterized and consistent benchmark for assessing m6A detection performance. Therefore, we believe they are appropriate for validating SegPore.

      (3) Results

      Page 13, RNA translocation hypothesis: "The raw current signals, as shown in Fig. 1B...". Please check/correct figure reference - Figure 1B does not show raw current signals.

      Thank you for pointing this out. The correct reference should be Figure 2B. We have updated the figure citation accordingly in the revised manuscript.

      Page 19, m6A identification at the site level: "For six selected m6A motifs, SegPore achieved an ROC AUC of 82.7% and a PR AUC of 38.7%, earning the third best performance compared with deep leaning methods m6Anet and CHEUI (Fig. 3D)." SegPore performs third best of all deep learning methods. Do the authors recommend its use in conjunction with m6Anet for m6A detection? Please clarify in the text.

      This sentence aims to convey that SegPore alone can already achieve good performance. If interpretability is the primary goal, we recommend using SegPore on its own. However, if the objective is to identify more potential m6A sites, we suggest using the combined approach of SegPore and m6Anet. That said, we have chosen not to make explicit recommendations in the main text to avoid oversimplifying the decision or potentially misleading readers.

      Page 19, m6A identification at the single molecule level: "one transcribed with m6A and the other with normal adenosine". I assume that this should be adenine? Please replace adenosine with adenine throughout.

      Thank you for pointing this out. We have revised the sentence to use "adenine" where appropriate. In other instances, we retain "adenosine" when referring specifically to adenine bound to a ribose sugar, which we believe is suitable in those contexts.

      Page 19, m6A identification at the single molecule level: "We used 60% of the data for training and 40% for testing". How many reads were used for training and how many for testing? Please comment on why these are appropriate sizes for training and testing datasets.

      In total, there are 1.9 million reads, with 1.14 million used for training and 0.76 million for testing (60% and 40%, respectively). We chose this split to ensure that the training set is sufficiently large to reliably estimate model parameters, while the test set remains substantial enough to robustly evaluate model performance. Although the ratio was selected somewhat arbitrarily, it balances the need for effective training with rigorous validation.

      (4) Discussion

      Page 21: "We believe that the de-noised current signals will be beneficial for other downstream tasks." Which tasks? Please list an example.

      We have revised the text for clarity as follows:

      “We believe that the de-noised current signals will be beneficial for other downstream tasks, such as the estimation of m5C, pseudouridine, and other RNA modifications.”

      Page 22: "One can generally observe a clear difference in the intensity levels between 5mers with a m6A and normal adenosine, which is easier for human to interpret if a predicted m6A site is real." This statement is vague and requires qualification. Please reference a study that demonstrates the human ability to interpret two similar graphs, and demonstrate how it relates to the differences observed in your data.

      We apologize for the confusion. We have revised the sentence as follows:

      “One can generally observe a clear difference in the intensity levels between 5mers with an m6A and those with a normal adenosine, which makes it easier for a researcher to interpret whether a predicted m6A site is genuine.”

      We believe that Figures 3A, 3B, and 4B effectively illustrate this concept.

      Page 23: How long does SegPore take for its analyses compared to other similar tools? How long would it take to analyse a typical dataset?

      We have added run-time statistics for datasets of varying sizes in the revised manuscript (see Supplementary Figure S6). This figure illustrates SegPore’s performance across different data volumes to help estimate typical processing times.

      (5) Figures

      Figure 4C. Please number the hierarchical clusters and genomic locations in this figure. They are referenced in the text.

      Following your suggestion, we have labeled the hierarchical clusters and genomic locations in Figure 4C in the revised manuscript.

      In addition, we revised the corresponding sentence in the main text as follows: “Biclustering reveals that modifications at g6 are specific to cluster C4, g7 to cluster C5, and g8 to cluster C6, while the first five genomic locations (g1 to g5) show similar modification patterns across all reads.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Recommendations for the Authors:

      Reviewer #1:

      We think that this manuscript brings an important contribution that will be of interest in the areas of statistical physics, (microbiota) ecology, and (biological) data science. The evidence of their results is solid and the work improves the state-of-the-art in terms of methods. We have a few concerns that, in our opinion, the authors should address.

      Major concerns:

      (1) While the paper could be of interest for the broad audience of e-Life, the way it is written is accessible mainly to physicists. We encourage the authors to take the broad audience into account by i) explaining better the essence of what is being done at each step, ii) highlighting the relevance of the method compared to other methods, iii) discussing the ecological implications of the results.

      Examples on how to approach i) include: Modify or expand Figure 1 so that non-familiar readers can understand the summary of the work (e.g. with cartoons representing communities, diseased states and bacterial interactions and their relationship with the inference method); in each section, summarize at the beginning the purpose of what is going to be addressed in this section, and summarize at the end what the section has achieved; in Figure 2, replace symbols by their meaning as much as possible-the same for Figure 1, at the very least in the figure caption.

      Example on how to approach ii): Since the authors aim to establish a bridge between disordered systems and microbiome ecology, it could be useful to expand a bit the introduction on disordered systems for biologists/biophysicists. This could be done with an additional text box, which could also highlight the advantages of this approach in comparison to other techniques (e.g. model-free approaches can also classify healthy and diseased states).

      Example on how to approach iii): The authors could discuss with more depth the ecological implications of their results. For example, do they have a hypothesis on why demographic and neutral effects could dominate in healthy patients?

      We thank the reviewer for the observations. Following the suggestion, in the revised version each section now begins by outlining its goal and ends by summarizing what has been achieved. We also updated Figure 1 and Figure 2.

      (i) For Figure 1, we expanded the figure and, we hope, made clearer how we conceptualize the problem, use the data, and establish our method. In Figure 2, we enriched the y-axis labels of each panel with the name of the corresponding order parameter.

      (ii) We thank the reviewer for helping us improve the readability of the introductory part, thus providing more insights into disordered systems techniques for a broader audience. We have added a few explanations at the end of page 2 to explain the advantages of this methodology compared to other strategies and models.

      (iii) We thank the reviewer for raising the need for a more in-depth ecological discussion of our results. A simple way to understand why neutral effects may dominate in healthy patients is the following. Neutrality implies that species differences are mainly shaped by stochastic processes such as demographic noise, with species treated as different realizations of the same underlying stochastic ecological dynamics. In our analysis, we observe that healthy individuals tend to exhibit highly similar microbial communities, suggesting that the compositional variability among their microbiomes is compatible, at least in part, with the fluctuations expected from demographic stochasticity alone. In contrast, patients with the disease display significantly more heterogeneous microbial compositions. The diversity and structure of their gut communities cannot be satisfactorily explained by neutral demographic fluctuations alone.

      This discrepancy implies that additional deterministic forces—such as altered ecological interactions—are driving the divergence observed in dysbiotic states. In diseased individuals, the breakdown of such interactions leads to a structurally distinct regime that may correspond to a phase of marginal stability, as indicated by our theoretical modeling. This shift marks a transition from a community governed by neutrality and demographic noise to one dominated by non-neutral ecological forces (as depicted in Figure 4). We added these comments in the discussion section of the revised manuscript.

      (2) Taking into account the broader audience, we invite the authors to edit the abstract, as it seems to jump from one ecological concept to another without explicitly communicating what is the link between these concepts. From the first two sentences, the motivation seems to be species diversity, but no mention of diversity comes after the second sentence. There is no proper introduction/definition of what macroecological states are. After that, the authors switch to healthy and unhealthy states, without previously introducing any link between gut microbiota states and the host’s health (which perhaps could be good in the first or second sentence, although other framings can be as valid). After that, interactions appear in the text and are related to instability, but the reader might not know whether this is surprising or if healthy/unhealthy states are generally related to stability.

      We pointed out a few examples, but the authors could extend their revision on i), ii) and iii) beyond such specific comments. In our opinion, this would really benefit the paper.

      In response to the reviewer’s concern about conceptual clarity and structure, we substantially revised the abstract to improve its accessibility and logical flow. The revised abstract links species diversity to microbiome structure and function from the outset, addressing the initial confusion. We provide a concise definition of “macroecological states,” framing them as reproducible statistical patterns reflecting community-level properties. The revised version also connects gut microbiome states to host health earlier, resolving the previous abrupt shift in focus. Finally, we conclude by highlighting how disordered systems theory advances our understanding of microbiome stability and functioning, reinforcing the novelty and broader significance of our approach. Overall, the revised abstract better serves a broad interdisciplinary audience, including readers unfamiliar with the technicalities of disordered systems or microbial ecology, while preserving the scientific depth and accuracy of our work.

      (3) The connection with consumer-resource (CR) models is quite unusual. In Equation (12), why do the authors assume that the consumption term does not depend on R? This should be addressed, since this term is usually dependent on R in microbial ecology models.

      In case this is helpful, it is known that the symmetric Lotka-Volterra model emerges from time-scale separation in the MacArthur model, where resources reproduce logistically and are consumed by other species (e.g., plants eaten by herbivores). Consumer-resource models form a broad category, while the MacArthur model is a specific case featuring logistic resource growth. For microbes, a more meaningful justification of the generalized Lotka-Volterra (GLV) model from a consumer-resource perspective involves the consumer-resource dynamics in a chemostat, where time-scale separation is assumed and higher-order interactions are neglected. See, for example: a) The classic paper by MacArthur: R. MacArthur. Species packing and competitive equilibrium for many species. Theoretical Population Biology, 1(1):1-11, 1970. b) Recent works on time-scale separation in chemostat consumer-resource models: Anna Posfai et al., PRL, 2017 Sireci et al., PNAS, 2023 Akshit Goyal et al., PRX-Life, 2025

      We thank the reviewer for the observation. We apologize for the typo in the main text, which we have now corrected. The consumer-resource model we had in mind is the classical case proposed by MacArthur, where resources are self-regulated according to a logistic growth mechanism, which leads to the generalized Lotka-Volterra model we employ in our work.
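
      The reduction mentioned above can be sketched as follows. This is the textbook time-scale-separation argument, not a transcription of the manuscript's Equation (12); all symbols (N_i, R_a, c_{ia}, w_a, m_i, r_a, K_a, A_{ij}) are notation assumed here for illustration, and the steady-state step holds only while every R_a^* stays positive.

```latex
% MacArthur consumer-resource dynamics with logistic resource growth
\begin{align}
\frac{dN_i}{dt} &= N_i\Big(\sum_a w_a c_{ia} R_a - m_i\Big), \\
\frac{dR_a}{dt} &= R_a\Big[r_a\Big(1-\frac{R_a}{K_a}\Big) - \sum_j c_{ja} N_j\Big].
\end{align}
% Fast resources: set dR_a/dt = 0 with R_a > 0
\begin{equation}
R_a^{*} = K_a\Big(1 - \frac{1}{r_a}\sum_j c_{ja} N_j\Big).
\end{equation}
% Substituting R_a^* gives a Lotka-Volterra form with symmetric interactions
\begin{equation}
\frac{dN_i}{dt} = N_i\Big(\sum_a w_a c_{ia} K_a - m_i - \sum_j A_{ij} N_j\Big),
\qquad
A_{ij} = \sum_a \frac{w_a K_a}{r_a}\, c_{ia} c_{ja} = A_{ji}.
\end{equation}
```

      The symmetry of A_{ij} relies on the assumption that the weights w_a depend only on the resource, not on the consumer.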

      Minor concerns:

      (1) The title has a nice pun for statistical physicists, but we wonder if it can be a bit confusing for the broader audience of eLife. Although we leave this to the authors’ decision, we’d recommend considering changing the title, making it more explicit in communicating the main contribution/result of the work.

      Following the reviewer’s suggestion, we have introduced an explanatory subtitle: “Linking Species Interactions to Dysbiosis through a Disordered Lotka-Volterra Framework”.

      (2) Review the references - some preprints might have already been published: Pasqualini J. 2023, Sireci 2022, Wu 2021.

      We thank the reviewer for drawing our attention to this inaccuracy. We updated the references to the Pasqualini and Sireci papers. To our knowledge, Wu’s paper has appeared as an arXiv preprint only.

      (3) Species do not generally exhibit identical carrying capacities (see Grilli, Nat. Commun., 2020); some taxa are generally more abundant than others. The authors could discuss whether the model, with the inferred parameters, can accurately reproduce the distribution of species’ mean abundances.

      We thank the reviewer for this insightful comment. As discussed in the revised manuscript (lines 294–299), our current model does not accurately reproduce the empirical species abundance distribution (SAD). This limitation stems from the assumption of constant carrying capacities across species. Empirical observations (e.g., Grilli, Nat. Commun., 2020 [1]) show heterogeneous mean abundances, often following power-law or log-normal distributions, whereas our model assumes a constant carrying capacity, resulting in SADs devoid of fat tails that diverge from empirical data.

      This simplification is implemented to maintain the analytical tractability of the disordered generalized Lotka-Volterra (dGLV) framework, a common approach also found in prior works such as Bunin (2017) and Barbier et al. (2018) [2, 3]. Introducing heterogeneity in carrying capacities, such as drawing them from a log-normal distribution, or switching to multiplicative (rather than demographic) noise, could indeed produce SADs that better align with empirical data. Nevertheless, implementing these changes would significantly complicate the analytical treatment.

      We acknowledge these directions as promising avenues for future research. They could help enhance the empirical realism of the model and its capacity to capture observed macroecological patterns while posing new theoretical challenges for disordered systems analysis.
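
      As a rough numerical illustration of the point about carrying capacities, the sketch below integrates a Lotka-Volterra system with symmetric random interactions, demographic noise, and a small immigration term, so that constant and log-normal carrying capacities can be compared. It is a minimal Euler-Maruyama sketch under assumed conventions (interaction mean mu/S, standard deviation sigma/sqrt(S), noise amplitude sqrt(2 T N)); the function name, parameter values, and the clipping at zero are hypothetical choices, not the authors' inference pipeline.

```python
import numpy as np

def simulate_dglv(K, mu=1.0, sigma=0.3, T=0.01, lam=1e-3,
                  dt=0.01, steps=5000, seed=0):
    """Euler-Maruyama integration of a disordered Lotka-Volterra model.

    K is the vector of carrying capacities (hypothetical input);
    returns the abundances at the final time step.
    """
    rng = np.random.default_rng(seed)
    K = np.asarray(K, dtype=float)
    S = K.size
    # Symmetric Gaussian couplings: mean mu/S, std sigma/sqrt(S) (assumed scaling)
    g = rng.normal(size=(S, S))
    sym = (g + g.T) / np.sqrt(2.0)          # off-diagonal entries have unit variance
    alpha = mu / S + sigma / np.sqrt(S) * sym
    np.fill_diagonal(alpha, 0.0)            # no extra self-interaction
    N = np.full(S, 0.1)
    for _ in range(steps):
        drift = N * (K - N - alpha @ N) + lam
        noise = np.sqrt(2.0 * T * N * dt) * rng.normal(size=S)
        N = np.maximum(N + drift * dt + noise, 0.0)  # keep abundances non-negative
    return N

# Constant versus heterogeneous (log-normal) carrying capacities
S = 50
flat = simulate_dglv(np.ones(S))
hetero = simulate_dglv(np.random.default_rng(1).lognormal(size=S))
```

      Comparing histograms of `flat` and `hetero` gives a quick, qualitative feel for how heterogeneous carrying capacities broaden the abundance distribution; it does not replace the analytical treatment discussed above.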

      (4) A substantial number of cited works (Grilli, Nat. Commun., 2020; Zaoli & Grilli, Science Advances, 2021; Sireci et al., PNAS, 2023; Po-Yi Ho et al., eLife, 2022) suggest that environmental fluctuations play a crucial role in shaping microbiome composition and dynamics. Is the authors’ analysis consistent with this perspective? Do they expect their conclusions to remain robust if environmental fluctuations are introduced?

      We thank the reviewer for stressing this point. The introduction of environmental fluctuations in the model formally violates detailed balance, thereby preventing the definition of an energy function. To date, no study has integrated random interactions together with both demographic and environmental noise within a unified analytical framework. This is certainly a highly promising direction that some of the authors are already exploring. However, given the inherently out-of-equilibrium nature of the system and the absence of a free energy, we would need to adopt a Dynamical Mean-Field Theory formalism and eventually analyze the corresponding stationary equations to be solved self-consistently. We added, however, a brief note in the Discussion section.

      (5) The term “order parameters“ may not be intuitive for a biological audience. In any case, the authors should explicitly define each order parameter when first introduced.

      We thank the reviewer for the comment. We now name each order parameter where it first appears, along with a brief explanation of its meaning that should be accessible to an audience with a biological background.

      (6) Line 242: Should ψU be ψD?

      We thank the reviewer for the observation. We corrected the typo.

      (7) Given that the authors are discussing healthy and diseased states and to avoid confusion, the authors could perhaps use another word for ’pathological’ when they refer to dynamical regimes (e.g., in Appendix 2: ’letting the system enter the pathological regime of unbounded growth’).

      We thank the reviewer for the helpful comment. As suggested, we used the term “unphysical” instead of “pathological” where needed.

      Reviewer #2:

      (1) A technical point that I could not understand is how the authors deal with compositional data. One reason for my confusion is that the order parameters h and q0 are fixed in the data to 1/S and 1/S², and thus I do not see how they can be informative. Same for carrying capacity, why is it not 1 if considering relative abundance?

      We thank the reviewer for raising this point. We acknowledge that the treatment of compositional data and the interpretation of order parameters h and q0 were not sufficiently clarified in the manuscript. Additionally, there was an imprecision in the text regarding the interpretation of these parameters.

      As defined in revised Eq. (4) of the manuscript, h and q0 are to be averaged over the entire dataset, summing across samples α. Specifically, each is evaluated within sample α using S<sub>α</sub>, the number of species present in that sample, and then averaged over samples. These parameters are therefore informative, as they encapsulate sample-level ecological diversity, and their variation reflects biological differences between healthy and diseased states. For instance, Pasqualini et al., 2024 [4] reported significant differences in these metrics between health conditions, thereby supporting their ecological relevance.

      Regarding carrying capacities, we clarify that although we work with relative abundance data (i.e., compositional data), we do not fix the carrying capacity K to 1. Instead, we set K to the maximum value of xi (relative abundance) within each sample, to preserve compatibility with empirical data and allow for coexistence. While this remains a modeling assumption, it ensures better ecological realism within the constraints of the disordered GLV framework.
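
      The logic of the two paragraphs above can be sketched in a few lines. The snippet assumes, following the reviewer's reading, that the per-sample values are 1/S_α and 1/S_α², so that the dataset-level h and q0 vary only through the richness S_α; the manuscript's Eq. (4) may define them differently. The function name and the toy count matrix are hypothetical.

```python
import numpy as np

def order_parameters(counts):
    """Sample-averaged order parameters from a (samples x species) count matrix.

    Assumes per-sample values 1/S_alpha and 1/S_alpha**2 (an assumption),
    where S_alpha is the number of species present in sample alpha.
    """
    counts = np.asarray(counts, dtype=float)
    rel = counts / counts.sum(axis=1, keepdims=True)  # relative abundances
    richness = (rel > 0).sum(axis=1)                   # S_alpha per sample
    h = np.mean(1.0 / richness)                        # average over samples
    q0 = np.mean(1.0 / richness**2)
    K = rel.max(axis=1)                                # per-sample carrying capacity
    return h, q0, K

# Toy data: two samples with different richness
h, q0, K = order_parameters([[1, 1, 0, 0],
                             [1, 1, 1, 1]])
```

      Because S_α differs between samples, the averages are not fixed by compositionality alone, and two cohorts with different richness distributions yield different h and q0.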

      (2) Obviously I’m missing something, so it would be nice to clarify in simple terms the logic of the argument. I understand that Lagrange multipliers are going to be used in the model analysis, and there are a lot of technical arguments presented in the paper, but I would like a much more intuitive explanation about the way the data can be used to infer order parameters if those are fixed by definition in compositional data.

      We thank the reviewer for the observation. The order parameters can be measured directly from the data, even in the presence of compositionality, as explained above. We can connect those parameters with the theory even for compositional data, because the only effect of adding the compositionality constraint is to shift the linear coefficient in the Hamiltonian, which corresponds to shifting the average interaction µ. However, the resulting phase diagram is mostly affected by the variance of the interactions σ² (as µ is such that we are in the bounded phase).

      (3) Another point that I did not understand comes from the fact that the authors claim that interaction variance is smaller in unhealthy microbiomes. Yet they also find that those are closer to instability, and are more driven by niche processes. I would have expected the opposite to be true, more variance in the interactions leading to instability (as in May’s original paper for instance). Is this apparent paradox explained by covariations in demographic stochasticity (T) and immigration rate (lambda)? If so, I think it would be very useful to comment on that.

      As Altieri and coworkers showed in their PRL (2021) [5], the phase diagram of our model differs fundamentally from that of Biroli et al. (2018) [6]. In the latter, the intuitive rule – greater interaction variance yields greater instability – indeed holds. For the sake of clarity, we have attached below the resulting phase diagram obtained by Altieri et al.

      The apparent paradox arises because the two phase diagrams are tuned by different parameters. Consequently, even at low temperature and with weak interaction variance, our system may sit nearer to the replica-symmetry-breaking (RSB) line.

      Fig. 3 in the main text is not a (σ,T) phase diagram where all other parameters are kept constant. Rather, it is a plot of the σ and T parameters inferred from the data (without showing the corresponding µ).

      To capture the full, non-trivial influence of all parameters on stability, we studied the so-called “replicon eigenvalue” in the RS (i.e. single equilibrium) approximation. This leading eigenvalue measures how close a given set of inferred parameters – and hence a microbiome – is to the RSB threshold. For a visual representation of these findings, refer to Figure 4.

      Author response image 1.

      (4) What do the empirical SAD look like? It would be nice to see the actual data and how the theoretical SADs compare.

      The empirical species abundance distributions (SADs) analyzed in our study are presented and discussed in detail in Pasqualini et al., 2024 [4]. Given the overlap in content, we chose not to reproduce these figures in the current manuscript to avoid redundancy.

      As we also clarify in the revised text, the theoretical SADs derived from the disordered generalized Lotka-Volterra (dGLV) model in the unique-fixed-point phase typically exhibit exponential tails. These distributions do not match the heavier-tailed patterns (e.g., log-normal or power-law-like) observed in empirical microbiome data. This discrepancy stems from the simplifying assumptions of the dGLV framework, including the use of constant carrying capacities and demographic noise.

      In the revised manuscript, we have added a brief discussion to explicitly acknowledge this limitation and emphasize it as a direction for future refinement of the model, such as incorporating heterogeneous carrying capacities or exploring alternative noise structures.

      (5) Some typos: often “niche” is written “nice”.

      We thank the reviewer for this suggestion. After inspecting the text, we corrected the reported typos.

      Reviewer #3:

      Major comments:

      (1) In the S3 text, the authors say that filtered metagenomic reads were processed using the software Kaiju. The description of the pipeline does not mention how core genes were selected, which is often a crucial step in determining the abundance of a species in a metagenomic sample. In addition, the senior author of this manuscript has published a version of Kaiju that leverages marker-gene classification methods (deemed Core-Kaiju; Tovo et al., 2020), but it was not used for either this manuscript or Pasqualini et al. (2024). I am not suggesting that the data necessarily needs to be reprocessed, but it would be useful to know how core genes were chosen in Pasqualini et al. (2024) and why Core-Kaiju was not used.

      Prior to the current manuscript and the PLOS Computational Biology paper by Pasqualini et al. [4], we applied the Core-Kaiju protocol to the same dataset used in both studies. However, this tool was originally developed and validated using general catalogs of culturable organisms, not specifically tuned for gut microbiomes. As a result, we found that in many samples Core-Kaiju retained only very few species (in some samples, the number of identified species was as low as 5–10), undermining the reliability of the analysis. Due to these limitations, we opted to use the standard Kaiju version in our work. We are actively developing an improved version of the Core-Kaiju protocol that will overcome these limitations, and preliminary results (not shown here) indicate that the obtained patterns are robust in this case as well.

      (2) My understanding of Pasqualini et al. (2024) was that diseased patients experienced larger fluctuations in abundance, while in this study, they had smaller fluctuations (Figure 3a). Is this a discrepancy between the two models or is there a more nuanced interpretation?

      We thank the reviewer for the observation. This is only an apparent discrepancy, as the term fluctuation has different meanings in the two contexts. The fluctuations referred to by the reviewer correspond to a parameter of our theory, namely noise in the interactions. Conversely, in Pasqualini et al. σ indicates environmental fluctuations. Nevertheless, there is no conceptual discrepancy in our results: in both studies, unhealthy microbiomes were found to be less stable. Indeed, in this study Fig. 4 shows that unhealthy microbiomes lie closer to the RSB line, a phenomenon that is also associated with enhanced fluctuations.

      (3) Line 38-41: It would be helpful to explicitly state what “interaction patterns” are being referenced here. The final sentence could also be clarified. Do microbiomes “host“ interactions or are they better described as a property (“have”, “harbor”). The word “host” may confuse some readers since it is often used to refer to the human host. I am also not sure what point is being made by “expected to govern natural ones”. There are interactions between members of a microbiome; experimental studies have characterized some of these interactions, which we expect to relate in some way to interactions in nature. Is this what the authors are saying?

      Thanks. We agree that this sentence was not clear. Indeed, we are referring to pairwise species interactions and not to host-microbiome interactions. We have rewritten this part in the following way: In fact, recent work shows that the network-level properties of species-species interactions —for example, the sign balance, average strength, and connectivity of the inferred interaction matrix— shift systematically between healthy and dysbiotic gut communities (see for instance, [7, 8]). Pairwise species interactions have been quantified in simplified in-vitro consortia [9, 10]; we assume that the same classes of interactions also operate—albeit in a more complex form—in the native gut microbiome.

      (4) Line 43: I appreciate that the authors separated neutral vs. logistic models here.

      (5) Lines 51-75: The framing here is well-written and convincing. Network inference is an ongoing, active subject in ecology, and there is an unfortunate focus on inferring every individual interaction because ecologists with biology backgrounds are not trained to think about the problem in the language of statistical physics.

      We thank the reviewer for these positive comments.

      (6) Line 87: Perhaps I’m missing something obvious, but I don’t see how ρi sets the intrinsic timescale of the dynamics when its units are 1/(time*individuals), assuming the dimensions of ri are inverse time.

      We thank the reviewer for the observation. We corrected this phrase in the main text.

      (7) Lines 189-190: “as close as possible to the data” it would aid the reader if you specified the criteria meant by this statement.

      We thank the reviewer for the observation. We removed the sentence, as it introduced some redundancy in our argument. The proposed method is described in detail in the subsequent text.

      (8) Line 198: It would aid the reader if you provided some context for what the T - σ plane represents.

      We thank the referee for the helpful suggestion. We have clarified the mutual role of the demographic noise amplitude and the strength of the random interaction matrix, as theoretically predicted in the PRL (2021) by Altieri and coworkers [5]. Please find an additional paragraph on page 6 of the resubmitted version.

      (9) Line 217: Specifying what is meant by “internal modes“ would aid the typical life science reader.

      We thank the reviewer for the suggestion. Recognizing that referring to “internal modes” to describe the SAD shape in that context might cause confusion, we replaced “internal modes“ with “peaks”.

      (10) Line 219: Some additional justification and clarification are needed here, as some may think of “m“ as being biomass.

      We added a sentence to better explain this concept. “In classical and quantum field theory, the particle-particle interaction embedded in the quadratic term is typically referred to as a mass source. In the context of this study, m captures quadratic fluctuations of species abundances, as also appearing in the expression of the leading eigenvalue of the stability matrix.”

      Minor comments:

      (1) I commend the authors for removing metagenomic reads that mapped to the human genome in the preprocessing stage of their pipeline. This may seem like an obvious pre-processing step, but it is unfortunately not always implemented.

      We thank the referee for pointing out this potential issue. The data used in this work, as well as the bioinformatic workflow used to generate them, have been described in detail in Pasqualini et al., 2024 [4]. As one of the main preprocessing steps, we remove reads mapping to the human genome.

      (2) Line 13: “Bacterial“ excludes archaea, and while you may not have many high-abundance archaea in your human gut data, this sentence does not specify the human gut. Usually, this exclusion is averted via the term “microbial“, though sometimes researchers raise objections to the term when the data does not include fungal members (e.g., all 16S studies).

      We thank the reviewer for this suggestion. To include archaeal organisms, we adopted the term “microbial” instead of “bacterial”.

      (3) Line 18: This manuscript is being submitted under the “Physics of Living Systems” track, but it may be useful to explicitly state in the Abstract that disordered systems are a useful approach for understanding large, complex communities for the benefit of life science researchers coming from a biology background.

      Thank you. We have modified the abstract following this suggestion.

      (4) Line 68: Consider using “adapted“ or something similar instead of “mutated“ if there is no specific reason for that word choice.

      We thank the reviewer for this suggestion, which was implemented in the text.

      (5) Line 111: It would be useful to define annealed and quenched for a general life science audience.

      We thank the reviewer for this suggestion. In the “Results” section, we have opted for “time-dependent disordered interactions” to reach a broader audience and avoid any jargon. Moreover, in the Discussion we added a detailed footnote: “In contrast to the quenched approximation, the annealed version assumes that the random couplings are not fixed but instead fluctuate over time, with their covariance governed by independent Ornstein–Uhlenbeck processes.”
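
      To make the annealed picture concrete, the sketch below evolves a set of independent couplings as Ornstein-Uhlenbeck processes, using the exact one-step update so that the stationary mean and standard deviation are preserved for any time step. It illustrates the general idea only; the function name, the parameters (`mean`, `std`, `tau`, `dt`), and their values are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def ou_step(alpha, mean, std, tau, dt, rng):
    """Advance independent Ornstein-Uhlenbeck couplings by one time step.

    Exact discretization: the stationary distribution is Normal(mean, std**2)
    and correlations decay over the time scale tau.
    """
    decay = np.exp(-dt / tau)
    kick = std * np.sqrt(1.0 - decay**2)
    return mean + (alpha - mean) * decay + kick * rng.normal(size=alpha.shape)

rng = np.random.default_rng(0)
alpha = rng.normal(0.0, 0.5, size=20000)   # start from the stationary law
for _ in range(200):
    alpha = ou_step(alpha, mean=0.0, std=0.5, tau=1.0, dt=0.05, rng=rng)
```

      In the limit of very large `tau` the couplings barely move during a simulation, which recovers the quenched case; small `tau` gives rapidly fluctuating, annealed couplings.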

      (6) Line 124: Likewise for the replicon sector.

      We thank the reviewer for the suggestion. We added a footnote on page 4, after the formula, to highlight the physical intuition behind the introduction of the replicon mode.

      “The replicon eigenvalue refers to a particular type of fluctuation around the saddle-point (mean-field) solution within the replica framework. When the Hessian matrix of the replicated free energy is diagonalized, fluctuations are divided into three sectors: longitudinal, anomalous, and replicon. The replicon mode is the most sensitive to criticality signaling – by its vanishing trend – the emergence of many nearly-degenerate states. It essentially describes how ‘soft’ the system is to microscopic rearrangements in configuration space.”

      (7) Figure 2: It would be helpful to include y-axis labels for each order parameter alongside the mathematical notation.

      We thank the reviewer for this suggestion. The y-axis of Figure 2 now includes, alongside the mathematical symbol, the label of the represented quantity.

      (8) Line 242: Subscript “U” is used to denote “Unhealthy” microbiomes, but “D” is used to denote “Diseased” in Figs. 2 and 3 (perhaps elsewhere as well).

      We thank the reviewer for this observation. After checking the various subscripts in the text, and for consistency with Figures 2 and 3, we homogenized our notation, adopting the subscript “D” for symbols related to the diseased/unhealthy condition.

      (9) Line 283: “not to“ should be “not due to“

      We thank the reviewer for this suggestion. After inspecting the text, we corrected the reported error.

      (10) Equations 23, 34: Extra “=“ on the RHS of the first line.

      We consistently follow the same formatting across all the line breaks in the equations throughout the text.

      We are thus resubmitting our paper, hoping to have satisfactorily addressed all referees’ concerns.

      References

      (1) Jacopo Grilli. Macroecological laws describe variation and diversity in microbial communities. Nature communications, 11(1):4743, 2020.

      (2) Guy Bunin. Ecological communities with Lotka-Volterra dynamics. Physical Review E, 95(4):042414, 2017.

      (3) Matthieu Barbier, Jean-François Arnoldi, Guy Bunin, and Michel Loreau. Generic assembly patterns in complex ecological communities. Proceedings of the National Academy of Sciences, 115(9):2156–2161, 2018.

      (4) Jacopo Pasqualini, Sonia Facchin, Andrea Rinaldo, Amos Maritan, Edoardo Savarino, and Samir Suweis. Emergent ecological patterns and modelling of gut microbiomes in health and in disease. PLOS Computational Biology, 20(9):e1012482, 2024.

      (5) Ada Altieri, Felix Roy, Chiara Cammarota, and Giulio Biroli. Properties of equilibria and glassy phases of the random Lotka-Volterra model with demographic noise. Physical Review Letters, 126(25):258301, 2021.

      (6) Giulio Biroli, Guy Bunin, and Chiara Cammarota. Marginally stable equilibria in critical ecosystems. New Journal of Physics, 20(8):083051, 2018.

      (7) Amir Bashan, Travis E Gibson, Jonathan Friedman, Vincent J Carey, Scott T Weiss, Elizabeth L Hohmann, and Yang-Yu Liu. Universality of human microbial dynamics. Nature, 534(7606):259–262, 2016.

      (8) Marcello Seppi, Jacopo Pasqualini, Sonia Facchin, Edoardo Vincenzo Savarino, and Samir Suweis. Emergent functional organization of gut microbiomes in health and diseases. Biomolecules, 14(1):5, 2023.

      (9) Jared Kehe, Anthony Ortiz, Anthony Kulesa, Jeff Gore, Paul C Blainey, and Jonathan Friedman. Positive interactions are common among culturable bacteria. Science advances, 7(45):eabi7159, 2021.

      (10) Ophelia S Venturelli, Alex V Carr, Garth Fisher, Ryan H Hsu, Rebecca Lau, Benjamin P Bowen, Susan Hromada, Trent Northen, and Adam P Arkin. Deciphering microbial interactions in synthetic human gut microbiome communities. Molecular systems biology, 14(6):e8157, 2018.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors develop a novel method to infer ecologically informative parameters across healthy and diseased states of the gut microbiota, although the method is generalizable to other datasets for species abundances. The authors leverage techniques from the theoretical physics of disordered systems to infer different parameters (mean and standard deviation for the strength of bacterial interspecies interactions, a bacterial immigration rate, and the strength of demographic noise) that describe the statistics of microbiota samples from two groups: one for healthy subjects and another for subjects with chronic inflammation syndromes. To do this, the authors simulate communities with a modified version of the Generalized Lotka-Volterra model and randomly generated interactions, and then use a moment-matching algorithm to find sets of parameters that better reproduce the data for species abundances. They find that these parameters are different for the healthy and diseased microbiota groups. The results suggest, for example, that bacterial interaction strengths, relative to noise and immigration, play a more dominant role in microbiota dynamics in diseased states than in healthy states.

      We think that this manuscript brings an important contribution that will be of interest in the areas of statistical physics, (microbiota) ecology and (biological) data science. The evidence of their results is solid and the work improves the state-of-the-art in terms of methods.

      Strengths:

      • Using a fairly generic ecological model, the method can identify the change in the relative importance of different ecological forces (distribution of interspecies interactions, demographic noise and immigration) in different sample groups. The authors focus on the case of the human gut microbiota, showing that the data are consistent with a higher influence of species interactions (relative to demographic noise and immigration) in a diseased microbiota state than in healthy ones.

      • The method is novel, original and it improves the state-of-the-art methodology for the inference of ecologically-relevant parameters. The analysis provides solid evidence on the conclusions.

      Weaknesses:

      • As a proof of concept for a new inference method, this text maintains a technical focus, which may require some familiarity with statistical physics. Nevertheless, the authors' clear introduction of key mathematical terms and their interpretations, along with a clear discussion of the ecological implications, make the results accessible and easy to follow.
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The study explored the biomechanics of kangaroo hopping across both speed and animal size to try and explain the unique and remarkable energetics of kangaroo locomotion.

      Strengths:

      The study brings kangaroo locomotion biomechanics into the 21st century. It is a remarkably difficult project to accomplish. There is excellent attention to detail, supported by clear writing and figures.

      Weaknesses:

      The authors oversell their findings, but the mystery still persists. 

      The manuscript lacks a big-picture summary with pointers to how one might resolve the big question.

      General Comments

      This is a very impressive tour de force by an all-star collaborative team of researchers. The study represents a tremendous leap forward (pun intended) in terms of our understanding of kangaroo locomotion. Some might wonder why such an unusual species is of much interest. But, in my opinion, the classic study by Dawson and Taylor in 1973 of kangaroos launched the modern era of running biomechanics/energetics and applies to varying degrees to all animals that use bouncing gaits (running, trotting, galloping and of course hopping). The puzzling metabolic energetics findings of Dawson & Taylor (little if any increase in metabolic power despite increasing forward speed) remain a giant unsolved problem in comparative locomotor biomechanics and energetics. It is our "dark matter problem".

      Thank you for the kind words.

      This study is certainly a hop towards solving the problem. But, the title of the paper overpromises and the authors present little attempt to provide an overview of the remaining big issues. 

      We have modified the title to reflect this comment: “Postural adaptations may contribute to the unique locomotor energetics seen in hopping kangaroos”.

      The study clearly shows that the ankle and to a lesser extent the mtp joint are where the action is. They clearly show in great detail by how much and by what means the ankle joint tendons experience increased stress at faster forward speeds.

      Since these were zoo animals, direct measures were not feasible, but the conclusion that the tendons are storing and returning more elastic energy per hop at faster speeds is solid. The conclusion that net muscle work per hop changes little from slow to fast forward speeds is also solid. 

      Doing less muscle work can only be good if one is trying to minimize metabolic energy consumption. However, to achieve greater tendon stresses, there must be greater muscle forces. Unless one is willing to reject the premise of the cost of generating force hypothesis, that is an important issue to confront. Further, the present data support the Kram & Dawson finding of decreased contact times at faster forward speeds. Kram & Taylor and subsequent applications of (and challenges to) their approach support the idea that shorter contact times (tc) require recruiting more expensive muscle fibers and hence greater metabolic costs. Therefore, I think that it is incumbent on the present authors to clarify that this study has still not tied up the metabolic energetics across speed problems and placed a bow atop the package.

      Fortunately, I am confident that the impressive collective brain power that comprises this author list can craft a paragraph or two that summarizes these ideas and points out how the group is now uniquely and enviably poised to explore the problem more using a dynamic SIMM model that incorporates muscle energetics (perhaps à la Umberger et al.). Or perhaps they have other ideas about how they can really solve the problem.

      You have raised important points, thank you for this feedback. We have added a limitations and considerations section to the discussion which highlights that there are still unanswered questions. Line 311-328

      Considerations and limitations

      “First, we believe it is more likely that the changes in moment arms and EMA can be attributed to speed rather than body mass, given the marked changes in joint angles and ankle height observed at faster hopping speeds. However, our sample included a relatively narrow range of body masses (13.7 to 26.6 kg) compared to the potential range (up to 80 kg), limiting our ability to entirely isolate the effects of speed from those of mass. Future work should examine a broader range of body sizes. Second, kangaroos studied here only hopped at relatively slow speeds, which bounds our estimates of EMA and tendon stress to a less critical region. As such, we were unable to assess tendon stress at fast speeds, where increased forces would reduce tendon safety factors closer to failure. A different experimental or modelling approach may be needed, as kangaroos in enclosures seem unwilling to hop faster over force plates. Finally, we did not determine whether the EMA of proximal hindlimb joints (which are more difficult to track via surface motion capture markers) remained constant with speed. Although the hip and knee contribute substantially less work than the ankle joint (Fig. 4), the majority of kangaroo skeletal muscle is located around these proximal joints. A change in EMA at the hip or knee could influence a larger muscle mass than at the ankle, potentially counteracting or enhancing energy savings in the ankle extensor muscle-tendon units. Further research is needed to understand how posture and muscles throughout the whole body contribute to kangaroo energetics.”

      Additionally, we added a line “Peak GRF also naturally increased with speed together with shorter ground contact durations (Fig. 2b, Suppl. Fig 1b)” (line 238) to highlight that we are not proposing that changes in EMA alone explain the full increase in tendon stress. Both GRF and EMA contribute substantially (almost equally) to stress, and we now give more equal discussion to both. For instance, we now also evaluate how much each contributes: “If peak GRF were constant but EMA changed from the average value of a slow hop to a fast hop, then stress would increase 18%, whereas if EMA remained constant and GRF varied by the same principles, then stress would only increase by 12%. Thus, changing posture and decreasing ground contact duration both appear to influence tendon stress for kangaroos, at least for the range of speeds we examined” (Line 245-249)
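
      The counterfactual comparison quoted above can be sketched numerically. The snippet below assumes the simplified relation stress = GRF / (EMA × tendon cross-sectional area), which follows from balancing the GRF moment (GRF × R) against the tendon moment (tendon force × r). The function names and every input value are hypothetical placeholders, chosen only so that the two single-factor effects come out at 18% and 12%; they are not data from the study.

```python
def tendon_stress(grf, ema, area):
    """Tendon stress under the assumed relation: stress = GRF / (EMA * area)."""
    return grf / (ema * area)


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Hypothetical slow-hop and fast-hop values (placeholders, not study data).
slow = {"grf": 1000.0, "ema": 0.30}
fast = {"grf": 1120.0, "ema": 0.30 / 1.18}
area = 1.0e-4  # m^2, placeholder tendon cross-sectional area

baseline = tendon_stress(slow["grf"], slow["ema"], area)

# Counterfactual 1: GRF held at the slow value, EMA takes the fast value.
ema_only = tendon_stress(slow["grf"], fast["ema"], area)
# Counterfactual 2: EMA held at the slow value, GRF takes the fast value.
grf_only = tendon_stress(fast["grf"], slow["ema"], area)
# Both factors change together.
both = tendon_stress(fast["grf"], fast["ema"], area)

ema_effect = percent_change(ema_only, baseline)
grf_effect = percent_change(grf_only, baseline)
combined_effect = percent_change(both, baseline)
```

      Because the two factors multiply under this assumed relation, the combined change is slightly larger than the sum of the two single-factor effects.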

      We have added a paragraph in the discussion acknowledging that the cost of generating force problem is not resolved by our work, concluding that “This mechanism may help explain why hopping macropods do not follow the energetic trends observed in other species (Dawson and Taylor 1973, Baudinette et al. 1992, Kram and Dawson 1998), but it does not fully resolve the cost of generating force conundrum” Line 274-276.

      I have a few issues with the other half of this study (i.e. animal size effects). I would enjoy reading a new paragraph by these authors in the Discussion that considers the evolutionary origins and implications of such small safety factors. Surely, it would need to be speculative, but that's OK.

      We appreciate this comment from the reviewer, however could not extend the study to discuss animal size effects because, as we now note in the results: “The range of body masses may not be sufficient to detect an effect of mass on ankle moment in addition to the effect of speed.” Line 193

      Reviewer #2 (Public Review):

      Summary

      This is a fascinating topic that has intrigued scientists for decades. I applaud the authors for trying to tackle this enigma. In this manuscript, the authors primarily measured hopping biomechanics data from kangaroos and performed inverse dynamics. 

      While these biomechanical analyses were thorough and impressively incorporated collected anatomical data and an OpenSim model, I'm afraid that they did not satisfactorily address how kangaroos can hop faster and not consume more metabolic energy, unique from other animals. Noticeably, the authors did not collect metabolic data nor did they model metabolic rates using their modelling framework. Instead, they performed a somewhat traditional inverse dynamics analysis from multiple animals hopping at a self-selected speed.

      In the current study, we aimed to provide a joint-level explanation for the increases of tendon stress that are likely linked to metabolic energy consumption.

      We have now included a limitations section in the manuscript (See response to Rev 1). We plan to expand upon muscle level energetics in the future with a more detailed musculoskeletal model.

      Within these analyses, the authors largely focused on ankle EMA, discussing its potential importance (because it affects tendon stress, which affects tendon strain energy, which affects muscle mechanics) on the metabolic cost of hopping. However, EMA was roughly estimated (CoP was fixed to the foot, not measured) and did not detectably associate with hopping speed (see results). Yet, the authors interpret their EMA findings as though it systematically related with speed to explain their theory on how metabolic cost is unique in kangaroos vs. other animals.

      As noted in our methods, EMA was not calculated from a fixed centre of pressure (CoP). We did fix the medial-lateral position, owing to the fact that both feet contacted the force plate together, but the anteroposterior movement of the CoP was recorded by the force plate and thus allowed to move. We report the movement (or lack of movement) in our results. The anterior-posterior axis is the most relevant to lengthening or shortening the distance of the ‘out-lever’ R, and thereby EMA. It is necessary to assume a fixed medial-lateral position because a single force trace and CoP are recorded when two feet land on the force plate. The medial-lateral forces on each foot cancel out so there is no overall medial-lateral movement if the forces are symmetrical (e.g. if the kangaroo is hopping in a straight path and one foot is not in front of the other). We only used symmetrical trials so that the anterior-posterior movement of the CoP would be reliable. We have now added additional details into the text to clarify this.

      Indeed, the relationship between R and speed (and therefore EMA and speed) was not significant. However, the significant change in ankle height with speed, combined with no systematic change in COP at midstance, demonstrates that R would be greater at faster speeds. If we consider the nonsignificant relationship between R and speed to indicate that there is no change in R, then these two results conflict. We could not find a flaw in our methods, so instead concluded that the nonsignificant relationship between R and speed may be due to a small change in R being undetectable in our data. Taking both results into account, we believe it is more likely that there is a non-detectable change in R, rather than no change in R with speed, but we presented both results for transparency. We have added an additional section into the results to make this clearer (Line 177-185) “If we consider the nonsignificant relationship between R (and EMA) and speed to indicate that there is no change in R, then it conflicts with the ankle height and CoP result. Taking both into account, we think it is more likely that there is a small, but important, change in R, rather than no change in R with speed. It may be undetectable because we expect small effect sizes compared to the measurement range and measurement error (Suppl. Fig. 3h), or be obscured by a similar change in R with body mass. R is highly dependent on the length of the metatarsal segment, which is longer in larger kangaroos (1 kg BM corresponded to ~1% longer segment, P<0.001, R<sup>2</sup>=0.449). If R does indeed increase with speed, both R and r will tend to decrease EMA at faster speeds.”

      These speed vs. biomechanics relationships were limited by comparisons across different animals hopping at different speeds and could have been strengthened using a repeated measures design.

      There is significant variation in speed within individuals, not just between individuals. The preferred speed of kangaroos is 2-4.5 m/s, but most individuals showed a wide speed range within this. Eight of our 16 kangaroos had a maximum speed that was 1-2 m/s faster than their slowest trial. Repeated measures of these eight individuals comprise 78 out of the 100 trials. It would be ideal to collect data across the full range of speeds for all individuals, but it is not feasible in this type of experimental setting. Interference with animals such as chasing is dangerous to kangaroos as they are prone to adverse reactions to stress. We have now added additional information about the chosen hopping speeds into the results and methods sections to clarify this: “The kangaroos elected to hop between 1.99 and 4.48 m s<sup>-1</sup>, with a range of speeds and number of trials for each individual (Suppl. Fig. 9).” (Line 381-382)

      There are also multiple inconsistencies between the authors' theory on how mechanics affect energetics and the cited literature, which leaves me somewhat confused and wanting more clarification and information on how mechanics and energetics relate.

      We thank the reviewer for this comment. Upon rereading, we now understand the reviewer's position and have made substantial revisions to the introduction and discussion (see comments below).

      My apologies for the less-than-favorable review, I think that this is a neat biomechanics study - but am unsure if it adds much to the literature on the topic of kangaroo hopping energetics in its current form.

      Again we thank the reviewer for their time and appreciate their efforts to strengthen our manuscript.

      Reviewer #3 (Public Review):

      Summary:

      The goal of this study is to understand how, unlike other mammals, kangaroos are able to increase hopping speed without a concomitant increase in metabolic cost. They use a biomechanical analysis of kangaroo hopping data across a range of speeds to investigate how posture, effective mechanical advantage, and tendon stress vary with speed and mass. The main finding is that a change in posture leads to increasing effective mechanical advantage with speed, which ultimately increases tendon elastic energy storage and returns via greater tendon strain. Thus kangaroos may be able to conserve energy with increasing speed by flexing more, which increases tendon strain.

      Strengths:

      The approach and effort invested into collecting this valuable dataset of kangaroo locomotion is impressive. The dataset alone is a valuable contribution.

      Thank you!

      Weaknesses:

      Despite these strengths, I have concerns regarding the strength of the results and the overall clarity of the paper and methods used (which likely influences how convincingly the main results come across).

      (1) The paper seems to hinge on the finding that EMA decreases with increasing speed and that this contributes significantly to greater tendon strain estimated with increasing speed. It is very difficult to be convinced by this result for a number of reasons:

      It appears that kangaroos hopped at their preferred speed. Thus the variability observed is across individuals not within. Is this large enough of a range (either within or across subjects) to make conclusions about the effect of speed, without results being susceptible to differences between subjects? 

      Apologies, this was not clear in the manuscript. Kangaroos hopping at their preferred speed means we did not chase or startle them into high speeds to comply with ethics and enclosure limitations. Thus we did not record a wide range of speeds within the bounds of what kangaroos are capable of in the wild (up to 12 m/s), but for the range we did measure (~2-4.5 m/s), there is a large amount of variation in hopping speed within each individual kangaroo. Out of 16 individuals, eight individuals had a difference of 1-2 m/s between their slowest and fastest trials, and these kangaroos accounted for 78 out of 100 trials. Of the remainder, six individuals had three or fewer trials each, and two individuals had highly repeatable speeds (3 out of 4, and 6 out of 7 trials were within 0.5 m/s). We have now removed the terminology “preferred speed” e.g. line 115. We have added additional information about the chosen hopping speeds into the results and methods, including an appendix figure: “The kangaroos elected to hop between 1.99 and 4.48 m s<sup>-1</sup>, with a range of speeds and number of trials for each individual (Suppl. Fig. 9).” (Line 381-382)

      In the literature cited, what was the range of speeds measured, and was it within or between subjects?

      For other literature, to our knowledge the highest speed measured is ~9.5 m/s (see Suppl. Fig. 1b) and there were multiple measures for several individuals (see methods, Kram & Dawson 1998).

      Assuming that there is a compelling relationship between EMA and velocity, how reasonable is it to extrapolate to the conclusion that this increases tendon strain and ultimately saves metabolic cost?  They correlate EMA with tendon strain, but this would still not suggest a causal relationship (incidentally the p-value for the correlation is not reported). 

      The functions that underpin these results (e.g. moment = GRF*R) come from physical mechanics and geometry, rather than statistical correlations. Additionally, a p-value is not appropriate in the relationship between EMA and stress (rather than strain) because the relationship does not appear to be linear. We have made it clearer in the discussion that we are not proposing that the entire change in stress is caused by changes in EMA, but that the increase in GRF that naturally occurs with speed will also explain some of the increase in stress, along with other potential mechanisms. The discussion has been extensively revised to reflect this.

      Tendon strain could be increasing with ground reaction force, independent of EMA. Even if there is a correlation between strain and EMA, is it not a mathematical necessity in their model that all else being equal, tendon stress will increase as EMA decreases? I may be missing something, but nonetheless, it would be helpful for the authors to clarify the strength of the evidence supporting their conclusions.

      Yes, GRF also contributes to the increase in tendon stress in the mechanism we propose (Suppl. Fig. 8), see the formulas in Fig 6, and we have made this clearer in the revised discussion (see above comment).  You are correct that mathematically stress is inversely proportional to EMA, which can be observed in Fig. 7a, and we did find that EMA decreases. 
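
      The dependence described here can be written out explicitly. The following is a sketch assuming a quasi-static moment balance about the ankle. The symbols $F_t$ (tendon force), $A$ (tendon cross-sectional area) and $\sigma$ (tendon stress) are introduced here for illustration, while $r$, $R$, GRF and EMA follow the usage in this response.

```latex
\begin{align}
  F_t \, r &= \mathrm{GRF} \cdot R
    && \text{(moment balance about the ankle)} \\
  \mathrm{EMA} &= \frac{r}{R}
    && \text{(muscle moment arm over GRF moment arm)} \\
  \sigma &= \frac{F_t}{A} = \frac{\mathrm{GRF}}{\mathrm{EMA} \cdot A}
    && \text{(stress rises with GRF and falls with EMA)}
\end{align}
```

      Under these assumptions, for a given tendon area, any increase in stress must come from an increase in GRF, a decrease in EMA, or both.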

      The statistical approach is not well-described. It is not clear what the form of the statistical model used was and whether the analysis treated each trial individually or grouped trials by the kangaroo. There is also no mention of how many trials per kangaroo, or the range of speeds (or masses) tested. 

      The methods include the statistical model with the variables that we used, as well as the kangaroo masses (13.7 to 26.6 kg, mean: 20.9 ± 3.4 kg). We did not have a sufficient within-individual sample size to use a linear mixed effect model including subject as a random factor, thus all trials were treated individually. We have included this information in the results section.
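
      The exact model specification is given in the manuscript's methods and is not reproduced here. The sketch below only illustrates what treating every trial as an independent observation could look like, assuming a linear model with speed and body mass as predictors and no subject-level random effect. The function name, the model form, and the synthetic noise-free data are all assumptions made for illustration.

```python
def fit_linear_model(speed, mass, response):
    """Ordinary least squares for response ~ b0 + b1*speed + b2*mass.

    Every trial is one row; no per-animal grouping or random effect is used.
    """
    rows = [[1.0, s, m] for s, m in zip(speed, mass)]
    # Build the normal equations (X^T X) b = X^T y as an augmented 3x4 matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, response))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(3):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


# Synthetic, noise-free trials (hypothetical values, not data from the study).
speed = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]       # m/s
mass = [14.0, 22.0, 18.0, 26.0, 16.0, 24.0]  # kg
response = [5.0 + 3.0 * s + 0.5 * m for s, m in zip(speed, mass)]

intercept, speed_slope, mass_slope = fit_linear_model(speed, mass, response)
```

      Including mass alongside speed in the same fit is what lets a speed effect be separated from a mass effect, provided the two predictors are not strongly collinear in the sample.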

      We have now moved the range of speeds from the supplementary material to the results and figure captions. We have added information on the number of trials per kangaroo to the methods, and added Suppl. Fig. 9 showing the distribution of speeds per kangaroo.

      We did not group the data e.g. by using an average speed per individual for all their trials, or by comparing fast to slow groups for statistical analysis (the latter was only for display purposes in our figures, which we have now made clearer in the methods statistics section). 

      Related to this, there is no mention of how different speeds were obtained. It seems that kangaroos hopped at a self-selected pace, thus it appears that not much variation was observed. I appreciate the difficulty of conducting these experiments in a controlled manner, but this doesn’t exempt the authors from providing the details of their approach.

      Apologies, this was not clear in the manuscript. Kangaroos hopping at their preferred speed means we did not chase or startle them into high speeds to comply with ethics and enclosure limitations. Thus we did not record a wide range of speeds within the bounds of what kangaroos are capable of in the wild (up to 12 m/s). We have now removed the terminology “preferred speed” e.g. line 115. We have added additional information about the chosen hopping speeds into the results and methods, including an appendix figure (see above comment). (Line 381-382)

      Some figures (Figure 2 for example) present means for one of three speeds, yet the speeds are not reported (except in the legend) nor how these bins were determined, nor how many trials or kangaroos fit in each bin. A similar comment applies to the mass categories. It would be more convincing if the authors plotted the main metrics vs. speed to illustrate the significant trends they are reporting.

      Thank you for this comment. The bins are used only for display purposes and not within the statistical analysis. We have clarified this in the revised manuscript: “The data was grouped into body mass (small 17.6±2.96 kg, medium 21.5±0.74 kg, large 24.0±1.46 kg) and speed (slow 2.52±0.25 m s<sup>-1</sup>, medium 3.11±0.16 m s<sup>-1</sup>, fast 3.79±0.27 m s<sup>-1</sup>) subsets for display purposes only”. (Line 495-497)

      (2) The significance of the effects of mass is not clear. The introduction and abstract suggest that the paper is focused on the effect of speed, yet the effects of mass are reported throughout as well, without a clear understanding of the significance. This weakness is further exaggerated by the fact that the details of the subject masses are not reported.

      Indeed, the primary aim of our study was to explore the influence of speed, given the uncoupling of energy from hopping speed in kangaroos. We included mass to ensure that the effects of speed were not driven by body mass (i.e.: that larger kangaroos hopped faster). Subject masses were reported in the first paragraph of the methods, albeit some were estimated as outlined in the same paragraph.

      (3) The paper needs to be significantly re-written to better incorporate the methods into the results section. Since the results come before the methods, some of the methods must necessarily be described such that the study can be understood at some level without turning to the dedicated methods section. As written, it is very difficult to understand the basis of the approach, analysis, and metrics without turning to the methods.

      Placing the methods after the discussion is a requirement of the journal. We have incorporated some methods in the results where necessary, without being too repetitive or disruptive, e.g. in the Fig. 1 caption, and by specifying that we are only analysing EMA for the ankle joint.

      Reviewing Editor (Recommendations For The Authors):

      Below is a list of specific recommendations that the authors could address to improve the eLife assessment:

      (1) Based on the data presented and the fact that metabolic energy was not measured, the authors should temper their conclusions and statements throughout the manuscript regarding the link between speed and metabolic energy savings. We recommend adding text to the discussion summarizing the strengths and limitations of the evidence provided and suggesting future steps to more conclusively answer this mystery.

      There is a significant body of work linking metabolic energy savings to measured increases in tendon stress in macropods. However, the purpose of this paper was to address the unanswered questions about why tendon stress increases. We found that stress did not only increase due to GRF increasing with speed as expected, but also due to novel postural changes which decreased EMA. In the revised manuscript, we have tempered our conclusions to make it clearer that it is not just EMA affecting stress, and added limitations throughout the manuscript (see response to Rev 1). 

      (2) To provide stronger evidence of a link between speed, mechanics, and metabolic savings the authors can consider estimating metabolic energy expenditure from their OpenSim model. This is one suggestion, but the authors likely have other, possibly better ideas. Such a model should also be able to explain why the metabolic rate increases with speed during uphill hopping.

      Extending the model to provide direct metabolic cost estimates will be the goal of a future paper; however, the model does not have the detailed muscle characteristics needed to do this in the formulation presented here. It would be a very large undertaking which is beyond the scope of the current manuscript. As per the comment above, the results of this paper are not reliant on metabolic performance.

      (3) The authors attempt to relate the newly quantified hopping biomechanics to previously published metabolic data. However, all reviewers agree that the logic in many instances is not clear or contradictory. Could one potential explanation be that at slow speeds, forces and tendon strain are small, and thus muscle fascicle work is high? Then, with faster speeds, even though the cost of generating isometric force increases, this is offset by the reduction in the metabolic cost of muscular work. The paper could provide stronger support for their hypotheses with a much clearer explanation of how the kinematics relate to the mechanics and ultimately energy savings.

      In response to the reviewers' comments, we have substantially modified the discussion to provide a clearer rationale.

      (4) The methods and the effort expended to collect these data are impressive, but there are a number of underlying assumptions made that undermine the conclusions. This is due partly to the methods used, but also the paper's incomplete description of their methods. We provide a few examples below:

      It would be helpful if the authors could speak to the effect of the limited speeds tested and between-animal comparisons on the ability to draw strong conclusions from the present dataset.

      Throughout the discussion, the authors highlight the relationship between EMA and speed. However, this is misleading since there was no significant effect of speed on EMA. Speed only affected the muscle moment arm, r. At minimum, this should be clarified and the effect on EMA not be overstated. Additionally, the resulting implications on their ability to confidently say something about the effect of speed on muscle stress should be discussed. 

      We have now provided additional details in response to these concerns (see responses above). For instance, we added a supplementary figure showing the speed distribution per individual. The primary reviewer concern (that each kangaroo travelled at a single speed) was due to a miscommunication around the terminology “preferred”, which has now been corrected.

      We now elaborate in the results why we are not very concerned that EMA is insignificant. The statistical insignificance of EMA is ultimately due to the insignificance of the direct measurement of R, however, we now better explain in the results why we believe that this statistical insignificance is due to error/noise of the measurement which is relatively large compared to the effect size. Indirect indications of how R may increase with speed (via ankle height from the ground) are statistically significant. Lines 177-185. 

      We consider this worth reporting because, for instance, an 18% change in EMA will be undetectable by measurement, but corresponds to an 18% change in tendon stress which is measurable and physiologically significant (safety factor would decrease from 2 to 1.67).  We presented both significant and insignificant results for transparency. 

      We have also discussed this within a revised limitations section of the manuscript (Line 311-328).

      Reviewer #1 (Recommendations For The Authors):

      Title: I would cut the first half of the title. At least hedge it a bit. "Clues" instead of "Unlocking the secrets".

      We have revised the title to: “Postural adaptations may contribute to the unique locomotor energetics seen in hopping kangaroos”

      In my comments, ... typically indicates a stylistic change suggested to the text.

      Overall, the paper covers speed and size. Unfortunately, the authors were not 100% consistent in the order of presenting size then speed, or speed then size. Just choose one and stick with it.

      We have attempted to keep the order of presenting size and speed consistent, however there are several cases where this would reduce the readability of the manuscript and so in some cases this may vary. 

      One must admit that there is a lot of vertical scatter in almost all of the plots. I understand that these animals were not in a lab on a treadmill at a controlled speed and the animals wear fur coats so marker placements vary/move etc. But the spread is quite striking, e.g. Figure 5a the span at one speed is almost 10x. Can the authors address this somewhere? Limitations section?

      The variation seen likely results from attempting to display data in a 2D format, when it is in fact the result of multiple variables, including speed, mass, stride frequency and subject-specific lengths. Slight variations in these would be expected to produce some noise around the mean, and we think it is important to consider this while showing the more dominant effects.

      In many locations in the manuscript, the term "work" is used, but rarely if ever specified that this is the work "per hop". The big question revolves around the rate of metabolic energy consumption (i.e. energy per time or average metabolic power), one must not forget that hop frequency changes somewhat across speed, so work per hop is not the final calculation.

      Thank you for this comment. We have now explicitly stated work per hop in figure captions and in the results (line 208). The change in stride frequency at this range of speeds is very small, particularly compared to the variance in stride frequency (Suppl. Fig. 1d), which is consistent with other researchers who found that stride frequency was constant or near constant in macropods at analogous speeds (e.g. Dawson and Taylor 1973, Baudinette et al. 1987). 

      Line 61 ....is likely related.

      Added “likely” (line 59)

      Line 86 I think the Allen reference is incomplete. Wasn't it in J Exp Biology?

      Thank you. Changed. 

      Line 122 ... at faster speeds and in larger individuals.

      Changed: “We hypothesised that (i) the hindlimb would be more crouched at faster speeds, primarily due to the distal hindlimb joints (ankle and metatarsophalangeal), independent of changes with body mass” (Line 121-122).

      Line 124 I found this confusing. Try to re-word so that you explain you mean more work done by the tendons and less by the ankle musculature.

      Amended: “changes in moment arms resulting from the change in posture would contribute to the increase in tendon stress with speed, and may thereby contribute to energetic savings by increasing the amount of positive and negative work done by the ankle without requiring additional muscle work” (Line 123)

      Line 129 hopefully "braking" not "breaking"!

      Thank you. Fixed. (Line 130)

      Line 129 specify fore-aft horizontal force.

      Added "fore-aft" to "negative fore-aft horizontal component" (Line 130-131)

      Line 130 add something like "of course" or "naturally" since if there is zero fore-aft force, the GRF vector of course must be vertical. 

      Added "naturally" (Line 132)

      Line 138 clarify that this section is all stance phase. I don't recall reading any swing phase data.

      Changed to: "Kangaroo hindlimb stance phase kinematics varied…" (Line 141)

      Line 143 and elsewhere. I found the use of dorsiflexion and plantarflexion confusing. In Figure 3, I see the ankle never flexing more than 90 degrees. So, the ankle joint is always in something of a flexed position, though of course it flexes and extends during contact. I urge the authors to simplify to flexion/extension and drop the plantar/dorsi.

      We have edited this section to describe both movements as greater extension (plantarflexion). (Line 147). We have further clarified this in the figure caption for figure 3.  

      Line 147 ...changes were…

      Fixed, line 150

      Line 155 I'm a bit confused here. Are the authors calculating some sort of overall EMA or are they saying all of the individual joint EMAs all decreased?

      Thank you, we clarified that it is at the ankle. Line 158

      Line 158 since kangaroos hop and are thus positioned high and low throughout the stance phase, try to avoid using "high" and "low" for describing variables, e.g. GRF or other variables. Just use "greater/greatest" etc.

      Thanks for this suggestion. We have changed "higher" into "greater" where appropriate throughout the manuscript e.g. line 161

      Lines 162 and 168 same comment here about "r" and "R". Do you mean ankle or all joints?

      Clarified that it is the gastrocnemius and plantaris r, and the R to the ankle. (Lines 164-165)

      Line 173 really, ankle height?

      Added: ankle height is "vertical distance from the ground". Line 177

      Line 177 is this just the ankle r?

      Added "of the ankle" line 158 and “Achilles” line 187 

      Line 183 same idea, which tendon/tendons are you talking about here?

      Added "Achilles" to be more clear (Line 187)

      Line 195 substitute "converted" for "transferred".

      Done (Line 210)

      Line 223 why so vague? i.e. why use "may"? Believe in your data. ...stress was also modulated by changes....

      Changed "may" to "is"

      Line 229 smaller ankle EMA (especially since you earlier talked about ankle "height").

      Changed “lower” to “smaller” Line 254

      Line 236 ...and return elastic energy…

      Added "elastic" line 262

      Line 244 IMPORTANT: Need to explain this better! I think you are saying that the net work at the ankle is staying the same across speed, BUT it is the tendons that are storing and returning that work, it's not that the muscles are doing a lot of negative/positive work.

      Changed: “The consistent net work observed among all speeds suggests the ankle extensor muscle-tendon units are performing similar amounts of ankle work independent of speed, which would predominantly be done by the tendon.” (Line 270-272)

      Line 258-261 I think here is where you are over-selling the data/story. Although you do say "a" mechanism (and not "the" mechanism, you still need to deal with the cost of generating more force and generating that force faster.

      We removed this sentence and replaced it with a discussion of the cost of generating force hypothesis, and alternative scenarios for how force and metabolics could be uncoupled.

      Line 278 "the" tendon? Which tendon?

      Added "Achilles"

      Line 289. I don't think one can project into the past.

      Changed “projected” to "estimated"

      Line 303 no problem, but I've never seen a paper in biology where the authors admit they don't know what species they were studying!

      Can’t be helped unfortunately. It is an old dataset and there aren’t photos of every kangaroo. Fortunately, from the grey and red kangaroos we can distinguish between, we know there are no discernible species effects on the data. 

      Lines 304-306 I'm not clear here. Did you use vertical impulse (and aerial time) to calculate body weight? Or did you somehow use the braking/propulsive impulse to calculate mass? I would have just put some apples on the force plate and waited for them to stop for a snack.

      Stationary weights were recorded for some kangaroos which did stand on the force plate long enough, but unfortunately not all of them were willing to do so. In those cases, yes, we used impulse from steady-speed trials to estimate mass. We cross-checked by estimating mass from segment lengths (as size and mass are correlated). This is outlined in the first paragraph of the methods.
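
      The impulse-based estimate relies on a simple balance: over one complete steady-speed hop cycle, the vertical ground reaction impulse must equal body weight multiplied by the cycle duration. A minimal Python sketch of that arithmetic follows. The function names, the half-sine force trace, and every numerical value are illustrative assumptions, not the authors' code or data.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def vertical_impulse(force_n, dt_s):
    """Integrate a vertical force trace (N) with the trapezoidal rule."""
    return sum(0.5 * (a + b) * dt_s for a, b in zip(force_n, force_n[1:]))


def mass_from_impulse(force_n, dt_s, cycle_time_s):
    """Estimate body mass (kg) from the vertical impulse of one steady hop cycle.

    At steady speed there is no net change in vertical momentum over a cycle,
    so impulse = mass * G * cycle_time.
    """
    return vertical_impulse(force_n, dt_s) / (G * cycle_time_s)


# Hypothetical trial: a 20 kg animal, 0.4 s hop cycle, 0.1 s ground contact.
true_mass = 20.0
cycle_time = 0.4
contact_time = 0.1

# Half-sine stance force whose impulse balances body weight over the cycle.
peak_force = true_mass * G * cycle_time * math.pi / (2 * contact_time)
n = 1000
dt = contact_time / n
force = [peak_force * math.sin(math.pi * i / n) for i in range(n + 1)]

estimated_mass = mass_from_impulse(force, dt, cycle_time)
print(f"estimated mass: {estimated_mass:.2f} kg")
```

      The estimate is only as good as the steady-speed assumption: any net acceleration or deceleration within the trial would bias the recovered mass, which is presumably why a cross-check against segment lengths is useful.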

      Lines 367 & 401 When you use the word "scaled" do you mean you assumed geometric similarity?

      No, rather than geometric scaling, we allowed scaling to individual dimensions by using the markers at midstance for measurements. We have amended the paragraph to clarify that the shape of the kangaroo changes and that mass distribution was preserved during the shape change (line 441-446) 

      Lines 381-82 specify "joint work"

      Added "joint work"  (Line 457)

      Figure 1 is gorgeous. Why not add the CF equation to the left panel of the caption?

      We decided to keep the information in the figure caption. “Total leg length was calculated as the sum of the segment lengths (solid black lines) in the hindlimb and compared to the pelvis-to-toe distance (dashed line) to calculate the crouch factor”

      Figure 2 specify Horizontal fore-aft.

      Done

      Figure 3g I'd prefer the same Min. Max Flexion vertical axis labels as you use for hip & knee.

      While we appreciate the reviewer trying to increase the clarity of this figure, we have left it as plantar/dorsi flexion since these are recognised biomechanical terms. To avoid confusion, we have further defined these in the figure caption “For (f-g), increased plantarflexion represents a decrease in joint flexion, while increased dorsiflexion represents increased flexion of the joint.”

      Figure 4. I like it and I think that you scaled all panels the same, i.e. 400 W is represented by the same vertical distance in all panels. But if that's true, please state so in the Caption. It's remarkable how little work occurs at the hip and knee despite the relatively huge muscles there.

      It is true that the y axes are all at the same scale. We have added this to the caption.

      Figure 5 Caption should specify "work per hop".

      Added

      Figure 7 is another beauty.

      Thank you!

      Supplementary Figure 3 is this all ANKLE? Please specify.

      Clarified that it is the gastrocnemius and plantaris r, and the R to the ankle.

      Reviewer #2 (Recommendations For The Authors):

      To 'unlock the secrets of kangaroo locomotor energetics' I expected the authors to measure the secretive outcome variable, metabolic rate using laboratory measures. Rather, the authors relied on reviewing historic metabolic data and collecting biomechanics data across different animals, which limits the conclusions of this manuscript.

      We have revised the title to make it clearer that we are investigating a subset of the energetics problem, specifically posture. “Postural adaptations may contribute to the unique locomotor energetics seen in hopping kangaroos.” We have also substantially modified the discussion to temper the conclusions from the paper.

      After reading the hypothesis, why do the authors hypothesize about joint flexion and not EMA? Because the following hypothesis discusses the implications of moment arms on tendon stress, EMA predictions are more relevant (and much more discussed throughout the manuscript).

      Ankle and MTP angles are the primary drivers of changes in r, R & thus, EMA. We used a two-part hypothesis to capture this. We have rephrased the hypotheses: “We hypothesised that (i) the hindlimb would be more crouched at faster speeds, primarily due to the distal hindlimb joints (ankle and metatarsophalangeal), independent of changes with body mass, and (ii) changes in moment arms resulting from the change in posture would contribute to the increase in tendon stress with speed, and may thereby contribute to energetic savings by increasing the amount of positive and negative work done by the ankle without requiring additional muscle work.”

      If there were no detectable effects of speed on EMA, are kangaroos mechanically like other animals (Biewener Science 89 & JAP 04) who don't vary EMA across speeds? Despite no detectible effects, the authors state [lines 228-229] "we found larger and faster kangaroos were more crouched, leading to lower ankle EMA". Can the authors explain this inconsistency? Lines 236 "Kangaroos appear to use changes in posture and EMA". I interpret the paper as EMA does not change across speed.

      Apologies, we did not sufficiently explain this originally. We now explain in the results our reasoning behind our belief that EMA and R may change with speed. “If we consider the nonsignificant relationship between R (and EMA) and speed to indicate that there is no change in R, then it conflicts with the ankle height and CoP result. Taking both into account, we think it is more likely that there is a small, but important, change in R, rather than no change in R with speed. It may be undetectable because we expect small effect sizes compared to the measurement range and measurement error (Suppl. Fig. 3h), or be obscured by a similar change in R with body mass. R is highly dependent on the length of the metatarsal segment, which is longer in larger kangaroos (1 kg BM corresponded to ~1% longer segment, P<0.001, R²=0.449). If R does indeed increase with speed, both R and r will tend to decrease EMA at faster speeds.” (Line 177-185)

      Lines 335-339: "We assumed the force was applied along phalanx IV and that there was no medial or lateral movement of the centre of pressure (CoP)". I'm confused, did the authors not measure CoP location with respect to the kangaroo limb? If not, this simple estimation undermines primary results (EMA analyses).

      We have changed "The anterior or posterior movement of the CoP was recorded by the force plate" to read: "The fore-aft movement of the CoP was recorded by the force plate within the motion capture coordinate system" (Line 406-407) and added more justification for fixing the CoP movement in the other axis: “It was necessary to assume the CoP was fixed in the medial-lateral axis because when two feet land on the force plate, the lateral forces on each foot are not recorded, and indeed cancel if the forces are symmetrical (i.e. if the kangaroo is hopping in a straight path and one foot is not in front of the other). We only used symmetrical trials to ensure reliable measures of the anterior-posterior movement of the CoP.” (Line 408-413)

      The introduction makes many assertions about the generalities of locomotion and the relationship between mechanics and energetics. I'm afraid that the authors are selectively choosing references without thoroughly evaluating alternative theories. For example, Taylor, Kram, & others have multiple papers suggesting that decreasing EMA and increasing muscle force (and active muscle volume) increase metabolic costs during terrestrial locomotion. Rather, the authors suggest that decreasing EMA and increasingly high muscle force at faster speeds don't affect energetics unless muscle work increases substantially (paragraph 2)? If I am following correctly, does this theory conflict with active muscle volume ideas that are peppered throughout this manuscript?

      Yes, as you point out, the same mechanism does lead to different results in kangaroos vs humans, for instance, but this is not a contradiction. In all species, decreasing EMA will result in an increase in muscle force due to less efficient leverage (i.e. lower EMA) of the muscles, and the muscle-tendon unit will be required to produce more force to balance the joint moment. As a consequence, human muscles activate a greater volume in order for the muscle-tendon unit to increase muscle work and produce enough force. We are proposing that in kangaroos, the increase in work is done by the Achilles tendon rather than the muscles. Previous research suggests that macropod ankle muscles contract isometrically or that the fibres do not shorten more at faster speeds, i.e. muscle work does not increase with speed. Instead, the additional force seems to come from the tendon storing and subsequently returning more strain energy (indicated by higher stress). We found that the increase in tendon stress comes from higher ground force at faster speeds, and from the kangaroo adopting a more crouched posture, which increases the tendons’ stresses compared to an upright posture for a given speed (think of this as increasing the tendon’s stress capacity). We have substantially revised the discussion to highlight this.

      Similarly, does increased gross or net tendon mechanical energy storage & return improve hopping energetics? Would more tendon stress and strain energy storage with a given hysteresis value also dissipate more mechanical energy, requiring leg muscles to produce more net work? Does net or gross muscle work drive metabolic energy consumption?

      Based on the cost of generating force hypothesis, we think that gross muscle work would be linked to driving metabolic energy consumption. Our idea here is that the total body work is a product of the work done by the tendon and the muscle combined. If the tendon has the potential to do more work, then the total work can increase without muscle work needing to increase.

      The results interpret speed effects on biomechanics, but each kangaroo was only collected at 1 speed. Are inter-animal comparisons enough to satisfy this investigation?

      We have added a figure (Suppl Fig 9) to demonstrate the distribution of speed and number of trials per kangaroo. We have also removed "preferred" from the manuscript as this seems to cause confusion. Most kangaroos travelled at a range of “casual” speeds.

      Abstract: Can the authors more fully connect the concept of tendon stress and low metabolic rates during hopping across speeds? Surely, tendon mechanics don't directly drive the metabolic cost of hopping, but they affect muscle mechanics to affect energetics.

      Amended to: "This phenomenon may be related to greater elastic energy savings due to increasing tendon stress; however, the mechanisms which enable the rise in stress, without additional muscle work remain poorly understood." (Lines 25-27).

      The topic sentence in lines 61-63 may be misleading. The ensuing paragraph does not substantiate the topic sentence stating that ankle MTUs decouple speeds and energetics.

      We added "likely" to soften the statement. (Line 59)

      Lines 84-86: In humans, does more limb flexion and worse EMA necessitate greater active muscle volume? What about muscle contractile dynamics - See recent papers by Sawicki & colleagues that include Hill-type muscle mechanics in active muscle volume estimates.

      Added: “Smaller EMA requires greater muscle force to produce a given force on the ground, thereby demanding a greater volume of active muscle, and presumably greater metabolic rates than larger EMA for the same physiology”. (Line 80-82)

      Lines 106: can you give the context of what normal tendon safety factors are?

      Good idea. Added: "far lower than the typical safety factor of four to eight for mammalian tendons (Ker et al. 1988)." Line 106-107

      I thought EMA was relatively stable across speeds as per Biewener [Science & JAP '04]. However the authors gave an example of an elephant to suggest that it is typically inversely related to speed. Can the authors please explain the disconnect and the most appropriate explanation in this paragraph?

      Knee EMA in particular changed with speed in Biewener 2004. What is “typical” probably depends on the group of animals studied; e.g., cursorial quadrupedal mammals generally seem to maintain constant EMA, but other groups do not.

      These cases are presented to show a range of consequences for changing EMA (usually with mass, but sometimes with speed). We have made several adjustments to the paragraph to make this clearer. Lines 85-93.

      The results depend on the modeled internal moment arm (r). How confident are the authors in their little r prediction? Considering complications of joint mechanics in vivo including muscle bulging. Holzer et al. '20 Sci Rep demonstrated that different models of the human Achilles tendon moment arm predict vastly different relationships between the moment arm and joint angle.

      Our values for r and EMA closely align with previous papers which measured/calculated these values in kangaroos, such as Kram 1998, and thus we are confident in our interpretation.

      This is a misleading results sentence: Small decreases in EMA correspond to a nontrivial increase in tendon stress, for instance, reducing EMA from 0.242 (mean minimum EMA of the slow group) to 0.206 (mean minimum EMA of the fast group) was associated with an ~18% increase in tendon stress. The authors could alternatively say that a ~15% decrease in EMA was associated with an ~18% increase in tendon stress, which seems pretty comparable.

      Thank you for pointing this out, it is important that it is made clearer. Although the change in relative magnitude is approximately the same (as it should be), this does not detract from the importance. The "small decrease in EMA" is referring to the absolute values, particularly in respect to the measurement error/noise. The difference is small enough to have been undetectable with other methods used in previous studies. We have amended the sentence to clarify this.

      It now reads: “Subtle decreases in EMA which may have been undetected in previous studies correspond to discernible increases in tendon stress. For instance, reducing EMA from 0.242 (mean minimum EMA of the slow group) to 0.206 (mean minimum EMA of the fast group) was associated with an increase in tendon stress from ~50 MPa to ~60 MPa, decreasing safety factor from 2 to 1.67 (where 1 indicates failure), which is both measurable and physiologically significant.” (Line 195-200)
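
      The numbers quoted in this exchange can be checked with a few lines of arithmetic. The Python sketch below uses only the values stated above and makes two assumptions that are not taken from the manuscript: that tendon force scales as 1/EMA for a fixed ground reaction force, and that the failure stress is 100 MPa (the value implied by a safety factor of 2 at ~50 MPa). The variable and function names are illustrative.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Values quoted in the response above.
ema_slow, ema_fast = 0.242, 0.206        # mean minimum EMA, slow vs fast group
stress_slow, stress_fast = 50.0, 60.0    # approximate tendon stress in MPa

# Assumption: failure stress implied by "safety factor 2 at ~50 MPa".
failure_stress = 2.0 * stress_slow

ema_change = percent_change(ema_slow, ema_fast)

# Assumption: for a fixed ground reaction force, tendon force (and stress)
# scales with 1/EMA, so the expected stress change follows from EMA alone.
stress_change_from_ema = percent_change(1.0 / ema_slow, 1.0 / ema_fast)

safety_factor_slow = failure_stress / stress_slow
safety_factor_fast = failure_stress / stress_fast

print(f"EMA change: {ema_change:.1f}%")
print(f"Stress change expected from EMA alone: {stress_change_from_ema:.1f}%")
print(f"Safety factor: {safety_factor_slow:.2f} -> {safety_factor_fast:.2f}")
```

      Under these assumptions the roughly 15% drop in EMA and the roughly 18% rise in stress are two views of the same ratio (0.242/0.206), which is the reviewer's point. The authors' point is that the absolute change in EMA is small relative to measurement noise.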

      Lines 243-245: "The consistent net work observed among all speeds suggests the ankle extensors are performing similar amounts of ankle work independent of speed." If this is true, and presumably there is greater limb work performed on the center of mass at faster speeds (Donelan, Kram, Kuo), do more proximal leg joints increase work and energy consumption at faster speeds?

      The skin over the proximal leg joints (knee and hip) moves too much to get reliable measures of EMA from the ratio of moment arms. This will be pursued in future work when all muscles are incorporated in the model so knee and hip EMA can be determined from muscle force.

      We have added a limitations and considerations paragraph to the manuscript: “Finally, we did not determine whether the EMA of proximal hindlimb joints (which are more difficult to track via surface motion capture markers) remained constant with speed. Although the hip and knee contribute substantially less work than the ankle joint (Fig. 4), the majority of kangaroo skeletal muscle is located around these proximal joints. A change in EMA at the hip or knee could influence a larger muscle mass than at the ankle, potentially counteracting or enhancing energy savings in the ankle extensor muscle-tendon units. Further research is needed to understand how posture and muscles throughout the whole body contribute to kangaroo energetics.” (Line 321-328)

      Lines 245-246: "Previous studies using sonomicrometry have shown that the muscles of tammar wallabies do not shorten considerably during hops, but rather act near-isometrically as a strut" Which muscles? All muscles? Extensors at a single joint?

      Added "gastrocnemius and plantaris" Line 164-165

      Lines 249-254: "The cost of generating force hypothesis suggests that faster movement speeds require greater rates of muscle force development, and in turn greater cross-bridge cycling rates, driving up metabolic costs (Taylor et al. 1980, Kram and Taylor 1990). The ability for the ankle extensor muscle fibres to remain isometric and produce similar amounts of work at all speeds may help explain why hopping macropods do not follow the energetic trends observed in quadrupedal species." These sentences confuse me. Kram & Taylor's cost of force-generating hypothesis assumes that producing the same average force over shorter contact times increases metabolic rate. How does 'similar muscle work' across all speeds explain the ability of macropods to use unique energetic trends in the cost of force-generating hypothesis context?

      Thank you for highlighting this confusion. We have substantially revised the discussion to clarify where the mechanisms presented deviate from the cost of generating force hypothesis. Lines 270-309

      Reviewer #3 (Recommendations For The Authors):

      In addition to the points described in the public review, I have additional, related, specific comments:

      (1) Results: Please refer to the hypotheses in the results, and relate the findings back to the hypotheses.

      We now relate the findings back to the hypotheses 

      Line 142 “In partial support of hypothesis (i), greater masses and faster speeds were associated with more crouched hindlimb postures (Fig. 3a,c).”.

      Lines 205-206: “The increase in tendon stress with speed, facilitated in part by the change in moment arms by the shift in posture, may explain changes in ankle work (c.f. Hypothesis (ii)).” 

      (2) Results: please provide the main statistical results either in-line or in a table in the main text.

      We (the co-authors) have discussed this at length, and have agreed that the manuscript is far more readable in the format whereby most statistics lie within the supplementary tables, otherwise a reader is met with a wall of statistics. We only include values in the main text when the magnitude is relevant to the arguments presented in the results and discussion.

      (3) Line 140: Describe how 'crouched' was defined.

      We have now added a brief definition of ‘Crouch factor’ after the figure caption. (Line 143) (Fig. 3a,c; where crouch factor is the ratio of total limb length to pelvis to toe distance).
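
      The definition of crouch factor given above (total limb length divided by pelvis-to-toe distance) can be written out directly from marker positions. The Python sketch below is illustrative only: the marker coordinates are invented 2D values in metres, and the function name is hypothetical rather than taken from the authors' analysis.

```python
import math


def crouch_factor(markers):
    """Ratio of summed segment lengths to straight-line pelvis-to-toe distance.

    markers: (x, y) positions ordered from proximal to distal,
    e.g. pelvis, knee, ankle, MTP, toe.
    """
    total_limb_length = sum(math.dist(a, b) for a, b in zip(markers, markers[1:]))
    pelvis_to_toe = math.dist(markers[0], markers[-1])
    return total_limb_length / pelvis_to_toe


# Hypothetical marker sets (metres), pelvis first and toe last.
straight_limb = [(0.0, 1.0), (0.0, 0.7), (0.0, 0.3), (0.0, 0.1), (0.0, 0.0)]
crouched_limb = [(0.0, 0.6), (0.25, 0.45), (-0.1, 0.2), (0.05, 0.05), (0.15, 0.0)]

cf_straight = crouch_factor(straight_limb)
cf_crouched = crouch_factor(crouched_limb)

print(f"straight limb: {cf_straight:.2f}")
print(f"crouched limb: {cf_crouched:.2f}")
```

      With this definition a fully extended limb gives a value of 1 and the value grows as the joints flex, so a larger crouch factor means a more crouched posture.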

      (4) Line 162: This seems to be a main finding and should be a figure in the main text not supplemental. Additionally, Supplementary Figures 3a and b do not show this finding convincingly. There should be a figure plotting r vs speed and r vs mass.

      The combination of r and R are represented in the EMA plot in the main text. The r and R plots are relegated to the supplementary because the main text is already very crowded. Thank you for the suggestion for the figure plotting r and R versus speed, this is now included as Suppl. Fig. 3h.

      (5) Line 166: Supplementary Figure 3g does not show the range of dorsiflexion angles as a function of speed. It shows r vs dorsiflexion angle. Please correct.

      Thanks for noticing this, it was supposed to reference Fig 3g rather than Suppl Fig 3g in the sentence regarding speed. We have fixed this, Line 170. 

      We have added a reference to Suppl Fig 3 on Line 169 as this shows where the peak in r with ankle angle occurs (114.4 degrees).

      (6) Line 184: Where are the statistical results for this statement?

      The relationship between stress and EMA does not appear to be linear, thus we only present R² for the power relationship rather than a p-value.

      (7) Line 192: The authors should explain how joint work and power relate/support the overall hypotheses. This section also refers to Figures 4 and 5 even though Figures 6 and 7 have already been described. Please reorganize.

      We have added a sentence at the end of the work and power section to mention hypothesis (ii) and lead into the discussion where it is elaborated upon. 

      “The increase in positive and negative ankle work may be due to the increase in tendon stress rather than additional muscle work.” Line 219-220 We have rearranged the figure order.

      (8) The statistics are not reported in the main text, but in the supplementary tables. If a result is reported in the main text, please report either in-line or with a table in the main text.

      We leave most statistics in the supplementary tables to preserve the readability of the manuscript. We only include values in the main text when the magnitude is relevant to the arguments raised in the results and discussion.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This paper presents results from four independent experiments, each of which tests for rhythmicity in auditory perception. The authors report rhythmic fluctuations in discrimination performance at frequencies between 2 and 6 Hz. The exact frequency depends on the ear and experimental paradigm, although some frequencies seem to be more common than others.

      Strengths:

      The first sentence in the abstract describes the state of the art perfectly: "Numerous studies advocate for a rhythmic mode of perception; however, the evidence in the context of auditory perception remains inconsistent". This is precisely why the data from the present study is so valuable. This is probably the study with the highest sample size (total of > 100 in 4 experiments) in the field. The analysis is very thorough and transparent, due to the comparison of several statistical approaches and simulations of their sensitivity. Each of the experiments differs from the others in a clearly defined experimental parameter, and the authors test how this impacts auditory rhythmicity, measured in pitch discrimination performance (accuracy, sensitivity, bias) of a target presented at various delays after noise onset.

      Weaknesses:

      (1) The authors find that the frequency of auditory perception changes between experiments. I think they could exploit differences between experiments better to interpret and understand the obtained results. These differences are very well described in the Introduction, but don't seem to be used for the interpretation of results. For instance, what does it mean if perceptual frequency changes from between- to within-trial pitch discrimination? Why did the authors choose this experimental manipulation? Based on differences between experiments, is there any systematic pattern in the results that allows conclusions about the roles of different frequencies? I think the Discussion would benefit from an extension to cover this aspect.

      We believe that interpreting these differences remains difficult and a precise, detailed (and possibly mechanistic) interpretation is beyond the goal of the present study. The main goal of this study was to explore the consistency and variability of effects across variations of the experimental design and samples of participants. Interpreting specific effects, e.g. at particular frequencies, would make sense mostly if differences between experiments have been confirmed in a separate reproduction. Still, we do provide specific arguments for why differences in the outcome between different experiments, e.g. with and without explicit trial initialization by the participants, could be expected. See lines 91ff in the introduction and 786ff in the discussion.

      (2) The Results give the impression of clear-cut differences in relevant frequencies between experiments (e.g., 2 Hz in Experiment 1, 6 Hz in Exp 2, etc), but they might not be so different. For instance, a 6 Hz effect is also visible in Experiment 1, but it just does not reach conventional significance. The average across the three experiments is therefore very useful, and also seems to suggest that differences between experiments are not very pronounced (otherwise the average would not produce clear peaks in the spectrum). I suggest making this point clearer in the text.

      We have revised the conclusions to note that the present data do not support clear cut differences between experiments. For this reason we also refrain from detailed interpretations of specific effects, as suggested by this reviewer in point 1 above.

      (3) I struggle to understand the hypothesis that rhythmic sampling differs between ears. In most everyday scenarios, the same sounds arrive at both ears, and the time difference between the two is too small to play a role for the frequencies tested. If both ears operate at different frequencies, the effects of the rhythm on overall perception would then often cancel out. But if this is the case, why would the two ears have different rhythms to begin with? This could be described in more detail.

      This hypothesis was not invented by us, but in essence put forward in previous work. The study by Ho et al. CurrBiol 2017 has reported rhythmic effects at different frequencies in the left and right ears, and we here tried to reproduce these effects. One could speculate about an ear-difference based on studies reporting a right-ear advantage in specific listening tasks, and the idea that different time scales of rhythmic brain activity may specifically prevail in the left and right cortical hemispheres; hence it does not seem improbable that there could be rhythmic effects in both ears at different frequencies. We note this in the introduction, l. 65ff.

      Reviewer #2 (Public review):

      Summary:

      The current study aims to shed light on why previous work on perceptual rhythmicity has led to inconsistent results. They propose that the differences may stem from conceptual and methodological issues. In a series of experiments, the current study reports perceptual rhythmicity in different frequency bands that differ between different ear stimulations and behavioral measures.

      The study suggests challenges regarding the idea of universal perceptual rhythmicity in hearing.

      Strengths:

      The study aims to address differences observed in previous studies about perceptual rhythmicity. This is important and timely because the existing literature provides quite inconsistent findings. Several experiments were conducted to assess perceptual rhythmicity in hearing from different angles. The authors use sophisticated approaches to address the research questions.

      Weaknesses:

      (1) Conceptional concerns:

      The authors place their research in the context of a rhythmic mode of perception. They also discuss continuous vs rhythmic mode processing. Their study further follows a design that seems to be based on paradigms that assume a recent phase in neural oscillations that subsequently influence perception (e.g., Fiebelkorn et al.; Landau & Fries). In my view, these are different facets in the neural oscillation research space that require a bit more nuanced separation. Continuous mode processing is associated with vigilance tasks (work by Schroeder and Lakatos; reduction of low frequency oscillations and sustained gamma activity), whereas the authors of this study seem to link it to hearing tasks specifically (e.g., line 694). Rhythmic mode processing is associated with rhythmic stimulation by which neural oscillations entrain and influence perception (also, Schroeder and Lakatos; greater low-frequency fluctuations and more rhythmic gamma activity). The current study mirrors the continuous rather than the rhythmic mode (i.e., there was no rhythmic stimulation), but even the former seems not fully fitting, because trials are 1.8 s short and do not really reflect a vigilance task. Finally, previous paradigms on phase-resetting reflect more closely the design of the current study (i.e., different times of a target stimulus relative to the reset of an oscillation). This is the work by Fiebelkorn et al., Landau & Fries, and others, which do not seem to be cited here, which I find surprising. Moreover, the authors would want to discuss the role of the background noise in resetting the phase of an oscillation, and the role of the fixation cross also possibly resetting the phase of an oscillation. Regardless, the conceptional mixture of all these facets makes interpretations really challenging. The phase-reset nature of the paradigm is not (or not well) explained, and the discussion mixes the different concepts and approaches. 

      I recommend that the authors frame their work more clearly in the context of these different concepts (affecting large portions of the manuscript).

      Indeed, the paradigms used here and in many similar previous studies incorporate an aspect of phase-resetting, as the presentation of a background noise may effectively reset ongoing auditory cortical processes. Studies trying to probe for rhythmicity in auditory perception in the absence of any background noise have not shown any effect (Zoefel and Heil, 2013), perhaps because the necessary rhythmic processes along auditory pathways are only engaged when some sound is present. We now discuss these points, and also acknowledge the mentioned studies in the visual system; l. 57.

      (2) Methodological concerns:

      The authors use a relatively unorthodox approach to statistical testing. I understand that they try to capture and characterize the sensitivity of the different analysis approaches to rhythmic behavioral effects. However, it is a bit unclear what meaningful effects are in the study. For example, the bootstrapping approach that identifies the percentage of significant variations of sample selections is rather descriptive (Figures 5-7). The authors seem to suggest that 50% of the samples are meaningful (given the dashed line in the figure), even though this is rarely reached in any of the analyses. Perhaps >80% of samples should show a significant effect to be meaningful (at least to my subjective mind). To me, the low percentage rather suggests that there is not too much meaningful rhythmicity present. 

      We note that there is no clear consensus on what fraction of experiments should be expected or how this way of quantifying effects should be precisely valued (l. 441ff). However, we now also clearly acknowledge in the discussion that the effective prevalence is not very high (l. 663).

      I suggest that the authors also present more traditional, perhaps multi-level, analyses: Calculation of spectra, binning, or single-trial analysis for each participant and condition, and the respective calculation of the surrogate data analysis, and then comparison of the surrogate data to the original data on the second (participant) level using t-tests. I also thought the statistical approach undertaken here could have been a bit more clearly/didactically described as well.

      We here realize that our description of the methods was possibly not fully clear. We do follow the strategy as suggested by this reviewer, but rather than comparing actual and surrogate data based on a parametric t-test, we compare these based on a non-parametric percentile-based approach. This has the advantage of not making specific (and possibly not-warranted) assumptions about the distribution of the data. We have revised the methods to clarify this, l. 332ff. 
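
      The percentile-based comparison described above can be sketched in a few lines of Python with NumPy. This is a simplified illustration under stated assumptions, not the authors' pipeline: the surrogate scheme (permuting performance values across target delays), the simulated 4 Hz modulation, and all names and parameter values are hypothetical.

```python
import numpy as np


def amplitude_at(x, fs, freq):
    """Spectral amplitude of a mean-removed time course at the bin nearest freq."""
    x = np.asarray(x, dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return spectrum[np.argmin(np.abs(freqs - freq))]


def percentile_p_value(observed, surrogates):
    """Fraction of the null distribution at or above the observed value.

    The +1 terms count the observed value as one member of the null set,
    so the p-value can never be exactly zero.
    """
    surrogates = np.asarray(surrogates, dtype=float)
    return (1 + np.sum(surrogates >= observed)) / (1 + surrogates.size)


rng = np.random.default_rng(0)
fs = 50.0                     # hypothetical sampling of target delays (Hz)
t = np.arange(50) / fs        # 1 s of delays in 20 ms steps

# Simulated accuracy-by-delay curve with a 4 Hz modulation plus noise.
accuracy = 0.75 + 0.05 * np.sin(2 * np.pi * 4.0 * t) + 0.02 * rng.standard_normal(t.size)

observed = amplitude_at(accuracy, fs, 4.0)

# Surrogates: destroy the temporal structure by permuting values across delays.
surrogates = [amplitude_at(rng.permutation(accuracy), fs, 4.0) for _ in range(1000)]

p_value = percentile_p_value(observed, surrogates)
threshold = np.percentile(surrogates, 95)   # one-sided 95th-percentile criterion

print(f"observed amplitude exceeds 95th percentile: {observed > threshold}")
print(f"percentile-based p-value: {p_value:.4f}")
```

      Because the observed statistic is ranked against the empirical surrogate distribution, no assumption about the shape of that distribution is needed, which is the advantage the authors cite over a parametric t-test.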

      The authors used an adaptive procedure during the experimental blocks such that the stimulus intensity was adjusted throughout. In practice, this can be a disadvantage relative to keeping the intensity constant throughout, because, on average, correct trials will be associated with a higher intensity than incorrect trials, potentially making observations of perceptual rhythmicity more challenging. The authors would want to discuss this potential issue. Intensity adjustments could perhaps contribute to the observed rhythmicity effects. Perhaps the rhythmicity of the stimulus intensity could be analyzed as well. In any case, the adaptive procedure may add variance to the data.

      We have added an analysis of task difficulty to the results (new section “Effects of adaptive task difficulty”) to address this. Overall, we do not find systematic changes in task difficulty across participants for most of the experiments, but one cannot rule out that this aspect of the design also affects the outcomes. Importantly, we relied on an adaptive task difficulty to actually (or hopefully) reduce variance in the data, by keeping the task difficulty around a certain level. Given the large number of trials collected, not using such an adaptive procedure may result in performance levels around chance or near ceiling, which would make it impossible to detect rhythmic variations in behavior.

      Additional methodological concerns relate to Figure 8. Figures 8A and C seem to indicate that a baseline correction for a very short time window was calculated (I could not find anything about this in the methods section). The data seem very variable and artificially constrained in the baseline time window. It was unclear what the reader might take from Figure 8.

      This figure was intended mostly for illustration of the eye tracking data, but we agree that there is no specific key insight to be taken from this. We removed this. 

      Motivation and discussion of eye-movement/pupillometry and motor activity: The dual task paradigm of Experiment 4 and the reasons for assessing eye metrics in the current study could have been better motivated. The experiment somehow does not fit in very well. There is recent evidence that eye movements decrease during effortful tasks (e.g., Contadini-Wright et al. 2023 J Neurosci; Herrmann & Ryan 2024 J Cog Neurosci), which appears to contradict the results presented in the current study. Moreover, by appealing to active sensing frameworks, the authors suggest that active movements can facilitate listening outcomes (line 677; they should provide a reference for this claim), but it is unclear how this would relate to eye movements. Certainly, a person may move their head closer to a sound source in the presence of competing sound to increase the signal-to-noise ratio, but this is not really the active movements that are measured here. A more detailed discussion may be important. The authors further frame the difference between Experiments 1 and 2 as being related to participants' motor activity. However, there are other factors that could explain differences between experiments. Self-paced trials give participants the opportunity to rest more (inter-trial durations were likely longer in Experiment 2), perhaps affecting attentional engagement. I think a more nuanced discussion may be warranted.

      We expanded the motivation of why self-pacing trials may effectively alter how rhythmic processes affect perception, and now also allude to attention and expectation related effects (l. 786ff). Regarding eye movements we now discuss the results in the light of the previously mentioned studies, but again refrain from a very detailed and mechanistic interpretation (l. 782).

      Discussion:

      The main data in Figure 3 showed little rhythmicity. The authors seem to gloss over this fact by simply stating that the same phase is not necessary for their statistical analysis. Previous work, however, showed rhythmicity in the across-participant average (e.g., Fiebelkorn's and similar work). Moreover, one would expect that some of the effects in the low-frequency band (e.g., 2-4 Hz) are somewhat similar across participants. Conduction delays in the auditory system are much smaller than the 0.25-0.5 s associated with 2-4 Hz. The authors would want to discuss why different participants would express so vastly different phases that the across-participant average does not show any rhythmicity, and what this would mean neurophysiologically.

      We now discuss the assumptions and implications of similar or distinct phases of rhythmic processes within and between participants (l. 695ff). In particular, we note that different origins of the underlying neurophysiological processes may eventually suggest that such assumptions are or are not warranted.

      An additional point that may require more nuanced discussion is related to the rhythmicity of response bias versus sensitivity. The authors could discuss what the rhythmicity of these different measures in different frequency bands means, with respect to underlying neural oscillations.

      We expanded the discussion to interpret what rhythmic changes in each of the behavioral metrics could imply (l. 706ff).

      Figures:

      Much of the text in the figures seems really small. Perhaps the authors would want to ensure it is readable even for those with low vision abilities. Moreover, Figure 1A is not as intuitive as it could be and may perhaps be made clearer. I also suggest the authors discuss a bit more the potential monaural vs binaural issues, because the perceptual rhythmicity is much slower than any conduction delays in the auditory system that could lead to interference.

      We tried to improve the font sizes where possible, and discuss the potential monaural origins as suggested by other reviewers. 

      Reviewer #3 (Public review):

      Summary:

      The finding of rhythmic activity in the brain has, for a long time, engendered the theory of rhythmic modes of perception, that humans might oscillate between improved and worse perception depending on states of our internal systems. However, experiments looking for such modes have resulted in conflicting findings, particularly in those where the stimulus itself is not rhythmic. This paper seeks to take a comprehensive look at the effect and various experimental parameters which might generate these competing findings: in particular, the presentation of the stimulus to one ear or the other, the relevance of motor involvement, attentional demands, and memory, each of which is revealed to affect the consistency of this rhythmicity.

      The need the paper attempts to resolve is a critical one for the field. However, as presented, I remain unconvinced that the data would not be better interpreted as showing no consistent rhythmic mode effect. It lacks a conceptual framework to understand why effects might be consistent in each ear but at different frequencies and only for some tasks with slight variants, some affecting sensitivity and some affecting bias.

      Strengths:

      The paper is strong in its experimental protocol and its comprehensive analysis, which seeks to compare effects across several analysis types and slight experiment changes to investigate which parameters could affect the presence or absence of an effect of rhythmicity. The prescribed nature of its hypotheses and its manner of setting out to test them is very clear, which allows for a straightforward assessment of its results.

      Weaknesses:

      There is a weakness throughout the paper in terms of establishing a conceptual framework both for the source of "rhythmic modes" and for the interpretation of the results. Before understanding the data on this matter, it would be useful to discuss why one would posit such a theory to begin with. From a perceptual side, rhythmic modes of processing in the absence of rhythmic stimuli would not appear to provide any benefit to processing. From a biological or homeostatic argument, it's unclear why we would expect such fluctuations to occur in such a narrow-band way when neither the stimulus nor the neurobiological circuits require it.

      We believe that the framework for why there may be rhythmic activity along auditory pathways that shapes behavioral outcomes has been laid out in many previous studies, prominently here (Schroeder et al., 2008; Schroeder and Lakatos, 2009; Obleser and Kayser, 2019). Many of the relevant studies are cited in the introduction, which is already rather long given the many points covered in this study. 

      Secondly, for the analysis to detect a "rhythmic mode", it must assume that the phase of fluctuations across an experiment (i.e., whether fluctuations are in an up-state or down-state at onset) is constant at stimulus onset, whereas most oscillations do not have such a total phase-reset as a result of input. Therefore, some theoretical positing of what kind of mechanism could generate this fluctuation is critical toward understanding whether the analysis is well-suited to the studied mechanism.

      In line with this and previous comments (by reviewer 2) we have expanded the discussion to consider the issue of phase alignment (l. 695ff). 

      Thirdly, an interpretation of why we should expect left and right ears to have distinct frequency ranges of fluctuations is required. There are a large number of statistical tests in this paper, and it's not clear how multiple comparisons are controlled for, apart from experiment 4 (which specifies B&H false discovery rate). As such, one critical method to identify whether the results are not the result of noise or sample-specific biases is the plausibility of the finding. On its face, maintaining distinct frequencies of perception in each ear does not fit an obvious conceptual framework.

      Again this point was also noted by another reviewer and we expanded the introduction and discussion in this regard (l. 65ff).
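
For readers unfamiliar with the Benjamini-Hochberg (B&H) control of the false discovery rate that the reviewer mentions for Experiment 4, the standard step-up procedure can be sketched as follows. This is the textbook procedure, not necessarily the exact implementation used in the study; the function name `benjamini_hochberg` and the level `q` are illustrative.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    ranked = p[order]
    thresholds = q * np.arange(1, m + 1) / m   # i/m * q for rank i = 1..m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: find the largest rank meeting its threshold and
        # reject every hypothesis with a p-value at or below that rank.
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject
```

Applied to the p-values of all frequencies, metrics, and analysis approaches within one experiment, such a correction would address the multiple-comparison concern directly, at the cost of reduced sensitivity.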

      Reviewer #1 (Recommendations for the authors):

      (1) An update of the AR-surrogate method has recently been published (https://doi.org/10.1101/2024.08.22.609278). I appreciate that this is a lot of work, and it is of course up to the authors, but given the higher sensitivity of this method, it might be worth applying it to the four datasets described here.

      Reading this article, we note that our implementation of the AR-surrogate method was essentially as suggested here, and not as implemented by Brookshire. In fact, we had not realized that Brookshire had apparently computed the spectrum based on the group-average data. As explained in the Methods section, and now clarified further, we compute for each participant the actual spectrum of this participant’s data, and a set of surrogate spectra. We then perform a group-average of both to compute the p-value of the actual group-average based on the percentile of the distribution of surrogate averages. This second step differs from Harris & Beale, who used a one-sided t-test. The latter is most likely not appropriate in a strict statistical sense, but possibly more powerful for detecting true results compared to the percentile-based approach that we used (see l. 332ff).
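
The group-level percentile test described in this response can be sketched as follows. This is a minimal illustration, assuming that the per-participant spectra and surrogate spectra have already been computed; the function name `percentile_p_value`, the array shapes, and the pairing of surrogate indices across participants are assumptions made for the example and are not taken from the manuscript.

```python
import numpy as np

def percentile_p_value(actual, surrogate):
    """Percentile-based p-value of the group-average spectrum.

    actual    : (n_participants, n_freqs) spectra of the real data
    surrogate : (n_participants, n_surrogates, n_freqs) surrogate spectra
    Returns one p-value per frequency.
    """
    actual = np.asarray(actual, dtype=float)
    surrogate = np.asarray(surrogate, dtype=float)
    group_actual = actual.mean(axis=0)       # (n_freqs,)
    group_surr = surrogate.mean(axis=0)      # (n_surrogates, n_freqs)
    n_surr = group_surr.shape[0]
    # Count surrogate group averages at least as large as the actual one.
    exceed = (group_surr >= group_actual).sum(axis=0)
    # The +1 terms keep the p-value strictly positive for a finite null set.
    return (exceed + 1) / (n_surr + 1)
```

Unlike a one-sided t-test across participants, this comparison uses only the rank of the actual group average within the surrogate distribution, so it makes no assumption about the shape of that distribution.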

      (2) When results for the four experiments are reported, a reminder for the reader of how these experiments differ from each other would be useful.

      We have added this in the Results section.

      "considerable prevalence of differences around 4Hz, with dual‐task requirements leading to stronger rhythmicity in perceptual sensitivity". There is a striking similarity to recently published data (https://doi.org/10.1101/2024.08.10.607439 ) demonstrating a 4-Hz rhythm in auditory divided attention (rather than between modalities as in the present case). This could be a useful addition to the paragraph.

      We have added a reference to this preprint, and additional previous work pointing in the same direction mentioned in there.  

      (3) There are two typos in the Introduction: "related by different from the question", and below, there is one "presented" too much.

      These have been fixed.

      Reviewer #3 (Recommendations for the authors):

      My major suggestion is that these results must be replicated in a new sample. I understand this is not simple to do and not always possible, but at this point, no effect is replicated from one experiment to the next, despite very small changes in protocol (especially experiment 1 vs 2). It's therefore very difficult to justify explaining the different effects as real as opposed to random effects of this particular sample. While the bootstrapping effects show the level of consistency of the effect within the sample studied, it can not be a substitute for a true replication of the results in a new sample.

      We agree that only an independent replication can demonstrate the robustness of the results. We do consider experiment 1 a replication test of Ho et al. CurrBiol 2017, which yields different results than reported there. But more importantly, we consider the analysis of ‘reproducibility’ by simulating participant samples a key novelty of the present work, and want to emphasize this over the within-study replication of the same experiment. In fact, in light of the present interpretation of the data, even a within-study replication would most likely not offer a clear-cut answer.

      As I said in the public review, the interpretation of the results, and of why perceptual cycles in arhythmic stimuli could be a plausible theory to begin with, is lacking. A conceptual framework would vastly improve the impact and understanding of the results.

      We tried to strengthen the conceptual framework in the introduction. We believe that this is in large part provided by previous work, and the aim of the present study was to explore the robustness of effects and not to suggest and discover novel effects.

      Minor comments:

      (1) The authors adapt the difficulty as a function of performance, which seems to me a strange choice for an experiment that is analyzing the differences in performance across the experiment. Could you add a sentence to discuss the motivation for this choice?

      We now mention the rationale in the Methods section and in a new section of the Results. There we also provide additional analyses on this parameter.

      (2) The choice to plot the p-values as opposed to the values of the actual analysis feels ill-advised to me. It invites comparison across analyses that isn't necessarily fair. It would be more informative to plot the respective analysis outputs (spectral power, regression, or delta R2) and highlight the windows of significance and their overlap across analyses. In my opinion, this would be more fair and accurate depiction of the analyses as they are meant to be used.

      We do disagree. As explained in the Methods (l. 374ff): “(Showing p-values) … allows presenting the results on a scale that can be directly compared between analysis approaches, metrics, frequencies and analyses focusing on individual ears or the combined data. Each approach has a different statistical sensitivity, and the underlying effect sizes (e.g. spectral power) vary with frequency for both the actual data and null distribution. As a result, the effect size reaching statistical significance varies with frequency, metrics and analyses.” 

      The fact that the level of power (or R2 or whatever metric we consider) required to reach significance differs between analyses (one ear, both ears), metrics (d-prime, bias, RT) and between analysis approaches makes showing the results difficult, as we would need a separate panel for each of those. This would multiply the number of panels required e.g. for Figure 4 by 3, making it a figure with 81 axes. Also, neither the original quantities of each analysis (e.g. spectral power) nor the p-values that we show constitute a proper measure of effect size in a statistical sense. In that sense, neither of these is truly ideal for comparing between analyses, metrics etc.

      We do agree, though, that many readers may want to see the original quantification and thresholds for statistical significance. We now show these in an exemplary manner for the Binned analysis of Experiment 1, which provides a positive result and also is an attempt to replicate the findings by Ho et al 2017. This is shown in new Figure 5.

      (3) Typo in line 555 (+ should be plus minus).

      (4) Typo in line 572: "Comparison of 572 blocks with minus dual task those without"

      (5) Typo in line 616: remove "one".

      (6) Line 666 refers to effects in alpha band activity, but it's unclear what the relationship is to the authors' findings, which peak around 6 Hz, lower than alpha (~10 Hz).

      (7) Line 688 typo, remove "amount of".

      These points have been addressed.  

      (8) Oculomotor effect that drives greater rhythmicity at 3-4 Hz. Did the authors analyze the eye movements to see if saccades were also occurring at this rate? It would be useful to know if the 3-4 Hz effect is driven by "internal circuitry" in the auditory system or by the typical rate of eye movement.

      A preliminary analysis of eye movement data was in previous Figure 8, which was removed on the recommendation of another reviewer. This showed that the average saccade rate is about 0.01 saccades per trial per time bin, amounting to on average less than one detected saccade per trial. Hence rhythmicity in saccades is unlikely to explain rhythmicity in behavioral data at the scale of 3-4 Hz. We now note this in the Results.

      Obleser J, Kayser C (2019) Neural Entrainment and Attentional Selection in the Listening Brain. Trends Cogn Sci 23:913-926.

      Schroeder CE, Lakatos P (2009) Low-frequency neuronal oscillations as instruments of sensory selection. Trends Neurosci 32:9-18.

      Schroeder CE, Lakatos P, Kajikawa Y, Partan S, Puce A (2008) Neuronal oscillations and visual amplification of speech. Trends Cogn Sci 12:106-113.

      Zoefel B, Heil P (2013) Detection of Near-Threshold Sounds is Independent of EEG Phase in Common Frequency Bands. Front Psychol 4:262.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This is an interesting study characterizing and engineering so-called bathy phytochromes, i.e., those that respond to near infrared (NIR) light in the ground state, for optogenetic control of bacterial gene expression. Previously, the authors have developed a structure-guided approach to functionally link several light-responsive protein domains to the signaling domain of the histidine kinase FixL, which ultimately controls gene expression. Here, the authors use the same strategy to link bathy phytochrome light-responsive domains to FixL, resulting in sensors of NIR light. Interestingly, they also link these bathy phytochrome light-sensing domains to signaling domains from the tetrathionate-sensing SHK TtrS and the toluene-sensing SHK TodS, demonstrating the generality of their protein engineering approach more broadly across bacterial two-component systems.

      This is an exciting result that should inspire future bacterial sensor design. They go on to leverage this result to develop what is, to my knowledge, the first system for orthogonally controlling the expression of two separate genes in the same cell with NIR and Red light, a valuable contribution to the field.

      Finally, the authors reveal new details of the pH-dependent photocycle of bathy phytochromes and demonstrate that their sensors work in the gut- and plant-relevant strains E. coli Nissle 1917 and A. tumefaciens.

      Strengths:

      (1) The experiments are well-founded, well-executed, and rigorous.

      (2) The manuscript is clearly written.

      (3) The sensors developed exhibit large responses to light, making them valuable tools for optogenetic applications.

      (4) This study is a valuable contribution to photobiology and optogenetics.

      We thank the reviewer for the positive verdict on our manuscript.

      Weaknesses:

      (1) As the authors note, the sensors are relatively insensitive to NIR light due to the rapid dark reversion process in bathy phytochromes. Though NIR light is generally non-phototoxic, one would expect this characteristic to be a limitation in some downstream applications where light intensities are not high (e.g., in vivo).

      We principally concur with this reviewer’s assessment that delivery of light (of any color) into living tissue can be severely limited by absorption, reflection, and scattering. That notwithstanding, at least two considerations suggest that in-vivo deployment of the pNIRusk setups we presently advance may be feasible.

      First, while the pNIRusk setups are indeed less light-sensitive compared to, e.g., our earlier red-light-responsive pREDusk and pDERusk setups (see Meier et al. Nat Commun 2024), we note that the overall light intensities required for triggering them are in the range of tens of µW per cm². By contrast, optogenetic experiments in vivo, in particular in the neurosciences, often employ light area intensities on the order of mW per cm² and above. Put another way, compared to the optogenetic tools used in these experiments, the pNIRusk setups are actually quite sensitive to light.

      Second, sensitivity to NIR light brings the advantage of superior tissue penetration, see data reported by Weissleder Nat Biotech 2001 and Ash et al. Lasers Med Sci 2017 (both papers are cited in our manuscript). Based on these data, the intensity of blue light (450 nm) falls off 5-10 times more strongly with penetration depth than that of NIR light (800 nm).

      We have added a brief treatment of these aspects in the Discussion section.
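
The two considerations above can be combined in a rough back-of-the-envelope sketch. It assumes simple exponential (Beer-Lambert-type) attenuation and interprets "5-10 times more strongly" as a 5- to 10-fold larger attenuation coefficient for blue light; both are simplifying assumptions, and depth is expressed in arbitrary units of the NIR attenuation length rather than in measured tissue values. The function names are hypothetical.

```python
import math

def transmitted_fraction(depth, mu):
    """Fraction of incident intensity remaining at a given depth,
    assuming exponential attenuation with coefficient mu."""
    return math.exp(-mu * depth)

def depth_for_fraction(fraction, mu):
    """Depth at which the intensity has dropped to `fraction` of its
    incident value under the same exponential model."""
    return math.log(1.0 / fraction) / mu

# Illustrative scenario: about 1 mW/cm^2 is applied and tens of uW/cm^2
# suffice for triggering, i.e. roughly 1% of the light must remain.
MU_NIR = 1.0  # arbitrary units (assumption, not a measured coefficient)
for ratio in (5, 10):
    d_nir = depth_for_fraction(0.01, MU_NIR)
    d_blue = depth_for_fraction(0.01, ratio * MU_NIR)
    print(f"ratio {ratio}: NIR reaches {d_nir / d_blue:.0f}x deeper than blue")
```

Under this model the usable depth scales inversely with the attenuation coefficient, so a 5- to 10-fold weaker attenuation translates into a 5- to 10-fold greater working depth for NIR light at equal sensitivity.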

      (2) Though they can be multiplexed with Red light sensors, these bathy phytochrome NIR sensors are more difficult to multiplex with other commonly used light sensors (e.g., blue) due to the broad light responsivity of the Pfr state. This challenge may be overcome by careful dosing of blue light, as the authors discuss, but other bacterial NIR sensing systems with less cross-talk may be preferred in some applications.

      The reviewer is correct in noting that, at least to a certain extent, the pNIRusk systems also respond to blue light owing to their Soret absorbance bands (see Fig. 1). That said, we note two points:

      First, a given photoreceptor that preferentially responds to certain wavelengths, e.g., 700 nm in the case of conventional bacterial phytochromes (BphP), generally absorbs shorter wavelengths to some degree as well. Absorption of these shorter wavelengths suffices for driving electronic and/or vibronic transitions of the chromophore to higher energy levels which often give rise to productive photochemistry and downstream signal transduction. Put another way, a certain response of sensory photoreceptors to shorter wavelengths is hence fully expected and indeed experimentally borne out, as for instance shown by Ochoa-Fernandez et al. in the so-called PULSE setup (Nat Meth 2020, doi: 10.1038/s41592-020-0868-y).

      Second, known BphPs share similar Pr and Pfr absorbance spectra. We therefore expect other BphP-based optogenetic setups to also respond to blue light to some degree. Currently, there are insufficient data to gauge whether individual BphPs systematically differ in their relative sensitivity to blue compared to red or NIR light. Arguably, pertinent experiments may be an interesting subject for future study.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Meier et al. engineer a new class of light-regulated two-component systems. These systems are built using bathy-bacteriophytochromes that respond to near-infrared (NIR) light. Through a combination of genetic engineering and systematic linker optimization, the authors generate bacterial strains capable of selective and tunable gene expression in response to NIR stimulation. Overall, these results are an interesting expansion of the optogenetic toolkit into the NIR range. The cross-species functionality of the system, modularity, and orthogonality have the potential to make these tools useful for a range of applications.

      Strengths:

      (1) The authors introduce a novel class of near-infrared light-responsive two-component systems in bacteria, expanding the optogenetic toolbox into this spectral range.

      (2) Through engineering and linker optimization, the authors achieve specific and tunable gene expression, with minimal cross-activation from red light in some cases.

      (3) The authors show that the engineered systems function robustly in multiple bacterial strains, including laboratory E. coli, the probiotic E. coli Nissle 1917, and Agrobacterium tumefaciens.

      (4) The combination of orthogonal two-component systems can allow for simultaneous and independent control of multiple gene expression pathways using different wavelengths of light.

      (5) The authors explore the photophysical properties of the photosensors, investigating how environmental factors such as pH influence light sensitivity.

      Weaknesses:

      (1) The expression of multi-gene operons and fluorescent reporters could impose a metabolic burden. The authors should present data comparing optical density for growth curves of engineered strains versus the corresponding empty-vector control to provide insight into the burden and overall impact of the system on host viability and growth.

      In response to this comment, we have recorded growth kinetics of bacteria harboring the pNIRusk-DsRed plasmids or empty vectors under both inducing (i.e., under NIR light) and noninducing conditions (i.e., darkness). We did not observe systematic differences in the growth kinetics between the different cultures, thus suggesting that under the conditions tested there is no adverse effect on cell viability.

      We include the new data in Suppl. Fig. 5c-d and refer to them in the main text.

      (2) The manuscript consistently presents normalized fluorescence values, but the method of normalization is not clear (Figure 2 caption describes normalizing to the maximal fluorescence, but the maximum fluorescence of what?). The authors should provide a more detailed explanation of how the raw fluorescence data were processed. In addition, or potentially in exchange for the current presentation, the authors should include the raw fluorescence values in supplementary materials to help readers assess the actual magnitude of the reported responses.

      We appreciate this valid comment and have altered the representation of the fluorescence data. All values for a given fluorescent protein (i.e., either DsRed or YPet) across all systems are now normalized to a single reference value, thus enabling direct comparison between experiments.
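
The revised normalization scheme can be illustrated with a short sketch: every reading of a given reporter is divided by one reference value for that reporter, irrespective of the system it was measured in. The function name, the tuple layout, and all numbers below are hypothetical and serve only to show the principle.

```python
def normalize_to_reference(measurements, references):
    """Normalize raw fluorescence per reporter to a single reference value.

    measurements : list of (system, reporter, raw_value) tuples
    references   : dict mapping reporter name to its reference value
    """
    normalized = []
    for system, reporter, raw in measurements:
        ref = references[reporter]
        if ref <= 0:
            raise ValueError(f"reference for {reporter} must be positive")
        normalized.append((system, reporter, raw / ref))
    return normalized

# Hypothetical raw readings in arbitrary units.
raw = [("system A", "DsRed", 500.0),
       ("system B", "DsRed", 1000.0),
       ("system B", "YPet", 300.0)]
refs = {"DsRed": 1000.0, "YPet": 600.0}
for row in normalize_to_reference(raw, refs):
    print(row)
```

Because the reference is shared across systems, normalized values of the same reporter can be compared directly between figure panels, which is the point raised in the following comment.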

      (3) Related to the prior point, it would be useful to have a positive control for fluorescence that could be used to compare results across different figure panels.

      As all data are now normalized to the same reference value, direct comparison across all figures is enabled.

      (4) Real-time gene expression data are not presented in the current manuscript, but it would be helpful to include a time-course for some of the key designs to help readers assess the speed of response to NIR light.

      In response to this comment, we include in the revised manuscript induction kinetics of bacterial cultures bearing pNIRusk upon transfer to inducing NIR-light conditions. To this end, aliquots were taken at discrete timepoints, transcriptionally and translationally arrested, and analyzed for optical density and DsRed reporter fluorescence after allowing for chromophore maturation.

      We include the new data in Suppl. Fig. 5e and refer to them in the manuscript.

      Moreover, we note that the experiments in Agrobacterium tumefaciens used a luciferase reporter thus enabling the continuous monitoring of the light-induced expression kinetics. These data (unchanged in revision) are to be found in Suppl. Fig. 9.

      Reviewer #3 (Public review):

      Summary:

      This paper by Meier et al introduces a new optogenetic module for the regulation of bacterial gene expression based on "bathy-BphP" proteins. Their paper begins with a careful characterization of kinetics and pH dependence of a few family members, followed by extensive engineering to produce infrared-regulated transcriptional systems based on the authors' previous design of the pDusk and pDERusk systems, and closing with characterization of the systems in bacterial species relevant for biotechnology.

      Strengths:

      The paper is important from the perspective of fundamental protein characterization, since bathy-BphPs are relatively poorly characterized compared to their phytochrome and cyanobacteriochrome cousins. It is also important from a technology development perspective: the optogenetic toolbox currently lacks infrared-stimulated transcriptional systems. Infrared light offers two major advantages: it can be multiplexed with additional tools, and it can penetrate into deep tissues with ease relative to the more widely used blue light-activated systems. The experiments are performed carefully, and the manuscript is well written.

      Weaknesses:

      My major criticism is that some information is difficult to obtain, and some data is presented with limited interpretation, making it difficult to obtain intuition for why certain responses are observed. For example, the changes in red/infrared responses across different figures and cellular contexts are reported but not rationalized. Extensive experiments with variable linker sequences were performed, but the rationale for linker choices was not clearly explained. These are minor weaknesses in an overall very strong paper.

      We are grateful for the positive take on our manuscript.

      Reviewer #1 (Recommendations for the authors):

      (1) As eLife is a broad audience journal, please define the Soret and Q-bands (line 125).

      We concur and have added labels in fig. 1a that designate the Soret and Q bands.

      (2) The initial (0) Ac design in Figure 2b is activated by NIR and Red light, albeit modestly. The authors state that this construct shows "constant reporter fluorescence, largely independent of illumination" (line 167). This language should be changed to reflect the fact that this Ac construct responds to both of these wavelengths.

      Agreed. We have amended the text accordingly.

      (3) pNIRusk Ac 0 appears to show a greater light response than pNIRusk Av -5. However, the authors claim that the former is not light-responsive and the latter is. This conclusion should be explained or changed.

      The assignment of pNIRusk Av-5 as light-responsive is based on the relative difference in reporter fluorescence between darkness and illumination with either red or NIR light. Although the overall fluorescence is much lower in Av-5 than for Av-0, the relative change upon illumination is much more pronounced. We add a statement to this effect to the text.

      (4) The authors state that "when combining DmDERusk-Str-YPet with AvTod+21-DsRed expression rose under red and NIR light, respectively, whereas the joint application of both light colors induced both reporter genes" (lines 258-261). In contrast, Figure 3c shows that application of both wavelengths of light results in exclusive activation of YPet expression. It appears the description of the data is wrong and must be corrected. That said, this error does not impact their conclusion that two separate target genes can be independently activated by NIR and red light.

      We thank the reviewer for catching this error which we have corrected in the revised manuscript.

      (5) Line 278: I don't agree with the authors' blanket statement that the use of upconversion nanoparticles is a "grave" limitation for NIR-light mediated activation of bacterial gene expression in vivo. The authors should either expound on the severity of the limitation or use more moderate language.

      We have replaced the word ‘grave’ by ‘potential’ and thereby toned down our wording.

      Reviewer #2 (Recommendations for the authors):

      (1) Please include a discussion on the expected depth penetration of different light wavelengths. This is most relevant in the context of the discussion about how these NIR systems could be used with living therapeutics.

      Given the heterogeneity of biological tissue, it is challenging to state precise penetration depths for different wavelengths of light. That said, blue light for instance is typically attenuated by biological tissue around 5 to 10 times as strongly as near-infrared light is.

      We have expanded the Discussion chapter to cover these aspects.
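
      As a rough illustration of what a 5- to 10-fold difference in attenuation would mean for penetration depth, the sketch below assumes simple exponential (Beer-Lambert-type) attenuation with a single effective coefficient per wavelength. This is a strong simplification for scattering tissue, and the coefficient used is a hypothetical placeholder rather than a measured value.

```python
import math


def transmitted_fraction(mu_per_mm: float, depth_mm: float) -> float:
    """Fraction of light remaining at a given depth, assuming I(d) = I0 * exp(-mu * d)."""
    return math.exp(-mu_per_mm * depth_mm)


def depth_for_fraction(mu_per_mm: float, fraction: float) -> float:
    """Depth (mm) at which the intensity has dropped to `fraction` of its initial value."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return -math.log(fraction) / mu_per_mm


MU_NIR = 0.1  # hypothetical effective attenuation coefficient for NIR light, per mm

for ratio in (5, 10):  # blue light assumed to be attenuated 5x or 10x as strongly
    mu_blue = ratio * MU_NIR
    d_nir = depth_for_fraction(MU_NIR, 0.1)
    d_blue = depth_for_fraction(mu_blue, 0.1)
    print(f"ratio {ratio}: 10% depth NIR = {d_nir:.1f} mm, blue = {d_blue:.1f} mm")
```

      Under this assumption, the depth at which a given fraction of light remains scales inversely with the attenuation coefficient, so a 5- to 10-fold stronger attenuation corresponds to a 5- to 10-fold shallower penetration.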

      (2) It would be helpful for Figure 2C (or supplementary) to also include the response to blue light stimulation.

      We agree and have acquired pertinent data for the blue-light response. The new data are included in an updated Fig. 2c. Data acquired at varying NIR-light intensities, originally included in Fig. 2c, have been moved to Suppl. Fig. 5a-b.

      (3) In Figure 4A, data on the response of E. coli Nissle to blue and red light are missing. Including this would help identify whether the reduced sensitivity to non-NIR wavelengths observed in the E. coli lab strain is preserved in the probiotic background.

      In response to this comment, we have acquired pertinent data on E. coli Nissle. While the results were overall similar to those in the laboratory strain, the response to blue and red light was even lower in the Nissle bacteria, which stands to benefit optogenetic applications.

      We have updated Fig. 4a accordingly. For clarity, we only show the data for AvNIRusk in the main paper but have relegated the data on AcNIRusk to Suppl. Fig. 8. (Note that this has necessitated a renumbering of the subsequent Suppl. Figs.)

      (4) On many of the figures, there are thin gray lines that appear between the panels that it would be nice to eliminate because, in some cases, they cut through words and numbers.

      The grey lines likely arose from embedding the figures into the text document. In the typeset manuscript, which has become available on the eLife webpage in the meantime, there are no such lines. That said, we will carefully check throughout the submission/publishing/proofing process lest these lines reappear.

      (5) Page 7, line 155: "As not least seen" typo or awkward phrasing.

      We have restructured the sentence and thereby hopefully clarified the unclear phrasing.

      (6) Page 7, line 167: It does not appear to be the case that the initial pNIRusk designs show constant fluorescence that is largely independent of illumination. AcNIRusk shows an almost twofold change from dark to NIR. Reword this to avoid confusion.

      We concur with this comment, similar to reviewer #1’s remark, and have adjusted the text accordingly.

      (7) Page 8, line 174: Related to the previous point, AvNIRusk has one design that is very minimally light switchable (-5), so stating that six light switchable designs have been identified is also confusing.

      As stated in our response to reviewer #1 above, the assignment of AvNIRusk-5 as light-switchable is based on the relative fluorescence change upon illumination. We have added an explanation to the text.

      (8) Page 10, line 228-229: I was not able to find the data showing that expression levels were higher for the DmTtr systems than the pREDusk and pNIRusk setups. This may be an issue related to the normalization point. It was not clear to me how to compare these values.

      We apologize for the initially unclear representation of the data. In response to this reviewer’s general comments above, we have now normalized all fluorescence values to a single reference value, thus allowing their direct comparison.

      (9) Page 12, line 264: "finer-grained expression control can be exerted..." Either show data or adjust the language so that it is clear this is a prediction.

      True, we have replaced ‘can’ by ‘could’.

      (10) Page 25, line 590: CmpX13 cells have a reference that is given later, but it should be added where it first appears.

      Agreed, we have added the reference in the indicated place.

      (11) Page 25, line 592: define LB/Kan.

      We had already defined this abbreviation further up but, for clarity, we have added it again in the indicated position.

      (12) Page 40, line 946: "normalized by" rather than "to".

      We have implemented the requested change in the indicated and several other positions of the manuscript.

      (13) Figures 2C, 3C, and similar plots in the supplementary material would benefit from having a legend for the colors.

      We agree and have added pertinent legends to the corresponding main and supplementary figures.

      (14) As a reader, I had some trouble following all the acronyms. This is at the authors' discretion, but I would eliminate ones that are not strictly essential (e.g. MTP for microtiter plate; I was unable to identify what "MCS" meant; look for other opportunities to remove acronyms).

      In the revised manuscript, we have defined the abbreviation ‘MCS’ (for ‘multiple-cloning site’) upon first occurrence. We have decided to retain the abbreviation ‘MTP’ in the text.

      (15) Could the authors briefly speculate on why A. tumefaciens activation with red light might occur?

      While we can but speculate as to the underlying reasons for the divergent red-light response in A. tumefaciens, we discuss possible scenarios below.

      Commonly, two-component systems (TCS) exhibit highly cooperative and steep responses to signal. As a consequence, even small differences in the intracellular amounts of phosphorylated and unphosphorylated response regulator (RR) can give rise to significantly changed gene-expression output. Put another way, the gene-expression output need not scale linearly with the extent of RR phosphorylation but, rather, is expected to show nonlinear dependence with pronounced thresholding effects.
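
      The thresholding argument can be made concrete with a generic Hill-type response curve. This is only an illustrative sketch: the Hill form, the half-maximal point, and the cooperativity coefficient are assumptions chosen for demonstration, not parameters fitted to the pNIRusk systems.

```python
def hill_output(rr_p_fraction: float,
                k_half: float = 0.5,
                n: float = 4.0,
                max_output: float = 1.0) -> float:
    """Generic cooperative response to the fraction of phosphorylated RR.

    k_half is the fraction giving half-maximal output and n the Hill
    coefficient; both values are hypothetical.
    """
    if not 0.0 <= rr_p_fraction <= 1.0:
        raise ValueError("rr_p_fraction must lie between 0 and 1")
    x = rr_p_fraction ** n
    return max_output * x / (k_half ** n + x)


# A modest shift in RR phosphorylation around the threshold ...
low, high = 0.4, 0.6
# ... yields a disproportionately larger shift in output.
print("input ratio: ", high / low)
print("output ratio:", hill_output(high) / hill_output(low))
```

      With a Hill coefficient above 1, the ratio of outputs exceeds the ratio of inputs near the threshold, which is the nonlinear behavior invoked above.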

      Differences in the pertinent RR levels can for instance arise from variations in the expression levels of the pNIRusk system components between E. coli and A. tumefaciens. Moreover, the two bacteria greatly differ in their two-component-system (TCS) repertoire. Although TCSs are commonly well insulated from each other, cross-talk with endogenous TCSs, even if limited, may cause changes in the levels of phosphorylated RR and hence gene-expression output. In a similar vein, the RR can also be phosphorylated and dephosphorylated non-enzymatically, e.g., by reaction with high-energy anhydrides (such as acetyl phosphate) and hydrolysis, respectively. Other potential origins for the divergent red-light response include differences in the strength of the promoters driving expression of the pNIRusk system components and the fluorescent/luminescent reporters, respectively.

      (16) It would be helpful for the authors to briefly explain why they needed to switch to luminescence from fluorescence for the A. tumefaciens studies.

      While there was no strict necessity to switch from the fluorescence-based system used in E. coli to a luminescence-based system in A. tumefaciens, we opted for luminescence based on prior experience with other Alphaproteobacteria (e.g., 10.1128/mSystems.00893-21), where luminescence offered significant advantages. Specifically, it provides essentially background-free signal detection and greater sensitivity for monitoring gene expression. In addition, as demonstrated in Suppl. Fig. 9c and d, the luminescence system enables real-time tracking of gene expression dynamics, which further supported its use in our experimental setup (see our response to reviewer #2’s general comments).

      (17) This is a very minor comment that the authors can take or leave, but I got hung up on the word "implement" when it appeared a few times in the manuscript because I tended to read it as "put a plan into place" rather than its other meaning.

      In the abstract, we have replaced one instance of the word ‘implement’ by ‘instrument’.

      (18) The authors should include the relevant constructs on AddGene or another public strain-sharing service.

      We whole-heartedly subscribe to the idea of freely sharing research materials with fellow scientists. Therefore, we had already deposited the most relevant AvNIRusk in Addgene, even prior to the initial submission of the manuscript (accession number 235084). In the meantime, we have released the deposition, and the plasmid has been available from Addgene since May 15th of this year.

      Reviewer #3 (Recommendations for the authors):

      Suggestion for improvement:

      This paper relies heavily on variations in linker sequences to shift responses. I am familiar with prior work from the Moglich lab in which helical linkers were employed to shift responses in synthetic two-component systems, with interesting periodicity in responses with every 7 residues (as expected for an alpha helix) and inversion of responses at smaller linker shifts. There is no mention in this paper whether their current engineering follows a similar rationale, what types of linkers are employed (e.g. flexible vs helical), and whether there is an interpretation for how linker lengths alter responses. Can you explain what classes of linker sequences are used throughout Figures 2 and 3, and whether length or periodicity affects the outcome? This would be very helpful for readers who are new to this approach, or if the rationale here differs from the authors' prior work.

      The PATCHY approach employed at present followed a closely similar rationale as in our previous studies. That is, linkers were extended/shortened and varied in their sequence by recombining different fragments of the natural linkers of the parental receptors, i.e., the bacteriophytochrome and the FixL sensor histidine kinase, respectively. We have added a statement to this effect in the text and a reference to Suppl. Fig. 3 which illustrates the principal approach.

      Compared to our earlier studies, we isolated fewer receptor variants supporting light-regulated responses, despite covering a larger sequence space. Owing to the sparsity of the light-regulated variants, an interpretation of the linker properties and their correlation with light-regulated activity is challenging. Although doubtless unsatisfying from a mechanistic viewpoint, we therefore refrain from a pertinent discussion which would be premature and speculative at this point. As the reviewer raises a valid and important point, we have expanded the text by referring to our earlier studies and the observed dependence of functional properties on linker composition.

      It is sometimes difficult to intuit or rationalize the differences in red/IR sensitivity across closely related variants. An important example appears in Figure 3C vs 3B. I think the AvTod+21 in 3B should be the equivalent to the DsRed response in the second column of 3C (AvTod+21 + DmDERusk), except, of course, that the bacteria in 3C carry an additional plasmid for the DERusk system. However, in 3B, the response to red light is substantial - ~50% as strong as that for IR, whereas in 3C, red light elicits no response at all. What is the difference? The reason this is important is that the AvTod+21 and DmDERusk represent the best "orthogonal" red and infrared light responses, but this is not at all obvious from 3B, where AvTod+21 still causes a substantial (and for orthogonality, undesirable) response under red light. Perhaps subtle differences in expression level due to plasmid changes cause these differences in light responses? Could the authors test how the expression level affects these responses? The paper would be greatly improved if observations of the diverse red/IR responses could be rationalized by some design criteria.

      As noted above in our response to reviewer #2, we have now normalized all fluorescence readings to joint reference values, thus allowing a better comparison across experiments.

      The reviewer is correct in noting that upon multiplexing, the individual plasmid systems support lower fluorescence levels than when used in isolation. We speculate that the combination of two plasmids may affect their copy numbers (despite the use of different resistance markers and origins of replication) and hence their performance. Likewise, the cellular metabolism may be affected when multiple plasmids are combined. These aspects may well account for the absent red-light response in AvTod+21 in the multiplexing experiments, which is indeed unexpected. As, at present, we cannot provide a clear rationalization for this effect, we recommend verifying the performance of the plasmid setups when multiplexing.

      The paper uses "red" and "infrared" to refer to ~624 nm and ~800 nm light, respectively. I wonder whether it might be possible to shift these peak wavelengths to obtain even better separation for the multiplexing experiments. Perhaps shifting the specific red wavelength could result in better separation between DERusk and AvTod systems, for example? Could the authors comment on this (maybe based on action spectra of their previously developed tools) or perhaps test a few additional stimulation wavelengths?

      The choice of illumination wavelengths used in these experiments is dictated by the LED setups available for illumination of microtiter plates. On the one hand, we are using an SMD (surface-mount device) three-color LED with a fixed wavelength of the red channel around 624 nm (see Hennemann et al., 2018). On the other hand, we are deploying a custom-built device with LEDs emitting at around 800 nm (see Stüven et al., 2019 and this work). Adjusting these wavelengths is therefore challenging, although without doubt potentially interesting.

      To address this reviewer comment, we have added a statement to the text that the excitation wavelengths may be varied to improve multiplexed applications.

      Additional minor comments:

      (1) Figure 2C: It would be very helpful to place a legend on the figure panel for what the colors indicate, since they are unique to this panel and non-intuitive.

      This comment coincides with one by reviewer #2, and we have added pertinent legends to this and related supplementary figures.

      (2) Figure 3C: it is not obvious which system uses DsRed and which uses YPet in each combination, since the text indicates that all combinations were cloned, and this is not clearly described in the legend. Is it always the first construct in the figure legend listed for DsRed and the second for YPet?

      For clarification, we have revised the x-axis labels in Fig. 3C. (And yes, it is as this reviewer surmises: the first of the two constructs harbored DsRed and the second one YPet.)

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This is an interesting study of the nature of representations across the visual field. The question of how peripheral vision differs from foveal vision is a fascinating and important one. The majority of our visual field is extra-foveal yet our sensory and perceptual capabilities decline in pronounced and well-documented ways away from the fovea. Part of the decline is thought to be due to spatial averaging (’pooling’) of features. Here, the authors contrast two models of such feature pooling with human judgments of image content. They use much larger visual stimuli than in most previous studies, and some sophisticated image synthesis methods to tease apart the prediction of the distinct models.

      More importantly, in so doing, the researchers thoroughly explore the general approach of probing visual representations through metamers, that is, stimuli that are physically distinct but perceptually indistinguishable. The work is embedded within a rigorous and general mathematical framework for expressing equivalence classes of images and how visual representations influence these. They describe how image-computable models can be used to make predictions about metamers, which can then be compared to make inferences about the underlying sensory representations. The main merit of the work lies in providing a formal framework for reasoning about metamers and their implications, for comparing models of sensory processing in terms of the metamers that they predict, and for mapping such models onto physiology. Importantly, they also consider the limits of what can be inferred about sensory processing from metamers derived from different models.

      Overall, the work is of a very high standard and represents a significant advance over our current understanding of perceptual representations of image structure at different locations across the visual field. The authors do a good job of capturing the limits of their approach and I particularly appreciated the detailed and thoughtful Discussion section and the suggestion to extend the metamer-based approach described in the MS with observer models. The work will have an impact on researchers studying many different aspects of visual function including texture perception, crowding, natural image statistics, and the physiology of low- and mid-level vision.

      The main weaknesses of the original submission relate to the writing. A clearer motivation could have been provided for the specific models that they consider, and the text could have been written in a more didactic and easy-to-follow manner. The authors could also have been more explicit about the assumptions that they make.

      Thank you for the summary. We appreciate the positives noted above. We address the weaknesses point by point below.

      Reviewer #2 (Public Review):

      Summary

      This paper expands on the literature on spatial metamers, evaluating different aspects of spatial metamers including the effect of different models and initialization conditions, as well as the relationship between metamers of the human visual system and metamers for a model. The authors conduct psychophysics experiments testing variations of metamer synthesis parameters including type of target image, scaling factor, and initialization parameters, and also compare two different metamer models (luminance vs energy). An additional contribution is doing this for a field of view larger than has been explored previously.

      General Comments

      Overall, this paper addresses some important outstanding questions regarding comparing original to synthesized images in metamer experiments and begins to explore the effect of noise vs image seed on the resulting syntheses. While the paper tests some model classes that could be better motivated, and the results are not particularly groundbreaking, the contributions are convincing and undoubtedly important to the field. The paper includes an interesting Voronoi-like schematic of how to think about perceptual metamers, which I found helpful, but for which I do have some questions and suggestions. I also have some major concerns regarding incomplete psychophysical methodology including lack of eye-tracking, results inferred from a single subject, and a huge number of trials. I have only minor typographical criticisms and suggestions to improve clarity. The authors also use very good data reproducibility practices.

      Thank you for the summary. We appreciate the positives noted above. We address the weaknesses point by point below.

      Specific Comments

      Experimental Setup

      Firstly, the experiments do not appear to utilize an eye tracker to monitor fixation. Without eye tracking or another manipulation to ensure fixation, we cannot ensure the subjects were fixating the center of the image, and viewing the metamer as intended. While the short stimulus time (200ms) can help minimize eye movements, this does not guarantee that subjects began the trial with correct fixation, especially in such a long experiment. While Covid-19 did at one point limit in-person eye-tracked experiments, the paper reports no such restrictions that would have made the addition of eye-tracking impossible. While such a large-scale experiment may be difficult to repeat with the addition of eye tracking, the paper would be greatly improved with, at a minimum, an explanation as to why eye tracking was not included.

      Addressed on pg. 25, starting on line 658.

      Secondly, many of the comparisons later in the paper (Figures 9,10) are made from a single subject. N=1 is not typically accepted as sufficient to draw conclusions in such a psychophysics experiment. Again, if there were restrictions limiting this it should be discussed. Also (P11), is subject sub-00 an author? Another expert? A naive subject? The subject’s expertise in viewing metamers will likely affect their performance.

      Addressed on pg. 14, starting on line 308.

      Finally, the number of trials per subject is quite large. 13,000 over 9 sessions is much larger than most human experiments in this area. The reason for this should be justified.

      In general, we needed a large number of trials to fit full psychometric functions for stimuli derived for both models, with both types of comparison, both initializations, and over many target images. We could have eliminated some of these, but feel that having a consistent dataset across all these conditions is a strength of the paper.

      In addition to the sentence on pg. 14, line 318, a full enumeration of trials is now described on pg. 23, starting on line 580.

      Model

      For the main experiment, the authors compare the results of two models: a ’luminance model’ that spatially pools mean luminance values, and an ’energy model’ that spatially pools energy calculated from a multi-scale pyramid decomposition. They show that these models create metamers that result in different thresholds for human performance, and therefore different critical scaling parameters, with the basic luminance pooling model producing a scaling factor 1/4 that of the energy model. While this is certain to be true, due to the luminance model being so much simpler, the motivation for the simple luminance-based model as a comparison is unclear.

      The use of simple models is now addressed on pg. 3, starting on line 98, as well as the sentence starting on pg. 4 line 148: the luminance model is intended as the simplest possible pooling model.

      The authors claim that this luminance model captures the response of retinal ganglion cells, often modeled as a center-surround operation (Rodieck, 1964). I am unclear in what aspect(s) the authors claim these center-surround neurons mimic a simple mean luminance, especially in the context of evidence supporting a much more complex role of RGCs in vision (Atick & Redlich, 1992). Why do the authors not compare the energy model to a model that captures center-surround responses instead? Do the authors mean to claim that the luminance model captures only the pooling aspects of an RGC model? This is particularly confusing as Figures 6 and 9 show the luminance and energy models for original vs synth aligning with the scaling of Midget and Parasol RGCs, respectively. These claims should be more clearly stated, and citations included to motivate this. Similarly, with the energy model, the physiological evidence is very loosely connected to the model discussed.

      We have removed the bars showing potential scaling values measured by electrophysiology in the primate visual system and attempted to clarify our language around the relationship between these models and physiology. Our metamer models are only loosely connected to the physiology, and we’ve decided in revision not to imply any direct connection between the model parameters and physiological measurements. The models should instead be understood as loosely inspired by physiology, but not as a tool to localize the representation (as was done in the Freeman paper).

      The physiological scaling values are still used as the mean of the priors on the critical scaling value for model fitting, as described on pg. 27, starting on line 698.

      Prior Work:

      While the explorations in this paper clearly have value, it does not present any particularly groundbreaking results, and those reported are consistent with previous literature. The explorations around critical eccentricity measurement have been done for texture models (Figure 11) in multiple papers (Freeman 2011, Wallis 2019, Balas 2009). In particular, Freeman 2011 demonstrated that simpler models, representing measurements presumed to occur earlier in visual processing, need smaller pooling regions to achieve metamerism. This work’s measurements for the simpler models tested here are consistent with those results, though the model details are different. In addition, Brown 2023 (which is miscited) also used an extended field of view (though not as large as in this work). Both Brown 2023 and Wallis 2019 performed an exploration of the effect of the target image. Also, much of the more recent previous work uses color images, while the authors’ exploration is only done for greyscale.

      We were pleased to find consistency of our results with previous studies, given the (many) differences in stimuli and experimental conditions (especially viewing angle), while also extending to new results with the luminance model, and the effects of initialization. Note that only one of the previous studies (Freeman and Simoncelli, 2011) used a pooled spectral energy model. Moreover, of the previous studies, only one (Brown et al., 2023) used color images (we have corrected that citation - thanks for catching the error).

      Discussion of Prior Work:

      The prior work on testing metamerism between original vs. synthesized and synthesized vs. synthesized images is presented in a misleading way. Wallis et al.’s prior work on this should not be a minor remark in the post-experiment discussion. Rather, it was surely a motivation for the experiment. The text should make this clear; a discussion of Wallis et al. should appear at the start of that section. The authors similarly cite much of the most relevant literature in this area as a minor remark at the end of the introduction (P3L72).

      The large differences we observed between comparison types (original vs synthesized, compared to synthesized vs synthesized) surprised us. Understanding such differences was not a primary motivation for the work, but it is certainly an important component of our results. In the introduction, we thought it best to lay out the basic logic of the metamer paradigm for foveated vision before mentioning the complications that are introduced in both the Wallis and Brown papers (paragraph beginning p. 3, line 109). Our results confirm and bolster the results of both of those earlier works, which are now discussed more fully in the Introduction (lines 109 and following).

      White Noise: The authors make an analogy to the inability of humans to distinguish samples of white noise. It is unclear, however, that human difficulty distinguishing samples of white noise is a perceptual issue. It could instead perhaps be due to cognitive/memory limitations. If one concentrates on an individual patch one can usually tell apart two samples. Support for these difficulties emerging from perceptual limitations should be provided, or the possibility of these limitations being more cognitive should be discussed, or a different analogy employed.

      We now note the possibility of cognitive limits on pg. 8, starting on line 243, as well as pg. 22, line 571. The ability of observers to distinguish samples of white noise is highly dependent on display conditions. A small patch of noise (i.e., large pixels, not too many) can be distinguished, but a larger patch cannot, especially when presented in the periphery. This is more generally true for textures (as shown in Ziemba and Simoncelli (2021)). Samples of white noise at the resolution used in our study are indistinguishable.

      Relatedly, in Figure 14, the authors do not explain why the white noise seeds would be more likely to produce syntheses that end up in different human equivalence classes.

      In figure 14, we claim that white noise seeds are more likely to end up in the same human equivalence classes than natural image seeds. The explanation as to why we think this may be the case is now addressed on pg. 19, starting on line 423.

      It would be nice to see the effect of pink noise seeds, which mirror the power spectrum of natural images, but do not contain the same structure as natural images - this may address the artifacts noted in Figure 9b.

      The lack of pink noise seeds is now addressed on pg. 19, starting on line 429.

      Finally, the authors note high-frequency artifacts in Figure 4 & P5L135, that remain after syntheses from the luminance model. They hypothesize that this is due to a lack of constraints on frequencies above that defined by the pooling region size. Could these be addressed with a white noise image seed that is pre-blurred with a low pass filter removing the frequencies above the spatial frequency constrained at the given eccentricity?

      The explanation for this is similar to the lack of pink noise seeds in the previous point: the goal of metamer synthesis is model testing, and so for a given model, we want to find model metamers that result in the smallest possible critical scaling value. Taking white noise seed images and blurring them will almost certainly remove the high frequencies visible in luminance metamers in figure 4 and thus result in a larger critical scaling value, as the reviewer points out. However, the logic of the experiments requires finding the smallest critical scaling value, and so these model metamers would be uninformative. In an early stage of the project, we did indeed synthesize model metamers using pink noise seeds, and observed that the high frequency artifacts were less prominent.

      Schematic of metamerism: Figures 1,2,12, and 13 show a visual schematic of the state space of images, and their relationship to both model and human metamers. This is depicted as a Voronoi diagram, with individual images near the center of each shape, and other images that fall at different locations within the same cell producing the same human visual system response. I felt this conceptualization was helpful. However, implicitly it seems to make a distinction between metamerism and JND (just noticeable difference). I felt this would be better made explicit. In the case of JND, neighboring points, despite having different visual system responses, might not be distinguishable to a human observer.

      Thanks for noting this – in general, metamers are subthreshold, and for the purpose of the diagram, we had to discretize the space showing metameric regions (Voronoi regions) around a set of stimuli. We’ve rewritten the captions to explain this better. We address the binary subthreshold nature of the metamer paradigm in the discussion section (pg. 19, line 438).

      In these diagrams and throughout the paper, the phrase ’visual stimulus’ rather than ’image’ would improve clarity, because the location of the stimulus in relation to the fovea matters whereas the image can be interpreted as the pixels displayed on the computer.

      We agree and have tried to make this change, describing this choice on pg. 3 line 73.

      Other

      The authors show good reproducibility practices with links to relevant code, datasets, and figures.

      Reviewer #1 (Recommendations For The Authors):

      In its current form, I found the introduction to be too cursory. I felt that the article would benefit from a clearer motivation for the two models that are considered as the reader is left unclear why these particular models are of special scientific significance. The luminance model is intended to capture some aspects of retinal ganglion cell response characteristics and the spectral energy model is intended to capture some aspects of the primary visual cortex. However, one can easily imagine models that include the pooling of other kinds of features, and it would be helpful to get an idea of why these are not considered. Which aspects of processing in the retina and V1 are being considered and which are being left out, and why? Why not consider representations that capture even higher-order statistical structure than those covered by the spectral energy model (or even semantics)? I think a bit of rewriting with this in mind could improve the introduction.

      Along similar lines, I would have appreciated having the logic of the study explained more explicitly and didactically: which overarching research question is being asked, how it is operationalised in the models and experiments, and what are the predictions of the different models. Figures 2 and 3 are certainly helpful, but I felt further explanations would have made it easier for the reader to follow. Throughout, the writing could be improved by a careful re-reading with a view to making it easier to understand. For example, where results are presented, a sentence or two expanding on the implications would be helpful.

      I think the authors could also be more explicit about the assumptions they make. While these are obviously (tacitly) included in the description of the models themselves, it would be helpful to state them more openly. To give one example, when introducing the notion of critical scaling, on p.6 the authors state as if it is a self-evident fact that "metamers can be achieved with windows whose size is matched to that of the underlying visual neurons". This presumably is true only under particular conditions, or when specific assumptions about readout from populations of neurons are invoked. It would be good to identify and state such assumptions more directly (this is partly covered in the Discussion section ’The linking proposition underlying the metamer paradigm’, but this should be anticipated or moved earlier in the text).

      We agree that our introduction was too cursory and have reworked it. We have also backed off of the direct comparison to physiology and clarified that we chose these two as the simplest possible pooling models. We have also added sentences at the end of each result section attempting to summarize the implication (before discussing them fully in the discussion). Hopefully the logic and assumptions are now clearer.

      There are also some findings that warrant a more extensive discussion. For example, what is the broader implication of the finding that original vs. synthesised and synthesised vs. synthesised comparisons exhibit very different scaling values? Does this tell us something about internal visual representations, or is it simply capturing something about the stimuli?

      We believe this difference is a result of the stimuli that are used in the experiment and thus the synthesis procedure itself, which interacts with the model’s pooled image feature. We have attempted to update the relevant figures and discussions to clarify this, in the sections starting on pg 17 line 396 and pg. 19 line 417.

      At some points in the paper, a third model (’texture model’) creeps into the discussion, without much explanation. I assume that this refers to models that consider joint (rather than marginal) statistics of wavelet responses, as in the famous Portilla & Simoncelli texture model. However, it would be helpful to the reader if the authors could explain this.

      Addressed on pg. 3, starting on line 94.

      Minor corrections.

      Caption of Figure 3: ’top’ and ’bottom’ should be ’left’ and ’right’

      Line 177: ’smallest tested scaling values tested’. Remove one instance of ’tested’

      Line 212: ’the images-specific psychometric functions’ -> ’image-specific’

      Line 215: ’cloud-like pink noise’. It’s not literally pink noise, so I would drop this.

      Line 236: ’Importantly, these results cannot be predicted from the model, which gives no specific insight as to why some pairs are more discriminable than others’. The authors should specify what we do learn from the model if it fails to provide insight into why some image pairs are more discriminable than others.

      Figure 9: it might be helpful to include small insets with the ’highway’ and ’tiles’ source images to aid the reader in understanding how the images in 9B were generated.

      Table 1 placement should be after it is first referred to on line 258.

      In the Discussion section "Why does critical scaling depend on the comparison being performed", it would be helpful to consider the case where the two model metamers *are* distinguishable from each other even though each is indistinguishable from the target image. I would assume that this is possible (e.g., if the target image is at the midpoint between the two model images in image space and each of the stimuli is just below 1 JND away from the target). Or is this not possible for some reason?

      Regarding line 236: this specific line has been removed, and the discussion about this issue has all been consolidated in the final section of the discussion, starting on pg. 19 line 438.

      Regarding the final comment: this is addressed in the paragraph starting on pg. 16 line 386. To expand upon that: the situation laid out by the reviewer is not possible in our conceptualization, in which metamerism is transitive and image discriminability is binary. In order to investigate situations like the one laid out by the reviewer, one needs models whose representations have metric properties, i.e., which allow you to measure and reason about perceptual distance, which we refer to in the paragraph starting on pg. 20 line 460. We also note that this situation has not been observed in this or any other pooling model metamer study that we are aware of. All other minor changes have been addressed.
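
      The distinction drawn in this reply can be illustrated with a toy example. The sketch below assumes a hypothetical one-dimensional stimulus axis measured in JND units; the function names and the cell width are illustrative and are not taken from the manuscript. It contrasts a distance-threshold notion of indistinguishability, which is not transitive, with a cell-based (binary) notion like the one the Voronoi schematic depicts, which is.

```python
import math

def within_one_jnd(a, b, jnd=1.0):
    """Threshold notion: two stimuli are indistinguishable if closer than one JND."""
    return abs(a - b) < jnd

def same_cell(a, b, cell_width=1.0):
    """Binary notion: two stimuli are metameric if they fall in the same cell."""
    return math.floor(a / cell_width) == math.floor(b / cell_width)

def is_transitive(points, related):
    """Check x~y and y~z implies x~z for every triple drawn from points."""
    return all(
        related(x, z)
        for x in points for y in points for z in points
        if related(x, y) and related(y, z)
    )

# Hypothetical positions: target at 0, two synthesized stimuli 0.9 JND either side.
stimuli = [-0.9, 0.0, 0.9]

threshold_transitive = is_transitive(stimuli, within_one_jnd)
cell_transitive = is_transitive(stimuli, same_cell)
print(threshold_transitive, cell_transitive)
```

      Under the threshold notion, each synthesized stimulus is indistinguishable from the target but the two are 1.8 JND apart, which is the reviewer's scenario. Under the cell-based notion that scenario cannot arise, which is why a model with metric properties is needed to study it.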

      Reviewer #2 (Recommendations For The Authors):

      Original image T should be marked in the Voronoi diagrams.

      Brown et al. is miscited as 2021; it should be ACM Transactions on Applied Perception 2023.

      Figure 3 caption: models are left and right, not top and bottom.

      Thanks, all of the above have been addressed.

      References

      Brown R, Dutell V, Walter B, Rosenholtz R, Shirley P, McGuire M, Luebke D. Efficient Dataflow Modeling of Peripheral Encoding in the Human Visual System. ACM Transactions on Applied Perception. 2023 Jan; 20(1):1–22. http://dx.doi.org/10.1145/3564605, doi: 10.1145/3564605.

      Freeman J, Simoncelli EP. Metamers of the ventral stream. Nature Neuroscience. 2011 Aug; 14(9):1195–1201. doi: 10.1038/nn.2889.

      Ziemba CM, Simoncelli EP. Opposing Effects of Selectivity and Invariance in Peripheral Vision. Nature Communications. 2021 Jul; 12(1). https://doi.org/10.1038/s41467-021-24880-5, doi: 10.1038/s41467-021-24880-5.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) The authors make fairly strong claims that "arousal-related fluctuations are isolated from neurons in the deep layers of the SC" (emphasis added). This conclusion is based on comparisons between a "slow drift axis", a low-dimensional representation of neuronal drift, and other measures of arousal (Figures 2C, 3) and motor output sensitivity (Figures 2B, 3B). However, the metrics used to compare the slow-drift axis and motor activity were computed during separate task epochs: the delay period (600-1100 ms) and a perisaccade epoch (25 ms before and after saccade initiation), respectively. As the authors reference, deep-layer SC neurons are typically active only around the time of a saccade. Therefore, it is not clear if the lack of arousal-related modulations reported for deep-layer SC neurons is because those neurons are truly insensitive to those modulations, or if the modulations were not apparent because they were assessed in an epoch in which the neurons were not active. A potentially more valuable comparison would be to calculate a slow-drift axis aligned to saccade onset. 

      The reviewer makes an important point that the calculation of an axis can depend critically on the time window of neuronal response. We find that the slow drift axis is less sensitive to this issue because it is calculated on time-averaged activity over multiple trials. In previous work we found that slow drift calculated on the stimulus evoked response in V4 was very well aligned to slow drift calculated on pre-stimulus spontaneous activity (Cowley et al, Neuron, 2020, Supplemental Figure 3A and 3B). To address this issue in the present data, we compared the axis computed for an example session for neural activity during the delay period and neural activity aligned to saccade onset. As shown in the new Figure 2 – figure supplement 1 in the revised manuscript, we found a similar lack of arousal-related modulations for deep-layer SC neurons when slow drift was computed using the saccade epoch (25ms before to 25ms after the onset of the saccade). Figure 2 – figure supplement 1A shows loadings for the SC slow drift axis when it was computed using spiking responses during the delay period (as in the main manuscript analysis). In contrast, Figure 2 – figure supplement 1B shows loadings from the same session when the SC slow drift axis was computed using spiking responses during the saccade epoch. The plots are highly similar and in both cases the loadings were weaker for neurons recorded from channels at the bottom of the probe which have a higher motor index. Finally, we found that projections onto the SC slow drift axis for this session were strongly correlated when the slow drift axis was computed using spiking responses during the delay period and the saccade epoch (r = 0.66, p < 0.001, Figure 1C). Taken together, these results suggest that arousal-related modulations are less evident in deep-layer SC neurons irrespective of whether slow drift was computed during the delay or saccade epoch (see also Public Reviews, Reviewer 1, Point 2).
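
      The epoch comparison described above can be sketched on simulated data. This is a minimal illustration, not the authors' pipeline: it assumes the slow drift axis is the first principal component of binned, mean-centered activity, and the loadings, noise level, bin count, and function names are all hypothetical. Two "epochs" share one slow latent, with the last two simulated neurons (standing in for high-motor-index units) loading weakly on it.

```python
import math
import random

def mean_center(rows):
    """Subtract each neuron's mean across time bins (one row per neuron)."""
    centered = []
    for r in rows:
        m = sum(r) / len(r)
        centered.append([v - m for v in r])
    return centered

def first_pc(rows, n_iter=200):
    """First principal component (unit vector over neurons) via power iteration."""
    x = mean_center(rows)
    n, t = len(x), len(x[0])
    cov = [[sum(x[i][k] * x[j][k] for k in range(t)) / (t - 1)
            for j in range(n)] for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def project(rows, axis):
    """Project mean-centered population activity onto an axis, bin by bin."""
    x = mean_center(rows)
    return [sum(axis[i] * x[i][k] for i in range(len(x)))
            for k in range(len(x[0]))]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(0)
n_bins = 60
latent = [math.sin(2 * math.pi * k / n_bins) for k in range(n_bins)]  # slow drift
loadings = [1.0, 0.8, 0.6, 0.4, 0.05, 0.0]  # hypothetical; last two barely drift

def simulate_epoch(gain):
    """Binned activity for one epoch: shared latent scaled by gain, plus noise."""
    return [[gain * w * z + rng.gauss(0.0, 0.1) for z in latent] for w in loadings]

delay = simulate_epoch(1.0)
saccade = simulate_epoch(0.9)
axis_delay = first_pc(delay)
axis_saccade = first_pc(saccade)
r = pearson(project(delay, axis_delay), project(saccade, axis_saccade))
print(round(abs(r), 2), [round(abs(c), 2) for c in axis_delay])
```

      Because a principal component is defined only up to sign, the correlation between projections is best read in absolute value. In this toy setting the two axes recover the same latent, and the weakly loaded neurons contribute little to either axis.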

      (2) More generally, arousal-related signals may persist throughout multiple different epochs of the task. It would be worthwhile to determine whether similar "slow-drift" dynamics are observed for baseline, sensory-evoked, and saccade-related activity. Although it may not be possible to examine pupil responses during a saccade, there may be systematic relationships between baseline and evoked responses. 

      Similar to the point above, slow drift dynamics tend to be similar across different response epochs because they are averaged across many trials and seem to tap into responsivity trends that are robust across epochs. As shown in Author response image 1 below, and in Figure 2 – figure supplement 1 in the revised manuscript, similar dynamics were observed when the SC slow drift axis was computed using spiking responses during the baseline, delay, visual and saccade epochs. We did not investigate differences between baseline and evoked pupil responses in the current paper. However, these effects were characterized in one of our previous papers that focused exclusively on the relationship between slow drift and eye-related metrics (Johnston et al., 2022, Cereb. Cortex, Figure 6). In this previous work, we found a negative correlation between baseline and evoked pupil size. Both variables were significantly correlated with slow drift, the only difference being the sign of the correlation.

      Author response image 1.

      (A-C) Dynamics of slow drift for three example sessions when the SC slow drift axis was computed using spiking responses during the baseline, delay, visual and saccade epochs. Baseline = 100ms before the onset of the target stimulus; Delay = 600 to 1100ms after the offset of the target stimulus; Stim = 25ms to 125ms after the onset of the target stimulus; Sac = 25ms before to 25ms after the onset of the saccade.

      Johnston R, Snyder AC, Khanna SB, Issar D, Smith MA (2022) The eyes reflect an internal cognitive state hidden in the population activity of cortical neurons. Cereb Cortex 32:3331–3346.

      (3) The relationships between changes in SC activity and pupil size are quite small (Figures 2C & 5C). Although the distribution across sessions (Figure 2C) is greater than chance, they are nearly 1/4 of the size compared to the PFC-SC axis comparisons. Likewise, the distribution of r2 values relating pupil size and spiking activity directly (Figure 5) is quite low. We remain skeptical that these drifts are truly due to arousal and cannot be accounted for by other factors. For example, does the relationship persist if accounting for a very simple, monotonic (e.g., linear) drift in pupil size and overall firing rate over the course of an individual session? 

      Firstly, it is important to note that the strength of the relationship between projections onto the SC slow drift axis and pupil size (r² = 0.06) is within the range reported by Joshi et al. (2016, Neuron, Figure 3). They investigated the median variance explained between the spiking responses of individual SC neurons and pupil size and found it to be approximately 0.02 across sessions. Secondly, our statistical approach of testing the actual distribution of r² values against a shuffled distribution was specifically designed to rule out the possibility that the relationship between SC spiking responses and pupil size occurred due to linear drifts. The shuffled distribution in Figure 2C of the main manuscript represents the variance that can be explained by one session’s slow drift correlated with another session’s pupil, which would contain effects that occurred due to linear drifts alone. That the actual proportion of variance explained was significantly greater than this distribution suggests that the relationship between projections onto the SC slow drift axis and pupil size reflects changes in arousal rather than other factors related to linear drifts.

      Joshi S, Li Y, Kalwani RM, Gold JI (2016) Relationships between Pupil Diameter and Neuronal Activity in the Locus Coeruleus, Colliculi, and Cingulate Cortex. Neuron 89:221–234.
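
      The logic of the across-session shuffle control in the response to point (3) can be sketched as follows. This is an illustrative simulation under stated assumptions, not the authors' analysis: every session shares the same linear trend, each session has its own arousal time course, and the statistical test used to compare the two distributions is not reproduced. All names and parameter values are hypothetical.

```python
import math
import random

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

def matched_and_shuffled_r2(drift_by_session, pupil_by_session):
    """Return r^2 for same-session pairs and for all cross-session pairs."""
    matched, shuffled = [], []
    n = len(drift_by_session)
    for i in range(n):
        matched.append(r_squared(drift_by_session[i], pupil_by_session[i]))
        for j in range(n):
            if i != j:
                # Truncate to the shorter session so the sequences align.
                m = min(len(drift_by_session[i]), len(pupil_by_session[j]))
                shuffled.append(
                    r_squared(drift_by_session[i][:m], pupil_by_session[j][:m]))
    return matched, shuffled

rng = random.Random(0)
n_bins = 100
trend = [0.5 * k / n_bins for k in range(n_bins)]  # linear drift shared by all sessions

def simulate_session(freq):
    """Slow drift and pupil traces driven by one session-specific arousal signal."""
    arousal = [math.sin(2 * math.pi * freq * k / n_bins) for k in range(n_bins)]
    drift = [tr + a + rng.gauss(0.0, 0.1) for tr, a in zip(trend, arousal)]
    pupil = [tr + a + rng.gauss(0.0, 0.1) for tr, a in zip(trend, arousal)]
    return drift, pupil

sessions = [simulate_session(f) for f in (1, 2, 3, 4)]
drift_by_session = [s[0] for s in sessions]
pupil_by_session = [s[1] for s in sessions]

matched, shuffled = matched_and_shuffled_r2(drift_by_session, pupil_by_session)
mean_matched = sum(matched) / len(matched)
mean_shuffled = sum(shuffled) / len(shuffled)
print(round(mean_matched, 2), round(mean_shuffled, 2))
```

      The cross-session pairs retain whatever variance a shared linear trend explains, so a same-session r² that exceeds this distribution points to a session-specific source of covariation rather than to a generic drift.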

      (4) It is not clear how the final analysis (Figure 6) contributes to the authors' conclusions. The authors perform PCA on: (i) residual spiking responses during the delay period binned according to pupil size, and (ii) spiking responses in the saccade epoch binned according to target location (i.e., the saccade tuning curve). The corresponding PCs are the spike-pupil axis and the saccade tuning axis, respectively. Unsurprisingly, the spike-pupil axis that captures variance associated with arousal (and removes variance associated with saccade direction) was not correlated with a saccade-tuning axis that captures variance associated with saccade direction and omits arousal. Had these measures been related it would imply a unique association between a neuron's preferred saccade direction and pupil control, which seems unlikely. The separation of these axes thus seems trivial and does not provide evidence of a "mechanism...in the SC to prevent arousal-related signals interfering with the motor output." It remains unknown whether, for example, arousal-related signals may impact trial-by-trial changes in neuronal gain near the time of a saccade, or alter saccade dynamics such as acceleration, precision, and reaction time. 

      The reviewer makes a good point, and we agree that more evidence is needed to determine if the separation of the pupil size axis and saccade tuning axis is the mechanism through which cognitive and arousal-related signals can be intermixed in the SC. In the revised manuscript (lines 679-682), we have raised this as a possible explanation that necessitates further study rather than stating definitively that it is the exact mechanism through which these signals are kept separate. Our analysis here is similar to the one from Smoulder et al (2024, Neuron, Fig. 2F), in which the interactions between reward signals and target tuning in M1 were examined (and found to be orthogonal). While we agree with the reviewer that it may seem “trivial” for these axes to be orthogonal, it does not have to be so. If, for example, neural tuning curves shifted with changes in pupil size through gain changes that revealed tuning or affected tuning curve shape, there could be projections of the pupil axis onto the target tuning axis. Thus, while we agree with the reviewer that it appears sensible for these two axes to be orthogonal, our result is nonetheless a novel finding. We have edited the text in our revised manuscript, however, to make sure the nuance of this point is conveyed to the reader.

      Smoulder AL, Marino PJ, Oby ER, Snyder SE, Miyata H, Pavlovsky NP, Bishop WE, Yu BM, Chase SM, Batista AP. A neural basis of choking under pressure. Neuron. 2024 Oct 23;112(20):3424-33.
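
      The argument that orthogonality "does not have to be so" can be made concrete with a simple alignment measure. The sketch below uses the squared cosine between two axes in neuron space, which is one common way to quantify alignment and is not claimed to be the measure used in the manuscript. The four-neuron axes are hypothetical.

```python
def squared_cosine(u, v):
    """Squared cosine of the angle between two axes (0 = orthogonal, 1 = aligned)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    return (dot * dot) / (norm_u * norm_v)

# Hypothetical saccade tuning axis: neurons 0 and 2 prefer one direction,
# neurons 1 and 3 prefer the opposite direction.
tuning_axis = [1.0, -1.0, 1.0, -1.0]

# Case 1: pupil-linked changes shift all neurons equally (purely additive).
additive_pupil_axis = [1.0, 1.0, 1.0, 1.0]

# Case 2: pupil-linked changes affect only the neurons tuned to one direction,
# as could happen if arousal acted through a tuning-dependent gain change.
gain_like_pupil_axis = [1.0, 0.0, 1.0, 0.0]

additive_alignment = squared_cosine(additive_pupil_axis, tuning_axis)
gain_alignment = squared_cosine(gain_like_pupil_axis, tuning_axis)
print(additive_alignment, gain_alignment)
```

      In the additive case the pupil axis has no projection onto the tuning axis, whereas in the tuning-dependent case half of its variance lies along it, so the degree of alignment is an empirical question.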

      Reviewer #2 (Public Review):

      (1) The greatest weakness in the present research is the fact that arousal is a functionally less important non-motoric variable. The authors themselves introduce the problem with a discussion of attention, which is without any doubt the most important cognitive process that needs to be functionally isolated from oculomotor processes. Given this introduction, one cannot help but wonder, why the authors did not design an experiment, in which spatial attention and oculomotor control are differentiated. Absent such an experiment, the authors should spend more time explaining the importance of arousal and how it could interfere with oculomotor behavior. 

      Although attention does represent an important cognitive process, we did not design an experiment in which attention and oculomotor control are differentiated because attention does not appear to be related to slow drift. In our first paper that reported on this phenomenon, we investigated the effects of spatial attention on slow fluctuations in neural activity by cueing the monkeys to attend to a stimulus in the left or right visual field in a block-wise manner. Each block lasted ~20 minutes and we found that slow drift did not covary with the timing of cued blocks (see Figure 4A, Cowley et al., 2020, Neuron). Furthermore, there is a large body of work showing that arousal also impacts motor behavior leading to changes in a range of eye-related metrics (e.g., pupil size, microsaccade rate and saccadic reaction time - for review, see Di Stasi et al. 2013, Neurosci. Biobehav. Rev.). We also note that the terms attention and arousal are often used in nonspecific and overlapping ways in the literature, adding to some potential confusion here. Nonetheless, pupil-linked arousal is an important variable that impacts motor performance. This has now been stated clearly in the Introduction of the revised manuscript (lines 108-114) to address the reviewer’s concerns and highlight the importance of studying how precise fixation and eye movements are maintained even in the presence of signals related to ongoing changes in brain state. 

      Cowley BR, Snyder AC, Acar K, Williamson RC, Yu BM, Smith MA (2020) Slow Drift of Neural Activity as a Signature of Impulsivity in Macaque Visual and Prefrontal Cortex. Neuron 108:551-567.e8.

      (2) In this context, it is particularly puzzling that one actually would expect effects of arousal on oculomotor behavior. Specifically, saccade reaction time, accuracy, and speed could be influenced by arousal. The authors should include an analysis of such effects. They should also discuss the absence or presence of such effects and how they affect their other results. 

      As described above, several studies across species have demonstrated that arousal impacts motor behavior e.g., saccade reaction time, saccade velocity and microsaccade rate (for review, see Di Stasi et al. 2013, Neurosci. Biobehav. Rev.). This has been clarified in the Introduction of the revised manuscript to address the reviewer's concerns (lines 108-114). Our prior work (Johnston et al, Cerebral Cortex, 2022) shows that slow drift impacts several types of oculomotor behavior. Overall, these studies highlight the impact of arousal on eye movements as a robust effect, and support the present investigation into arousal and oculomotor control signals. While we agree reaction time, accuracy, and speed all can be influenced by arousal depending on task demands, the present study is focused on the connection between slow fluctuations in neural activity, linked to arousal, and different subpopulations of SC neurons. 

      Di Stasi LL, Catena A, Cañas JJ, Macknik SL, Martinez-Conde S (2013) Saccadic velocity as an arousal index in naturalistic tasks. Neurosci Biobehav Rev 37:968–975.

      Johnston R, Snyder AC, Khanna SB, Issar D, Smith MA (2022) The eyes reflect an internal cognitive state hidden in the population activity of cortical neurons. Cereb Cortex 32:3331–3346.

      (3) The authors use the analysis shown in Figure 6D to argue that across recording sessions the activity components capturing variance in pupil size and saccade tuning are uncorrelated. However, the distribution (green) seems to be non-uniform with a peak at very low and very high correlation specifically. The authors should test if such an interpretation is correct. If yes, where are the low and high correlations respectively? Are there potentially two functional areas in SC? 

      We agree with the reviewer that our actual data distribution was non-uniform. We examined individual sessions with high and low variance explained and did not find notable differences. One source of this variation has to do with session length. Longer sessions in principle should have a chance distribution of variance explained closer to zero because they contained more time bins. Given that we had no specific hypothesis for a non-uniform distribution, we have simply displayed the full distribution of values in our figure and the statistical result of a comparison to a shuffled distribution.

      Reviewer #3 (Public Review):

      (1) However, I am concerned about two main points: First, the authors repeatedly say that the "output" layers of the SC are the ones with the highest motor indices. This might not necessarily be accurate. For example, current thresholds for evoking saccades are lowest in the intermediate layers, and Mohler & Wurtz 1972 suggested that the output of the SC might be in the intermediate layers. Also, even if it were true that the high motor index neurons are the output, they are very few in the authors' data (this is also true in a lot of other labs, where it is less likely to see purely motor neurons in the SC). So, this makes one wonder if the electrode channels were simply too deep and already out of the SC? In other words, it seems important to show distributions of encountered neurons (regardless of the motor index) across depth, in order to better know how to interpret the tails of the distributions in the motor index histogram and in the other panels of Figure Supplement 1. I elaborate more on these points in the detailed comments below. 

      The reviewer makes a good point about the efferent signals from SC. It is true that electrical thresholds are often lowest in intermediate layers, though deep layers do project to the oculomotor nuclei (Sparks, 1986; Sparks & Hartwich-Young, 1989) and often intermediate and deep layers are considered to function together to control eye movements (Wurtz & Albano, 1980). As suggested by the reviewer, we have edited the text throughout the manuscript to say that slow drift was less evident in SC neurons with a higher motor index, as well as included the above references and points about the intermediate and deep layers (Lines 73-81). Aside from the question of which layers of the SC function as the “motor output”, the reviewer raises a separate and important question – are our deep recordings still in SC. Here, we can say definitively that they are. We removed neurons if they did not exhibit elevated (above baseline) firing rates during the visual or saccade epochs of the MGS task (see Methods section on “Exclusion criteria”). All included neurons possessed a visual, visuomotor or motor response, consistent with the response properties of neurons in the SC. In addition, we found a number of neurons well above the bottom of the probe with strong motor responses and minimal loadings onto the slow drift axis (see Figure 2 – figure supplement 1A), consistent with the reviewer’s comment that intermediate layer neurons are tuned for movement and play a role in saccade production.

      Mohler CW, Wurtz RH. Organization of monkey superior colliculus: intermediate layer cells discharging before eye movements. Journal of neurophysiology. 1976 Jul 1;39(4):722-44.

      Sparks DL. Translation of sensory signals into commands for control of saccadic eye movements: role of primate superior colliculus. Physiol Rev. 1986 Jan;66(1):118-71. doi: 10.1152/physrev.1986.66.1.118. PMID: 3511480.

      Sparks DL, Hartwich-Young R. The deep layers of the superior colliculus. Reviews of oculomotor research. 1989 Jan 1;3:213-55.

      Wurtz RH, Albano JE. Visual-motor function of the primate superior colliculus. Annu Rev Neurosci. 1980;3:189-226. doi: 10.1146/annurev.ne.03.030180.001201. PMID: 6774653.

      (2) Second, the authors find that the SC cells with a low motor index are modulated by pupil diameter. However, this could be completely independent of an "arousal signal". These cells have substantial visual responses. If the pupil diameter changes, then their activity should be influenced since the monkey is watching a luminous display. So, in this regard, the fact that they do not see "an arousal signal" in most motor neurons (through the pupil diameter analyses) is not evidence that the arousal signal is filtered out from the motor neurons. It could simply be that these neurons simply do not get affected by the pupil diameter because they do not have visual sensitivity. So, even with the pupil data, it is still a bit tricky for me to interpret that arousal signals are excluded from the "output layers" of the SC. 

      The reviewer makes an important point about the SC’s visual responses. Neurons with a low motor index are, conversely, likely to have a stronger visual response index. However, we do not believe that changes in luminance can explain why the correlation between SC spiking response and pupil size is weaker for neurons with a lower motor index. Firstly, the changes in pupil size observed in the current paper and our previous work are slow and occur on a timescale of minutes (Cowley et al., 2020, Neuron) and are correlated with eye movement measures such as reaction time and microsaccade rate (Johnston et al., 2022, Cerebral Cortex). This is in stark contrast to luminance-evoked changes in pupil size that occur on a timescale of less than a second. Secondly, as shown in the new Figure 5 – figure supplement 1 in the revised manuscript, very similar results were found when SC spiking responses were correlated with pupil size during the baseline period, when only the fixation point was on the screen. Although the luminance of the small peripheral target stimulus can result in small luminance-evoked changes in pupil size, no changes in luminance occurred during the baseline period which was defined as 100ms before the onset of the target stimulus. In Figure 2 – figure supplement 1 and Author response image 1 above, we show that slow drift is the same whether calculated on the baseline response, delay period, or peri-saccadic epoch. Thus, the measurement of slow drift is insensitive to the precise timing of the selection of both the window for the spiking response and the window for the pupil measurement. If luminance were the explanation for the slow changes in firing observed in visually responsive SC neurons, it would require those neurons to exhibit robust, sustained tuned responses to the small changes in retinal illuminance induced by the relatively small fluctuations in pupil size we observed from minute to minute. We are aware of no reports of such behavior in visually-responsive neurons in SC. We have included these analyses and this reasoning in the revised manuscript on lines 478-495.

      Reviewer #1 (Recommendations For The Authors):

      (1) It would be useful to provide line numbers in subsequent manuscripts for reviewers.

      Line numbers have been added in the revised version of the manuscript.

      (2) Page #6; last sentence: "...even impact processing at the early to mid stages of the visuomotor transformation, without leading to unwanted changes in motor output." I do not believe the authors have provided evidence that arousal levels were not associated with changes in motor output.

      As suggested by Reviewer 3 (see Public Reviews, Reviewer 3, Point 2), we have edited the text throughout the manuscript to say that slow drift was less evident in SC neurons with a higher motor index. This sentence in the revised manuscript now reads:

      “This provides a potential mechanism through which signals related to cognition and arousal can exist in the SC, and even impact processing at the early to mid stages of the visuomotor transformation, without leading to unwanted changes in SC neurons that are linked to saccade execution.”

      (3) Page #8; last paragraph: Although deep-layer SC neurons may not have been obtained during every recording session, a summary of the motor index scores observed along the probe across sessions would be useful to confirm their assumptions. 

      See Author response image 2 below, which shows the motor index of each recorded SC neuron on the x-axis and session number on the y-axis. The points are colored according to the squared factor loading, which represents the variance explained between the response of a neuron and the slow drift axis (see Figure 3B of the main manuscript). You can see from this plot that neurons with a stronger component loading (shown in teal to yellow) typically have a lower motor index whereas the opposite is true for neurons with a weaker component loading (shown in dark blue).

      Author response image 2.

      Scatter plot showing the motor index of each recorded neuron along with the session number in which it was recorded. The points are colored according to the squared factor loading for each neuron along the slow drift axis. Note that loadings above 0.5 (33 data points in total) have been thresholded at 0.5 so that we could effectively use the color range to show all of the slow drift axis loadings.

      (4) Page #10; first paragraph: The authors should state the time window of the delay period used, since it may be distinct from the pupil analysis (first 200ms of delay). 

      This has been stated in the revised version of the manuscript. The sentence now reads:

      “We first asked if arousal-related fluctuations are present in the SC. As in previous studies that recorded from neurons in the cortex (Cowley et al., 2020), we found that the mean spiking responses of individual SC neurons during the delay period (chosen at random on each trial from a uniform distribution spanning 600-1100ms, see Methods) fluctuated over the course of a session while the monkeys performed the MGS task (Figure 2A, left).”

      (5) Page #10; second paragraph: Extra period at the end of a sentence: " most variance in the data..". 

      Fixed in the revised version of the manuscript.

      (6) Page #12: "between projections onto the SC slow drift axis and mean pupil size during the first 200ms of the delay period when a task-related pupil response could be observed." What criteria was used to determine whether a task-related pupil response was observed? 

      This was chosen based on the results of a previous study in our lab that used the same memory-guided saccade task to investigate the relationship between slow drift and changes in baseline and evoked pupil size (see Johnston et al., 2022, Cereb. Cortex, Figure 6B). The period was chosen based on plotting the average pupil size aligned on different trial epochs. As we show in Figure 5-figure supplement 3 above, the pupil interactions with slow drift did not depend on the particular time window of the pupil we chose.

      (7) Page #14; Figure 2A: The axes for the individual channels are strangely floating and quite different from all other figures. Please label the channel in the figure legend that was used as an example of the projected values onto the slow drift axis.

      The figure has been changed in the revised version of the manuscript so that the tick mark denoting zero residual spikes per second is on the top layer of each plot. A scale bar was chosen instead of individual axes to reduce clutter in the figure as it was used to demonstrate how slow drift was computed. Residual spiking responses from all neurons were projected on the slow drift axis to generate the scatter plot in the bottom right-hand corner of Figure 2A. There is no single neuron to label.

      (8) Page #16: "These results demonstrate that even though arousal-related fluctuations are present in the SC, they are isolated from deep-layer neurons that elicit a strong saccadic response and presumably reside closer to the motor output." In line with our major comments, lack of arousal-related activity during the delay period is meaningless for deep-layer SC neurons that are generally inactive during this time. It does not imply that there is no arousal signal! 

      Addressed in Public Reviews, Reviewer 1, Point 1 & 2. We found a similar lack of arousal-related modulations reported for deep-layer SC neurons when slow drift was computed using the saccade epoch (Figure 1 above). In addition, similar dynamics were observed when the SC slow drift axis was computed using spiking responses during the baseline, delay, visual and saccade period (Figure 2).

      (9) Page #18: "These findings provide additional support for the hypothesis that arousal-related fluctuations are isolated from neurons in the deep layers of the SC." The same criticism from above applies.

      Addressed in Public Reviews, Reviewer 1, Point 1 & 2.

      (10) Page #20; paragraph 3: "Taken together, the findings outlined above..." Would be useful to be more specific when referring to "activity" ; e.g., "...these neurons did not exhibit large fluctuations in delay-period activity over time".

      This sentence has been changed in the revised manuscript in light of the reviewer’s comments. It now reads:

      “In addition to being more weakly correlated with pupil size, the spiking responses of these neurons did not exhibit large fluctuations over time (Figure 2), and when considering the neuronal population as a whole, explained less variance in the slow drift axis when it was computed using population activity in the SC (Figure 3) and PFC (Figure 4).”

      Reviewer #3 (Recommendations for the author):

      The paper is clear and well-written. However, I am concerned about two main points: 

      (1) First, the authors repeatedly say that the "output" layers of the SC are the ones with the highest motor indices. This might not necessarily be accurate. For example, current thresholds for evoking saccades are lowest in the intermediate layers, and Mohler & Wurtz 1972 suggested that the output of the SC might be in the intermediate layers. Also, even if it were true that the high motor index neurons are the output, they are very few in the authors' data (this is also true in a lot of other labs, where it is less likely to see purely motor neurons in the SC). So, this makes one wonder if the electrode channels were simply too deep and already out of the SC. In other words, it seems important to show distributions of encountered neurons (regardless of motor index) across depth, in order to better know how to interpret the tails of the distributions in the motor index histogram and in the other panels of the figure supplement 1. I elaborate more on these points in the detailed comments below. 

      Addressed in Public Reviews, Reviewer 3, Point 1.

      (2) Second, the authors find that the SC cells with a low motor index are modulated by pupil diameter. However, this could be completely independent of an "arousal signal". These cells have substantial visual responses. If the pupil diameter changes, then their activity should be influenced since the monkey is watching a luminous display. So, in this regard, the fact that they do not see "an arousal signal" in most motor neurons (through the pupil diameter analyses) is not evidence that the arousal signal is filtered out from the motor neurons. It could simply be that these neurons simply do not get affected by the pupil diameter because they do not have visual sensitivity. So, even with the pupil data, it is still a bit tricky for me to interpret that arousal signals are excluded from the "output layers" of the SC. 

      Addressed in Public Reviews, Reviewer 3, Point 2.

      (3) I think that a remedy to the first point above is to change the text to make it a bit more descriptive and less interpretive. For example, just say that the slow drifts were less evident among the neurons with high motor index. 

      We thank the reviewer for this suggestion (see Public Reviews, Reviewer 3, Point 1).

      (4) For the second point, I think that it is important to consider the alternative caveat of different amounts of light entering the system. Changes in light level caused by pupil diameter variations can be quite large. 

      We thank the reviewer for this suggestion (see Public Reviews, Reviewer 3, Point 2).

      (5) Line 31: I'm a bit underwhelmed by this kind of statement. i.e. we already know that cognitive processes and brain states do alter eye movements, so why is it "critical" that high precision fixation and eye movements are maintained? And, isn't the next sentence already nulling this idea of criticality because it does show that the brain state alters the SC neurons? In fact, cognitive processes are already known to be most prevalent in the intermediate and deep layers of the SC. 

      It seems clear that while cognitive state does affect eye movements, it is desirable to have some separation between cognitive state and eye movement control. Covert attention, for instance, is precisely a situation where eye movement control is maintained to avoid overt saccades to the attended stimulus, and yet there are clear indications of attention’s impact on microsaccades and fixation. We stand by our statement that an important goal of vision is to have precise fixation and movements of the eye, and yet at the same time the eyes are subject to numerous influences by cognitive state.

      (6) Line 65: it is better to clarify that these are "functional layers" because there are actually more anatomical layers. 

      We have edited this sentence in the revised version of the manuscript so that it now reads:

      “The role of these projections in the visuomotor transformation depends on the functional layer of the SC in which they terminate”.

      (7) Line 73: this makes it sound like only the deepest layers are topographically organized, which is not true. Also, as early as Mohler & Wurtz, 1972, it was suggested that the intermediate layers have the biggest impacts downstream of the SC. This is also consistent with electrical microstimulation current thresholds for evoking saccades from the SC. 

      We have addressed the reviewer’s comments about the intermediate layers having the biggest impact downstream of the SC in Public Reviews, Reviewer 3, Point 1. Furthermore, line 73 has been changed in the revised manuscript so that it now reads:

      “As is the case for neurons in the superficial and intermediate layers, they [SC motor neurons] form a topographically organized map of visual space (White et al. 2017; Robinson 1972; Katnani and Gandhi 2011)”.  

      (8) Line 100: there is an analogous literature regarding the question of why unwanted muscle contractions do not happen. Specifically, in the context of why SC visual bursts do not automatically cause saccades (which is a similar problem to the ones you mention about cognitive signals interfering by generating unwanted eye movements), both Jagadisan & Gandhi, Curr Bio, 2022 and Baumann et al, PNAS, 2023 also showed that SC population activity not only has different temporal structure (Jagadisan & Gandhi) but also occupy different subspaces (Baumann et al) under these two different conditions (visual burst versus saccade burst). This is conceptually similar to the idea that you are mentioning here with respect to arousal. So, it is worth it to mention these studies here and again in the discussion. 

      We are grateful to the reviewer for these suggestions and have included text in the Introduction (Lines 125-128) and Discussion (Lines 678-682) of the revised manuscript along with the references cited above.

      (9) Line 147: as mentioned above, it is now generally accepted that there are quite few "pure" motor neurons in the SC. This is consistent with what you find. E.g. Baumann et al., 2023. And, again see Mohler and Wurtz in the 1970's. So, I wonder how useful it is to go too much into this idea of the deeper motor neurons (e.g. the correlations in the other panels of the Figure 1 supplement). 

      This is related to the reviewer’s comment that the output of the SC might be in the intermediate layers. This concern has been addressed in Public Reviews, Reviewer 3, Point 1.

      (10) Figure 1 should say where the RF was for the shown spike rasters. i.e. were these the same saccade target across trials? And where was that location relative to the RF? It would help also in the text to say whether the saccade was always to the RF center or whether you were randomizing the target location. 

      We centered the array of saccade targets using the microstimulation-evoked eye movement for SC (see Methods section “Memory-guided saccade task”) to find the evoked eccentricity, and then used saccade targets with equal spacing of 45 degrees starting at zero (rightward saccade target). We did not do extensive RF mapping beyond this microstimulation centering. In Figure 1, the spike rasters are shown for a target that was visually identified to be within the neuron’s RF based on assessing responses to all 8 target angles. We have added information about this to the figure caption.

      (11) Line 218: but were there changes in the eye movement statistics? For example, the slow drift eye movements during fixation? Or even the microsaccades? 

      Addressed in Public Reviews, Reviewer 2, Point 2.  

      (12) Line 248: shuffling what exactly? I think that more explanation would be needed here. 

      Addressed in Public Reviews, Reviewer 1, Point 3.  

      (13) Line 263: but isn't this reflecting a sensory transient in the pupil diameter, since the target just disappeared? 

      Addressed in Public Reviews, Reviewer 3, Point 2.  

      (14) Line 271: I suspect that slow drift eye movements (in between microsaccades) would show higher correlations. Not sure how well you can analyze those with a video-based eye tracker. 

      We agree that fixational drift would be a worthwhile metric, but it is not one we have focused on here and to our knowledge does require higher precision tracking. 

      (15) Line 286: again, see above about similar demonstrations with respect to the visual and motor burst intervals, which clearly cause the same problem (even stronger) as the one studied here. 

      See reply, including Figure 2.

      (16) Line 330: again, I'm not sure deeper necessarily automatically means closer to the output. For example, current thresholds for evoked saccades grow higher as you go deeper. Maybe the authors can ask their colleague Neeraj Gandhi about this point specifically, just to be safe. Maybe the safest would be to remain descriptive about the data, and just say something like: arousal-related fluctuations were absent in our deepest recorded sites. 

      Addressed in Public Reviews, Reviewer 3, Point 1.

      (17) Line 332: likewise, statements like this one here would be qualified if the output was the intermediate layers......anyway if I understand what I read so far in the paper, the signal will be anyway orthogonal to the motor burst population subspace. So, maybe there's no need to emphasize that it goes away in the very deepest layers. 

      See reply above, Public Reviews, Reviewer 1, Point 4.

      (18) Figure 3A: related to the above, I think one issue could be that the deeper contacts might already be out of the SC. Maybe some cell count distribution from each channel should help in this regard. i.e. were you finding way fewer saccade-related neurons in the deepest channels (even though the few that you found were with high motor index)? If so, then wouldn't this just mean that the channel was too deep? I think there needs to be an analysis like this, to convince readers that the channels were still in the SC. Ideally, electrical stimulation current thresholds for evoking saccades at different depths would be tested, but I understand that this can be difficult at this stage. 

      Addressed in Public Reviews, Reviewer 3, Point 1.

      (19) I keep repeating this because in general, cognitive effects are stronger in the intermediate/deeper layers than in the superficial layers. If these interfere with eye movements like arousal, then why should arousal be different?

      Few studies have investigated the effects of attention on “pure” movement SC neurons that only discharge during a saccade. One study, which we cited in the Introduction (Ignashchenkova et al., 2004, Nat. Neurosci.), found significant differences in spiking responses between trials with and without attentional cueing for visual and visuomotor neurons. No significant difference was found for motor neurons, consistent with our hypothesis that signals related to cognition and arousal are kept separate from saccade-related signals in the SC.

      (20) The problem with Figure 5 and its related text is that the neurons with low motor index are additionally visual. So, of course, they can be modulated if the pupil diameter changes!

      Addressed in Public Reviews, Reviewer 3, Point 2.  

      (21) I had a hard time understanding Figure 6. 

      See reply above, Public Reviews, Reviewer 1, Point 4.

      (22) Line 586: these cells have more visual responses and will be affected by the amount of light entering the eye. 

      Addressed in Public Reviews, Reviewer 3, Point 2.

    1. Establishing the boundaries for your research may come from your instructor’s assignment guidelines.

      I completely agree with this sentence. Establishing boundaries for your research is especially important, and starting with what your teacher has provided is a good first step. For academic papers written as a student, your audience is somewhat ambiguous; generally speaking, the only person who will read your paper is your professor, so understanding the guidelines and what the professor needs from the paper is important. The purpose of the paper is to demonstrate not only that you can do research but also that you are actively learning, engaging with, and articulating the information you are researching. It's important not only that the instructor's guidelines are clear and concise, but also that, when they aren't, we ask questions and refine our understanding to ensure the essay is acceptable for the assignment.

    1. I use the end-papers at the back of the book to make a personal index of the author's points in the order of their appearance

      The making of a personal index is a first step in building a mesh of knowledge. In just a few years, Vannevar Bush will speak of "associative trails", a phrase he uses twice in "As We May Think" (The Atlantic, July 1945), but of potentially more import is his phrase "associative indexing", which gives way to either juxtaposing or linking two ideas (either similar or disjoint) together. It bears asking whether it's more valuable to index and juxtapose similar ideas or disjoint ideas, which may more frequently lead to better, more useful, and more relevant and rich future ideas.

      It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing. Bush, Vannevar. 1945. “As We May Think.” The Atlantic 176: 101–8. https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/ (October 22, 2022). #

    1. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.

      See also the precursor of personal indexing which Mortimer J. Adler mentions in 1940: https://hypothes.is/a/cPcoAqhVEfC0rJOZ0Pm-8Q

    1. Reviewer #3 (Public review):

      Summary

      The authors set out to explore the potential relationship between adult neurogenesis of inhibitory granule cells in the olfactory bulb and cumulative changes over days in odor-evoked spiking activity (representational drift) in the olfactory stream. They developed a richly detailed spiking neuronal network model based on Izhikevich (2003), allowing them to capture the diversity of spiking behaviors of multiple neuron types within the olfactory system. This model recapitulates the circuit organization of both the main olfactory bulb (MOB) and the piriform cortex (PCx), including connections between the two (both feedforward and corticofugal). Adult neurogenesis was captured by shuffling the weights of the model's granule cells, preserving the distribution of synaptic weights. Shuffling of granule cell connectivity resulted in cumulative changes in stimulus-evoked spiking of the model's M/T cells. Individual M/T cell tuning changed with time, and ensemble correlations dropped sharply over the temporal interval examined (long enough that almost all granule cells in the model had shuffled their weights). Interestingly, these changes in responsiveness did not disrupt low-dimensional stability of olfactory representations: when projected into a low-dimensional subspace, population vector correlations in this subspace remained elevated across the temporal interval examined. Importantly, in the model's downstream piriform layer, this was not the case. There, shuffled GC connectivity in the bulb resulted in a complete shift in piriform odor coding, including for low-dimensional projections. This is in contrast to what the model exhibited in the M/T input layer. Interestingly, these changes in PCx extended to the geometrical structure of the odor representations themselves. Finally, the authors examined the effect of experience on representational drift. Using an STDP rule, they allowed the inputs to and outputs from adult-born granule cells to change during repeated presentations of the same odor. This stabilized stimulus-evoked activity in the model's piriform layer.
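
      To make the weight-shuffling step in the summary above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each granule cell is represented by a list of synaptic weights and that "turnover" means permuting the weights of a randomly chosen fraction of cells; the function name `turnover_step` and its parameters are hypothetical. Because weights are only permuted, the overall weight distribution is preserved.

```python
import random


def turnover_step(weights, fraction, rng):
    """Permute the synaptic weights of a random fraction of granule cells.

    `weights` is a list with one weight list per granule cell (an assumed
    representation). Returns the new weights and the set of shuffled cells.
    """
    n_cells = len(weights)
    n_replaced = round(fraction * n_cells)
    # Pick which cells are "replaced" on this step.
    replaced = set(rng.sample(range(n_cells), n_replaced))
    new_weights = []
    for i, row in enumerate(weights):
        row = list(row)  # copy so the input is left untouched
        if i in replaced:
            rng.shuffle(row)  # same values, new targets
        new_weights.append(row)
    return new_weights, replaced


rng = random.Random(0)
weights = [[rng.random() for _ in range(5)] for _ in range(10)]
shuffled, replaced = turnover_step(weights, 0.3, rng)
```

      Repeating such a step over simulated days would eventually touch almost every cell, which is the regime the summary describes.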

      Strengths

      This paper suggests a link between adult neurogenesis in the olfactory bulb and representational drift in the piriform cortex. Using an elegant spiking network that faithfully recapitulates the basic physiological properties of the olfactory stream, the authors tackle a question of longstanding interest in a creative and interesting manner. As a purely theoretical study of drift, this paper presents important insights: synaptic turnover of recurrent inhibitory input can destabilize stimulus-evoked activity, but only to a degree, as representations in the bulb (the model's recurrent input layer) retain their basic geometrical form. However, this destabilized input results in profound drift in the model's second (piriform) layer, where both the tuning of individual neurons and the layer's overall functional geometry are restructured. This is a useful and important idea in the drift field, and to my knowledge, it is novel. The bulb is not the only setting where inhibitory synapses exhibit turnover (whether through neurogenesis or synaptic dynamics), and so this exploration of the consequences of such plasticity on drift is valuable. The authors also elegantly explore a potential mechanism to stabilize representations through experience, using an STDP rule specific to the inhibitory neurons in the input layer. This has an interesting parallel with other recent theoretical work on drift in the piriform (Morales et al., 2025 PNAS), in which STDP in the piriform layer was also shown to stabilize stimulus representations there. It is fascinating to see that this same rule also stabilizes piriform representations when implemented in the bulb's granule cells.

      The authors also provide a thoughtful discussion regarding the differential roles of mitral and tufted cells in drift in piriform and AON and the potential roles of neurogenesis in archicortex.

      In general, this paper puts an important and much-needed spotlight on the role of neurogenesis and inhibitory plasticity in drift. In this light, it is a valuable and exciting contribution to the drift conversation.

      Weaknesses

      I have one major, general concern that I think must be addressed to permit proper interpretation of the results.

      I worry that the authors' model may confuse thinking on drift in the olfactory system, because of differences in the behavior of their model from known features of the olfactory bulb. In their model, the tuning of individual bulbar neurons drifts over time. This is inconsistent with the experimental literature on the stability of odor-evoked activity in the olfactory bulb.

      In a foundational paper, Bhalla & Bower (1997) recorded from mitral and tufted cells in the olfactory bulb of freely moving rats and measured the odor tuning of well-isolated single units across a five-day interval. They found that the tuning of a single cell was quite variable within a day, across trials, but that this variability did not increase with time. Indeed, their measure of response similarity was equivalent within and across days. In what now reads as a prescient anticipation of the drift phenomenon, Bhalla and Bower concluded: "it is clear, at least over five days, that the cell is bounded in how it can respond. If this were not the case, we would expect a continual increase in relative response variability over multiple days (the equivalent of response drift). Instead, the degree of variability in the responses of single cells is stable over the length of time we have recorded." Thus, even at the level of single cells, this early paper argues that the bulb is stable.

      This basic result has since been replicated by several groups. Kato et al. (2012) used chronic two-photon calcium imaging of mitral cells in awake, head-fixed mice and likewise found that, while odor responses could be modulated by recent experience (odor exposure leading to transient adaptation), the underlying tuning of individual cells remained stable. While experience altered mitral cell odor responses, those responses recovered to their original form at the level of the single neuron, maintaining tuning over extended periods (two months). More recently, the Mizrahi lab (Shani-Narkiss et al., 2023) extended chronic imaging to six months, reporting that single-cell odor tuning curves remained highly similar over this period. These studies reinforce Bhalla and Bower's original conclusion: despite trial-to-trial variability, olfactory bulb neurons maintain stable odor tuning across extended timescales, with plasticity emerging primarily in response to experience. (The Yamada et al., 2017 paper, which the authors here cite, is not an appropriate comparison. In Yamada, mice were exposed daily to odor. Therefore, the changes observed in Yamada are a function of odor experience, not of time alone. Yamada does not include data in which the tuning of bulb neurons is measured in the absence of intervening experience.)

      Therefore, a model that relies on instability in the tuning of bulbar neurons risks giving the incorrect impression that the bulb drifts over time. This difference should be explicitly addressed by the authors to avoid any potential confusion. Perhaps the best course of action would be to fit their model to Mizrahi's data, should this data be available, and see if, when constrained by empirical observation, the model still produces drift in piriform. If so, this would dramatically strengthen the paper. If this is not feasible, then I suggest being very explicit about this difference between the behavior of the model and what has been shown empirically. I appreciate that in the data there is modest drift (e.g., Shani-Narkiss' Figure 8C), but the changes reported there really are modest compared to what is exhibited by the model. A compromise would be to simply apply these metrics to the model and match the model's similarity to the Shani-Narkiss data. Then the authors could ask what effect this has on drift in piriform.

      The risk here is that people will conclude from this paper that drift in piriform may simply be inherited from instability in the bulb. This view is inconsistent with what has been documented empirically, and so great care is warranted to avoid conveying that impression to the community.

      Major comments (all related to the above point)

      (1) Lines 146-168: The authors find in their model that "individual M/T cells changed their responses to the same odor across days due to adult-neurogenesis, with some cells decreasing the firing rate responses (Fig.2A1 top) while other cells increased the magnitude of their responses (Fig. 2A2 bottom, Fig. S2)"; they also report a significant decrease in the "full ensemble correlation" in their model over time. They claim that these changes in individual cell tuning are "similar to what has been observed by others using calcium imaging of M/T cell activity (Kato et al., 2012 and Yamada et al., 2017)" and that the decrease in full ensemble correlation is "consistent with experimental observations (Yamada et al., 2017)." However, the conditions of the Kato and Yamada experiments that demonstrate response change are not comparable here, as odors were presented daily to the animals in these experiments. Therefore, the changes in odor tuning found in the Kato and Yamada papers (Kato Figure 4D; Yamada Figure 3E) are a function of accumulated experience with odor. This distinction is crucial because experience-induced changes reflect an underlying learning process, whereas changes that simply accumulate over time are more consistent with drift. The conditions of their model are more similar to those employed in other experiments described in Kato et al. 2012 (Figure 6C) as well as Shani-Narkiss et al. (2023), in which bulb tuning is measured not as a function of intervening experience, but rather as a function of time (Kato's "recovery" experiment). What is found in Kato is that even across two months, the tuning of individual mitral cells is stable. What alters tuning is experience with odor, the core finding of both the Kato et al., 2012 paper and also Yamada et al., 2017. It is crucial that this is clarified in the text.

      (2) The authors show that in a reduced-space correlation metric, the correlation of low-dimensional trajectories "remained high across all days"..."consistent with a recent experimental study" (Shani-Narkiss et al., 2023). It is true that in the Shani-Narkiss paper, a consistent low-dimensional response is found across days (t-SNE analysis in Shani-Narkiss Figure 7B). However, the key difference between the Shani-Narkiss data and the results reported here is that Shani-Narkiss also observed relative stability in the native space (Shani-Narkiss Figure 8). They conclude that they "find a relatively stable response of single neurons to odors in either awake or anesthetized states and a relatively stable representation of odors by the MC population as a whole (Figures 6-8; Bhalla and Bower, 1997)." This should be better clarified in the text.

      (3) In the discussion, the authors state that "In the MOB, individual M/T cells exhibited variable odor responses akin to gain control, altering their firing rate magnitudes over time. This is consistent with earlier experimental studies using calcium-imaging." (L314-6). Again, I disagree that these data are consistent with what has been published thus far. Changes in gain would have resulted in increased variability across days in the Bhalla data. Moreover, changes in gain would be captured by Kato's change index ("To quantify the changes in mitral cell responses, we calculated the change index (CI) for each responsive mitral cell-odor pair on each trial (trial X) of a given day as (response on trial X - the initial response on day 1)/(response on trial X + the initial response on day 1). Thus, CI ranges from −1 to 1, where a value of −1 represents a complete loss of response, 1 represents the emergence of a new response, and 0 represents no change." Kato et al.). This index will capture changes in gain. However, as shown in Figure 4D (red traces), Figure 6C (Recovery and Odor set B during odor set A experience and vice versa), the change index is either zero or near zero. If the authors wish to claim that their model is consistent with these data, they should also compute Kato's change index for M/T odor-cell pairs in their model and show that it also remains at 0 over time, absent experience.
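
      The change index quoted above can be written out directly. The sketch below follows the quoted definition, (response on trial X - initial response on day 1)/(response on trial X + initial response on day 1); the function name and the choice to raise an error when both responses are zero are assumptions made for illustration, and responses are assumed to be non-negative.

```python
def change_index(response_trial_x, initial_response_day1):
    """Change index (CI) as defined in the quoted passage from Kato et al.

    Assumes non-negative responses, so CI lies between -1 and 1.
    """
    denom = response_trial_x + initial_response_day1
    if denom == 0:
        # Assumption: CI is undefined when both responses are zero.
        raise ValueError("change index undefined when both responses are zero")
    return (response_trial_x - initial_response_day1) / denom


ci_lost = change_index(0.0, 2.0)       # complete loss of response
ci_new = change_index(2.0, 0.0)        # emergence of a new response
ci_same = change_index(2.0, 2.0)       # no change
ci_gain = change_index(4.0, 2.0)       # response doubled (a pure gain change)
```

      In this sketch a doubling of the response gives a CI of 1/3 rather than 0, which illustrates the point made above that the index is sensitive to changes in gain.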

    1. Reviewer #2 (Public review):

      Summary:

      This paper addresses an interesting issue: how is the search for a visual target affected by its orientation (and the viewer's) relative to other items in the scene and gravity? The paper describes a series of visual search tasks, using recognizable targets (e.g., a cat) positioned within a natural scene. Reaction times and accuracy at determining whether the target was present or absent, trial-to-trial, were measured as the target's orientation, that of the context, and of the viewer themselves (via rotation in a flight simulator) were manipulated. The paper concludes that search is substantially affected by these manipulations, primarily by the reference frame of gravity, then visual context, followed by the egocentric reference frame.

      Strengths:

      This work is on an interesting topic, and benefits from using natural stimuli in VR / flight simulator to change participants' POV and body position.

      Weaknesses:

      There are several areas of weakness that I feel should be addressed.

      (1) The literature review/introduction seems to be lacking in some areas. The authors, when contemplating the behavioral consequences of searching for a 'rotated' target, immediately frame the problem as one of rotation, per se (i.e., contrasting only rotation-based explanations; "what rotates and in which 'reference frame[s]' in order to allow for successful search?"). For a reader not already committed to this framing, many natural questions arise that are worth addressing.

      1a) Why do we need to appeal to rotation at all as opposed to, say, familiarity? A rotated cat is less familiar than a typically oriented one. This is a long-standing literature (e.g., Wang, Cavanagh, and Green (1994)), of course, with a lot to unpack.

      1b) What are the triggers for the 'corrective' rotation that presumably brings reference frames back into alignment? What if the rotation had not been so obvious (i.e. for a target that may not have a typical orientation, like a hand, or a ball, or a learned, nonsense object?) or the background had not had such clear orientation (like a cluttered non-naturalistic background, or a naturalistic backdrop, but viewed from an unfamiliar POV (e.g., from above) or a naturalistic background, but not all of the elements were rotated)? What, ultimately, is rotated? The entire visual field? Does that mean that searching for multiple targets at different angles of rotation would interfere with one another?

      1c) Relatedly, what is the process by which the visual system comes to know the 'correct' rotation? (Or, alternatively, is 'triggered to realize' that there is a rotation in play?) Is this something that needs to be learned? Is it only learned developmentally, through exposure to gravity? Could it be learned in the context of an experiment that starts with unfamiliar stimuli?

      1d) Why the appeal to natural images? I appreciate any time a study can be moved from potentially too stripped-down laboratory conditions to more naturalistic ones, but is this necessary in the present case? Would the pattern of results have been different if these were typical laboratory 'visual search' displays of disconnected object arrays?

      1e) How should we reconcile rotation-based theories of 'rotated-object' search with visual search results from zero gravity environments (e.g., for a review, see Leone (1998))?

      1f) How should we reconcile the current manipulations with other viewpoint-perspective manipulations (e.g., Zhang & Pan (2022))?

      (2) The presentation/interpretation of results would benefit from more elaboration and justification.

      2a) All of the current interpretations rely on just the RT data. First, the RT results should also be presented in natural units (i.e., seconds/ms), not normalized. As well, results should be shown as violin plots or something similar that captures the distribution; a lot of important information is lost when just presenting one 'average' dot across participants. More fundamentally, I think we need to have a better accounting for performance (percent correct or d') to help contextualize the RT results. We should at least be offered some visualization (Heitz, 2014) of the speed-accuracy trade-off (SAT) for each of the conditions. Following this, the authors should more critically evaluate how any substantial SAT trends could affect the interpretation of results.

      2b) Unless I am missing something, the interpretation of the pattern of results (both qualitatively and quantitatively in their 'relative weight' analysis) relies on how they draw their contrasts. For instance, the authors contrast the two 'gravitational' conditions (target 0 deg versus target 90 deg) as if this were a change in a single variable/factor. But there are other ways to understand these manipulations that would affect contrasts. For instance, if one considers whether the target was 'consistent' (i.e., typically oriented) with respect to the context, egocentric, and gravitational frames, then the 'gravitational 0 deg' condition is consistent with context, egocentric view, but inconsistent with gravity. And, the 'gravitational 90 deg' condition, then, is inconsistent with context, egocentric view, but consistent with gravity. Seen this way, this is not a change in one variable, but three. The same is true of the baseline 0 deg versus baseline 90 deg condition, where again we have a change in all three target-consistency variables. The 'one variable' manipulations then would be: 1) baseline 0 versus visual context 0 (i.e., a change only in the context variable); 2) baseline 0 versus egocentric 0 (a change only in the egocentric variable); and 3) baseline 0 versus gravitational 0 (a change only in the gravitational variable). Other contrasts (e.g., gravitational 90 versus context 90) would showcase a change in two variables (in this case, a change in both context and gravity). My larger point is, again, unless I am really missing something, that the choice of how to contrast the manipulations will affect the 'pattern' of results and thereby the interpretation. If the authors agree, this needs to be acknowledged, plausible alternative schemes discussed, and the ultimate choice of scheme defended as the most valid.

      2c) Even with this 'relative weight' interpretation, there are still some patterns of results that seem hard to account for. Primarily, the egocentric condition seems hard to account for under any scheme, and the authors need to spend more time discussing/reconciling those results.

      2d) Some results are just deeply counterintuitive, and so the reader will crave further discussion. Most saliently for me, based on the results of Experiment 2 (specifically, the fact that gravitational 90 had better performance than gravitational 0), designers of cockpits should have all gauges/displays rotate counter to the airplane so that they are always consistent with gravity, not the pilot. Is this indeed a fair implication of the results?

      2e) I really craved some 'control conditions' here to help frame the current results. In keeping with the rhetorical questions posed above in 1a/b/c/d, if/when the authors engage with revisions to this paper, I would encourage the inclusion of at least some new empirical results. For me, the most critical would be to repeat some core conditions, but with a symmetric target (e.g. a ball), since that would seem to be the only way (given the current design) to tease out nuisance confounding factors such as, say, the general effect of performing search while sideways (put another way, the authors would have to assume here that search (non-normalized RTs and search performance) for a ball-target in the baseline condition would be identical to that in the gravitational condition.)
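
      The accuracy measures requested in 2a (percent correct or d') could be computed alongside RT along the following lines. This is a minimal sketch, not the authors' analysis: it assumes a target-present/target-absent design that yields hit and false-alarm counts, which the review does not confirm, and the function names and the log-linear correction for extreme rates are illustrative choices.

```python
from statistics import NormalDist


def percent_correct(hits, misses, false_alarms, correct_rejections):
    """Percentage of trials answered correctly (hypothetical 2x2 outcome counts)."""
    total = hits + misses + false_alarms + correct_rejections
    return 100.0 * (hits + correct_rejections) / total


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count, 1 to each total) keeps the
    rates strictly between 0 and 1 so the inverse normal CDF is always defined.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Hypothetical counts for one participant in one condition.
    counts = dict(hits=15, misses=5, false_alarms=5, correct_rejections=15)
    print(percent_correct(**counts), d_prime(**counts))
```

      Plotting these values per condition against mean RT would give one simple view of whether faster conditions are also less accurate.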

  2. social-media-ethics-automation.github.io
    1. When someone presents themselves as open and as sharing their vulnerabilities with us, it makes the connection feel authentic. We feel like they have entangled their wellbeing with ours by sharing their vulnerabilities with us. Think about how this works with celebrity personalities. Jennifer Lawrence became a favorite of many when she tripped at the Oscars [f2], and turned the moment into her persona as someone with a cool-girl, unpolished, unfiltered way about her. She came across as relatable and as sharing her vulnerabilities with us, which let many people feel that they had a closer, more authentic connection with her. Over time, that persona has come to be read differently, with some suggesting that this open-styled persona is in itself also a performance. Does this mean that her performance of vulnerability was inauthentic?

      This chapter about authenticity really makes me reflect on the current "performative" male trend. As you may know, the stereotype for these performative males goes along the lines of things like drinking matcha, wearing tote bags, listening to indie music like Clairo, etc. In hindsight, you can chalk this up as just one's interests, regardless of gender. But the reason it's such a big trend is that people can sense when a guy is doing it purely for validation. More specifically, female validation, since these interests are more stereotypically women's interests. So, as the text reads, "humans do not like to be duped", and when people can tell something is inauthentic, they're not going to take it seriously.

    1. Although it is increasingly recognised that the tools we use to examine our objects of study change our relationship to them, this is not an area that has been studied in any great detail within Digital Archaeology beyond perhaps discussions of the effects of different categories of software (the impact of GIS or database applications, for instance, or the effect of enlarged access to open data sources) on how we organise and understand the past. I have suggested elsewhere that through understanding how these technologies operate on us as well as for us, we can seek to ensure that they serve us better in what as archaeologists we already do, and help us initiate new and innovative ways of thinking about the past (Huggett 2004; 2012a). This entails going beyond the relatively commonplace reflections on specific software applications and their context of use: the tools we create, adopt, refine and employ have the effect of augmenting and scaffolding our thought and analysis, and consequently I have argued that they need to be approached in a considered, aware, and knowledgeable manner.

      Huggett highlights how the digital tools we use do more than organize data: they actively shape how we think about and interpret the past. He suggests that technologies “operate on us as well as for us,” meaning they influence not only the results of our research but also the cognitive processes that produce those results. This idea connects directly to my project on Tang poetry and emotion. When I use computational methods such as Voyant Tools and SnowNLP to analyze the emotional vocabulary of poems from the Tang dynasty, these tools shape the patterns I see and the questions I ask. For example, frequency counts or sentiment scores may emphasize some emotions while downplaying others that are culturally embedded in Chinese language and history. Therefore, as Huggett proposes, I must approach these technologies consciously and critically. They can scaffold my thought by helping me visualize large poetic patterns, but they can also reshape my understanding of the texts I study. This awareness encourages me to balance quantitative data with close reading and historical sensitivity, ensuring that the digital analysis deepens rather than distorts my interpretation of Tang emotional expression.

    1. Author response:

      The following is the authors’ response to the original reviews

      General Statements:

      In our manuscript, we demonstrate for the first time that RNA Polymerase I (Pol I) can prematurely release nascent transcripts at the 5' end of ribosomal DNA transcription units in vivo. This achievement was made possible by comparing wild-type Pol I with a mutant form of Pol I, hereafter called SuperPol, previously isolated in our lab (Darrière et al., 2019). By combining in vivo analysis of rRNA synthesis (using pulse-labelling of nascent transcripts and cross-linking of nascent transcripts - CRAC) with in vitro analysis, we could show that SuperPol reduced premature transcript release due to altered elongation dynamics and reduced RNA cleavage activity. Such premature release could reflect regulatory mechanisms controlling rRNA synthesis. Importantly, this increased processivity of SuperPol is correlated with resistance to BMH-21, a novel anticancer drug inhibiting Pol I, showing the relevance of targeting Pol I during transcriptional pauses to kill cancer cells. This work offers critical insights into Pol I dynamics, rRNA transcription regulation, and implications for cancer therapeutics.

      We sincerely thank the three reviewers for their insightful comments and recognition of the strengths and weaknesses of our study. Their acknowledgment of our rigorous methodology, the relevance of our findings on rRNA transcription regulation, and the significant enzymatic properties of the SuperPol mutant is highly appreciated. We are particularly grateful for their appreciation of the potential scientific impact of this work. Additionally, we value the reviewer’s suggestion that this article could address a broad scientific community, including in transcription biology and cancer therapy research. These encouraging remarks motivate us to refine and expand upon our findings further.

      All three reviewers acknowledged the increased processivity of SuperPol compared to its wild-type counterpart. However, two out of three question our claim that premature termination of transcription can regulate ribosomal RNA transcription. This conclusion is based on the SuperPol mutant increasing rRNA production. Proving that modulation of early transcription termination is used to regulate rRNA production under physiological conditions is beyond the scope of this study. Therefore, we propose to change the title of this manuscript to focus on what we have unambiguously demonstrated:

      “Ribosomal RNA synthesis by RNA polymerase I is subjected to premature termination of transcription”.

      Reviewer 1’s main criticism centers on the use of the CRAC technique in our study. While we address this point in detail below, we would like to emphasize that, although we agree with the reviewer’s comments regarding its application to Pol II studies, by limiting contamination with mature rRNA, CRAC remains the only suitable method for studying Pol I elongation over the entire transcription unit. All other methods are massively contaminated with fragments of mature RNA, which prevents any quantitative analysis of read distribution within rDNA. This perspective is widely accepted within the Pol I research community, as CRAC provides a robust approach to capturing transcriptional dynamics specific to Pol I activity.

      We hope that these findings will resonate with the readership of your journal and contribute significantly to advancing discussions in transcription biology and related fields.

      Description of the planned revisions:

      Despite numerous text modifications (see below), we agree that one major point of discussion is the consequence of increased processivity in the SuperPol mutant on the “quality” of produced rRNA. Reviewer 3 suggested comparisons with other processive alleles, such as the rpb1-E1103G mutant of the RNAPII subunit (Malagon et al., 2006). This comparison has already been addressed by the Schneider lab (Viktorovskaya OV, Cell Rep., 2013 - PMID: 23994471), which explored Pol II (rpb1-E1103G) and Pol I (rpa190-E1224G). The rpa190-E1224G mutant revealed enhanced pausing in vitro, highlighting key differences between Pol I and Pol II catalytic rate-limiting steps (see David Schneider's review on this topic for further details).

      Reviewers 2 and 3 suggested that a decreased efficiency of cleavage upon backtracking might imply an increased error rate in SuperPol compared to the wild-type enzyme. Pol I mutants with decreased rRNA cleavage have been characterized previously and showed an increased error rate. We have already started to address this point. Preliminary results from in vitro experiments suggest that SuperPol mutants exhibit an elevated error rate during transcription. However, these findings remain preliminary and require further experimental validation to confirm their reproducibility and robustness. We propose to consolidate these data and incorporate them into the manuscript to address this question comprehensively. This could provide valuable insights into the mechanistic differences between SuperPol and the wild-type enzyme. SuperPol is the first Pol I mutant described with an increased processivity in vitro and in vivo, and we agree that this might be at the cost of a decreased fidelity.

      Regulatory aspect of the process:

      To address the reviewer’s remarks, we propose to test our model by performing experiments that would evaluate PTT levels in Pol I mutants or under different growth conditions. These experiments would provide crucial data to support our model, which suggests that PTT is a regulatory element of Pol I transcription. By demonstrating how PTT varies with environmental factors, we aim to strengthen the hypothesis that premature termination plays an important role in regulating Pol I activity.

      We propose revising the title and conclusions of the manuscript. The updated version will better reflect the study's focus and temper claims regarding the regulatory aspects of termination events, while maintaining the value of our proposed model.

      Description of the revisions that have already been incorporated in the transferred manuscript:

      Some very important modifications have now been incorporated:

      Statistical Analyses and CRAC Replicates:

      Unlike reviewers 2 and 3, reviewer 1 suggests that we did not analyze the results statistically. In fact, the CRAC analyses were conducted in biological triplicate, ensuring robustness and reproducibility. The statistical analyses are presented in Figure 2C, which highlights significant findings supporting the fact that the WT Pol I and SuperPol distribution profiles are different. The CRAC replicates exhibit a high correlation, and we confirmed a significant effect in each region of interest (5’ETS, 18S.2, 25S.1 and 3’ ETS, Figure 1) to confirm consistency across experiments. Finally, we took care not to overinterpret the results, maintaining a rigorous and cautious approach in our analysis to ensure accurate conclusions.

      CRAC vs. Net-seq:

      Reviewer 1 asks us to comment on the differences between CRAC and Net-seq. Both methods complement each other but serve different purposes depending on the biological question and the context of the transcription analysis. Net-seq was originally designed for Pol II analysis. It captures nascent RNAs but does not eliminate mature ribosomal RNAs (rRNAs), leading to high levels of contamination. While this is manageable for Pol II analysis (in silico elimination of reads corresponding to rRNAs), it poses a significant problem for Pol I due to the dominance of rRNAs (60% of total RNAs in yeast), which share sequences with nascent Pol I transcripts. As a result, large Net-seq peaks are observed at mature rRNA extremities (Clarke 2018, Jacobs 2022). This limits the interpretation of the results to the short-lived pre-rRNA species. In contrast, CRAC has been specifically adapted by the laboratory of David Tollervey to map Pol I distribution while minimizing contamination from mature rRNAs (the CRAC protocol used exclusively recovers RNAs with 3′ hydroxyl groups that represent endogenous 3′ ends of nascent transcripts, thus removing RNAs with a 3’-phosphate, found in mature rRNAs). This makes CRAC more suitable for studying Pol I transcription, including polymerase pausing and distribution along rDNA, providing a quantitative dataset for the entire rDNA gene.

      CRAC vs. Other Methods:

      Reviewer 1 suggests using GRO-seq or TT-seq, but the experiments in Figure 2 aim to assess the distribution profile of Pol I along the rDNA, which requires a method optimized for this specific purpose. While GRO-seq and TT-seq are excellent for measuring RNA synthesis and cotranscriptional processing, they rely on Sarkosyl treatment to permeabilize cellular and nuclear membranes. Sarkosyl is known to artificially induce polymerase pausing and to inhibit RNase activities that are involved in the process. To avoid these artifacts, CRAC analysis is a direct and fully in vivo approach. In a CRAC experiment, cells are grown exponentially in rich media and arrested via rapid cross-linking, providing precise and artifact-free data on Pol I activity and pausing.

      Pol I ChIP Signal Comparison:

      The ChIP experiments previously published in Darrière et al. lack the statistical depth and resolution offered by our CRAC analyses. The detailed results obtained through CRAC would have been impossible to detect using classical ChIP. The current study provides a more refined and precise understanding of Pol I distribution and dynamics, highlighting the advantages of CRAC over traditional methods in addressing these complex transcriptional processes.

      BMH-21 Effects:

      As highlighted by Reviewer 1, the effects of BMH-21 observed in our study differ slightly from those reported in earlier work (Ref Schneider 2022), likely due to variations in experimental conditions, such as methodologies (CRAC vs. Net-seq), as discussed earlier. We also identified variations in the response to BMH-21 treatment associated with differences in cell growth phases and/or cell density. These factors likely contribute to the observed discrepancies, offering a potential explanation for the variations between our findings and those reported in previous studies. In our approach, we prioritized reproducibility by carefully controlling BMH-21 experimental conditions to mitigate these factors. These variables can significantly influence results, potentially leading to subtle discrepancies. Nevertheless, the overall conclusions regarding BMH-21's effects on WT Pol I are largely consistent across studies, with differences primarily observed at the nucleotide resolution. This is a strength of our CRAC-based analysis, which provides precise insights into Pol I activity.

      We will address these nuances in the revised manuscript to clarify how such differences may impact results and provide context for interpreting our findings in light of previous studies.

      Minor points:

      Reviewer #1:

      In general, the writing style is not clear, and there are some word mistakes or poor descriptions of the results, for example: 

      On page 14: "SuperPol accumulation is decreased (compared to Pol I)". 

      On page 16: "Compared to WT Pol I, the cumulative distribution of SuperPol is indeed shifted on the right of the graph." 

      We clarified and improved the overall writing style in response to the reviewer’s comment.

      There are also issues with the literature, for example: Turowski et al, 2020a and Turowski et al, 2020b are the same article (preprint and peer-reviewed). Is there any reason to include both references? Please, double-check the references.  

      This was corrected in this version of the manuscript.

      In the manuscript, 5S rRNA is mentioned as an internal control for TMA normalisation. Why are Figure 1C data normalised to 18S rRNA instead of 5S rRNA? 

      Data are indeed normalized relative to the 5S rRNA, but the value for the 18S rRNA is arbitrarily set to 100%.
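
      One way to read this reply is as a two-step calculation: each probe signal is first divided by the 5S signal from the same sample, and the resulting ratios are then rescaled so that the 18S value of a chosen reference sample equals 100%. The sketch below illustrates that arithmetic only; the sample names, the numbers, and the choice of the wild-type sample as the reference are hypothetical and not taken from the manuscript.

```python
def normalize_signals(samples, reference_sample,
                      reference_probe="18S", control_probe="5S"):
    """Normalize probe signals to an internal control, then rescale to 100%.

    samples: {sample_name: {probe_name: raw_signal}}
    Returns the same structure without the control probe, scaled so that
    reference_probe in reference_sample equals 100.
    """
    # Step 1: divide every probe by the internal control of its own sample.
    ratios = {
        name: {probe: value / probes[control_probe]
               for probe, value in probes.items() if probe != control_probe}
        for name, probes in samples.items()
    }
    # Step 2: one common scale factor, so samples remain comparable.
    scale = 100.0 / ratios[reference_sample][reference_probe]
    return {name: {probe: ratio * scale for probe, ratio in probe_ratios.items()}
            for name, probe_ratios in ratios.items()}


if __name__ == "__main__":
    # Hypothetical raw signals (arbitrary units), for illustration only.
    raw = {
        "WT": {"5S": 2.0, "18S": 4.0, "25S": 6.0},
        "mutant": {"5S": 1.0, "18S": 3.0, "25S": 4.0},
    }
    print(normalize_signals(raw, reference_sample="WT"))
```

      Under this reading, plotting values relative to 18S does not mean that 18S served as the loading control; it only fixes the unit of the y-axis.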

      Figure 4 should be a supplementary figure, and Figure 7D doesn't have a y-axis labelling. 

      The presence of all Pol I specific subunits (Rpa12, Rpa34 and Rpa49) is crucial for the enzymatic assays we performed. In the absence of these subunits (which can vary depending on the purification batch), Pol I pausing, cleavage and elongation are known to be affected. To strengthen our conclusion, we really wanted to show the subunit composition of the purified enzyme. This important control should be shown, but can indeed be shown in a supplementary figure if desired.

      The y-axis in Figure 7D is now correctly labelled.

      In Figure 7C, BMH-21 treatment causes the accumulation of ~140bp rRNA transcripts only in SuperPol-expressing cells that are Rrp6-sensitive (line 6 vs line 8), suggesting that BMH-21 treatment does affect SuperPol. Could the author comment on the interpretation of this result?

      The 140 nt product is a degradation fragment resulting from trimming, which explains its lower accumulation in the absence of Rrp6. BMH-21 significantly affects WT Pol I transcription but also has a mild effect on SuperPol transcription. As a result, the 140 nt product accumulates under these conditions.

      Reviewer #2:

      pp. 14-15: The authors note local differences in peak detection in the 5'-ETS among replicates, preventing a nucleotide-resolution analysis of pausing sites. Still, they report consistent global differences between wild-type and SuperPol CRAC signals in the 5'ETS (and other regions of the rDNA). These global differences are clear in the quantification shown in Figures 2B-C. A simpler statement might be less confusing, avoiding references to a "first and second set of replicates" 

      As suggested by the reviewer, the statement has been simplified in this version of the manuscript.

      Figures 2A and 2C: Based on these data and quantification, it appears that SuperPol signals in the body and 3' end of the rDNA unit are higher than those in the wild type. This finding supports the conclusion that reduced pausing (and termination) in the 5'ETS leads to an increased Pol I signal downstream. Since the average increase in the SuperPol signal is distributed over a larger region, this might also explain why even a relatively modest decrease in 5'ETS pausing results in higher rRNA production. This point merits discussion by the authors. 

      We agree that this is a very important discussion of our results. Transcription is a very dynamic process in which paused polymerases are easily detected using the CRAC assay. Elongating polymerases are distributed over a much larger gene body, and even a small amount of polymerase detected in the gene body can represent a very large amount of rRNA synthesis. This point is of paramount importance and, as suggested by the reviewer, is now discussed in detail.

      A decreased efficiency of cleavage upon backtracking might imply an increased error rate in SuperPol compared to the wild-type enzyme. Have the authors observed any evidence supporting this possibility? 

      The reviewer suggested that a decreased efficiency of cleavage upon backtracking might imply an increased error rate in SuperPol compared to the wild-type enzyme. We thank Reviewer #2 for pointing this out; in our opinion, this is an important point that should be added to the manuscript. We have now included new data (panels 5G, 5H and 5I) in the manuscript showing that SuperPol in vitro exhibits an increased error rate compared to the WT enzyme. From these results obtained in vitro, we concluded that SuperPol shows reduced nascent transcript cleavage, associated with more efficient transcript elongation, but to the detriment of transcriptional fidelity.

      pp. 15 and 22: Premature transcription termination as a regulator of gene expression is well-documented in yeast, with significant contributions from the Corden, Brow, Libri, and Tollervey labs. These studies should be referenced along with relevant bacterial and mammalian research.

      As suggested by the reviewer, we have referenced these studies.

      p. 23: "SuperPol and Rpa190-KR have a synergistic effect on BMH-21 resistance." A citation should be added for this statement. 

      This represents some unpublished data from our lab. KR and SuperPol are the only two known mutants resistant to BMH-21. We observed that the resistance conferred by the two alleles is synergistic, with a much higher resistance to BMH-21 in the double mutant than in each single mutant (data not shown). Comparing their resistance mechanisms is a very important point, and we could provide these data upon request. This was added to the statement.

      p. 23: "The released of the premature transcript" - this phrase contains a typo 

      This is now corrected.

      Reviewer #3:

      Figure 1B: it would be opportune to separate the technique's schematic representation from the actual data. Concerning the data, would the authors consider adding an experiment with rrp6D cells? Some RNAs could be degraded even in such short period of time, as even stated by the authors, so maybe an exosome depleted background could provide a more complete picture. Could also the authors explain why the increase is only observed at the level of 18S and 25S? To further prove the robustness of the Pol I TMA method could be good to add already characterized mutations or other drugs to show that the technique can readily detect also well-known and expected changes. 

      The precise objective of this experiment is to avoid the use of the Rrp6 mutant. Under these conditions, we prevent the accumulation of transcripts that would result from a maturation defect. While it is possible to conduct the experiment with the Rrp6 mutant, it would be impossible to draw reliable conclusions due to this artificial accumulation of transcripts.

      Figure 1C: the NTS1 probe signal is missing (it is referenced in Figure 1A but not listed in the Methods section or the oligo table). If this probe was unused, please correct Figure 1A accordingly. 

      We corrected Figure 1A.  

      Figure 2A: the RNAPI occupancy map by CRAC is hard to interpret. The red color (SuperPol) is stacked on top of the blue line, and we are not able to observe the signal of the WT for most of the position along the rDNA unit. It would be preferable to use some kind of opacity that allows to visualize both curves. Moreover, the analysis of the behavior of the polymerase is always restricted to the 5'ETS region in the rest of the manuscript. We are thus not able to observe whether termination events also occur in other regions of the rDNA unit. A Northern blot analysis displaying higher sizes would provide a more complete picture. 

      We addressed this point to make the figure more visually informative. In Northern Blot analysis, we use a TSS (Transcription Start Site) probe, which detects only transcripts containing the 5' extremity. Due to co-transcriptional processing, most of the rRNA undergoing transcription lacks its 5' extremity and is not detectable using this technique. We have the data, but it does not show any difference between Pol I and SuperPol. This information could be included in the supplementary data if asked.

      "Importantly, despite some local variations, we could reproducibly observe an increased occupancy of WT Pol I in 5'-ETS compared to SuperPol (Figure 1C)." should be Figure 2C. 

      Thanks for pointing out this mistake. It has been corrected.

      Figure 3D: most of the difference in the cumulative proportion of CRAC reads is observed in the region ~750 to 3000. In line with my previous point, I think it would be worth exploring also termination events beyond the 5'-ETS region. 

      We agree that such an analysis would have been interesting. However, with the exception of the pre-rRNA starting at the transcription start site (TSS) studied here, any cleaved rRNA at its 5' end could result from premature termination and/or abnormal processing events. Exploring the production of other abnormal rRNAs produced by premature termination is a project in itself, beyond this initial work aimed at demonstrating the existence of premature termination events in ribosomal RNA production.

      Figure 4: should probably be provided as supplementary material. 

      As mentioned earlier (see comments), the presence of all Pol I specific subunits (Rpa12, Rpa34 and Rpa49) is crucial for the enzymatic assays we performed. This important control should be shown, but can indeed be shown in a supplementary figure if desired.

      "While the growth of cells expressing SuperPol appeared unaffected, the fitness of WT cells was severely reduced under the same conditions." I think the growth of cells expressing SuperPol is slightly affected. 

      We agree with this comment and we modified the text accordingly.

      Figure 7D: the legend of the y-axis is missing as well as the title of the plot. 

      Legend of the y-axis and title of the plot are now present.

      The statements concerning BMH-21, SuperPol and Rpa190-KR in the Discussion section should be removed, or data should be provided.

      This was discussed previously. See comment above.

      Some references are missing from the Bibliography, for example Merkl et al., 2020; Pilsl et al., 2016a, 2016b. 

      The bibliography is now fixed.

      Description of analyses that authors prefer not to carry out:

      Does the SuperPol mutant produce more functional rRNAs?

      As Reviewer 1 requested, we agree that this point requires clarification. In cells expressing SuperPol, a higher steady state of (pre)-rRNAs is only observed in the absence of the degradation machinery, suggesting that overproduced rRNAs are rapidly eliminated. We know that (pre)-rRNAs are unable to accumulate in the absence of ribosomal proteins and/or Assembly Factors (AF). Consequently, overproducing rRNAs would not be sufficient to increase ribosome content. This specific point is being further addressed in our lab but is beyond the scope of this article.

      Is premature termination coupled with rRNA processing?

      We appreciate the reviewer’s insightful comments. The suggested experiments regarding the UTP-A complex's regulatory potential are valuable and ongoing in our lab, but they extend beyond the scope of this study and are not suitable for inclusion in the current manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Liu et al. present glmSMA, a network-regularized linear model that integrates single-cell RNA-seq data with spatial transcriptomics, enabling high-resolution mapping of cellular locations across diverse datasets. Its dual regularization framework (L1 for sparsity and generalized L2 via a graph Laplacian for spatial smoothness) demonstrates robust performance and offers novel tools for spatial biology, despite some gaps in fully addressing spatial communication.
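
      As a rough illustration of the dual regularization described in this summary, a generic network-regularized least-squares objective combines a data-fit term, an L1 penalty, and a graph-Laplacian quadratic penalty. The sketch below evaluates such an objective with NumPy; it is not glmSMA's implementation, and the variable names, the squared-error loss, and the unnormalized Laplacian are assumptions made for illustration.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalized graph Laplacian L = D - A for a symmetric adjacency matrix."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def network_regularized_objective(X, y, beta, laplacian, lam1, lam2):
    """0.5 * ||y - X beta||^2 + lam1 * ||beta||_1 + lam2 * beta^T L beta."""
    residual = y - X @ beta
    fit = 0.5 * residual @ residual            # squared-error data fit
    sparsity = lam1 * np.abs(beta).sum()       # L1 term: favors few nonzero weights
    smoothness = lam2 * beta @ laplacian @ beta  # penalizes differences across graph edges
    return fit + sparsity + smoothness


if __name__ == "__main__":
    # Hypothetical 3-node chain graph (e.g. three neighboring spatial spots).
    A = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    L = graph_laplacian(A)
    X = np.eye(3)
    y = np.array([1.0, 0.0, 0.0])
    beta = np.array([1.0, 0.0, 0.0])
    print(network_regularized_objective(X, y, beta, L, lam1=0.5, lam2=2.0))
```

      The Laplacian term equals the sum of squared differences between coefficients on connected nodes, which is why it rewards coefficient vectors that vary smoothly over the graph.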

      Overall, the manuscript is commendable for its comprehensive benchmarking across different spatial omics platforms and its novel application of regularized linear models for cell mapping. I think this manuscript can be improved by addressing method assumptions, expanding the discussion on feature dependence and cell type-specific biases, and clarifying the mechanism of spatial communication.

      The conclusions of this paper are mostly well supported by data, but some aspects of model development and performance evaluation need to be clarified and extended.

      We are thankful for the positive comments and have made changes following the reviewer's advice, as detailed below.

      (1) What were the assumptions made behind the model? One of them could be the linear relationship between cellular gene expression and spatial location. In complex biological tissues, non-linear relationships could be present, and this would also vary across organ systems and species. Similarly, with regularization parameters, they can be tuned to balance sparsity and smoothness adequately but may not hold uniformly across different tissue types or data quality levels. The model also seems to assume independent errors with normal distribution and linear additive effects - a simplification that may overlook overdispersion or heteroscedasticity commonly observed in RNA-seq data.

      Thank you for this comment. We acknowledge that the non-linear relationships can be present in complex tissues and may not be fully captured by a linear model. 

      Our choice of a linear model was guided by an investigation of the relationship in the current datasets, which include intestinal villus, mouse brain, and fly embryo. There is a linear correlation between expression distance and physical distance [Nitzan et al]. Within a given anatomical structure, cells in closer proximity exhibit more similar expression patterns (Fig. 3c). In tissues where non-linear relationships are more prevalent, such as the human PDAC sample, our mapping results remain robust. We acknowledge that we have not yet tested our algorithm in highly heterogeneous regions like the liver, and we plan to include such analyses in future work if necessary.

      Regarding the regularization parameters, we agree that the balance between sparsity and smoothness is sensitive to tissue-specific variation and data quality. In our current implementation, we explored a range of values to find robust defaults. Supplementary Figure 7 illustrates the regularization path for cell assignment in the fly embryo.  

      The choice of L1 and L2 regularization parameters is crucial for balancing sparsity and smoothness in spatial mapping. 

      For Structured Tissues (brain):

      Moderate L1 to ensure cells are localized.

      Small to moderate L2 to maintain local smoothness without blurring distinct regions.

      For Less Structured (PDAC):

      Slightly lower L1 to allow cells to be associated with multiple regions if boundaries are ambiguous.

      Higher L2 to stabilize mappings in noisy or mixed regions.
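
      The trade-off described above can be illustrated with a minimal sketch of a network-regularized linear model. This is not the glmSMA implementation: the objective scaling (the 0.5 factors), the proximal-gradient solver, and all names (`fit_network_regularized`, `chain_laplacian`, `lam1`, `lam2`) are assumptions made for illustration. It only shows how an L1 penalty induces sparse spot weights while a graph-Laplacian penalty smooths weights across neighbouring spots.

```python
import numpy as np

def chain_laplacian(n):
    """Graph Laplacian of n spots connected in a line (toy neighbourhood graph)."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage towards zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_network_regularized(X, y, L, lam1, lam2, n_iter=500):
    """Minimise 0.5*||y - X w||^2 + lam1*||w||_1 + 0.5*lam2 * w^T L w.

    X : (genes, spots) reference expression (hypothetical layout)
    y : (genes,) expression of one cell
    L : (spots, spots) graph Laplacian of the spot neighbourhood graph
    Solved by proximal gradient descent (ISTA); an assumed solver choice.
    """
    w = np.zeros(X.shape[1])
    # Step size from the Lipschitz constant of the smooth part of the objective.
    step = 1.0 / np.linalg.norm(X.T @ X + lam2 * L, 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + lam2 * (L @ w)
        w = soft_threshold(w - step * grad, step * lam1)
    return w

# Toy data: 5 spots on a line, identity "reference" so weights are easy to read.
X = np.eye(5)
y = np.array([0.0, 1.0, 3.0, 1.0, 0.0])
L = chain_laplacian(5)

w_sparse = fit_network_regularized(X, y, L, lam1=1.5, lam2=0.0)  # stronger L1
w_smooth = fit_network_regularized(X, y, L, lam1=0.0, lam2=2.0)  # stronger L2
```

      In this toy setting, raising `lam1` concentrates the weights on fewer spots, while raising `lam2` spreads them over neighbouring spots, which mirrors the qualitative guidance given for structured and less structured tissues.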

      (2) The performance of glmSMA is likely sensitive to the number and quality of features used. With too few features, the model may struggle to anchor cells correctly due to insufficient discriminatory power, whereas too many features could lead to overfitting unless appropriately regularized. The manuscript briefly acknowledges this issue, but further systematic evaluation of how varying feature numbers affect mapping accuracy would strengthen the claims, particularly in settings where marker gene availability is limited. A simple way to show some of this would be testing on multiple spatial omics (imaging-based) platforms with varying panel sizes and organ systems. Related to this, based on the figures, it also seems like the performance varies by cell type. What are the factors that contribute to this? Variability in expression levels, RNA quantity/quality? Biases in the panel? Personally, I am also curious how this model can be used similarly/differently if we have a FISH-based, high-plex reference atlas. Additional explanation around these points would be helpful for the readers.

      Thank you for this thoughtful comment. The performance of our method is indeed sensitive to the number and quality of selected features. To optimize feature selection, we employed multiple strategies, including Moran’s I statistic, identification of highly variable genes, and the Seurat pipeline to detect anchor genes linking the spatial transcriptomics data with the reference atlas. The number of selected markers depends on the quality of the data. For high-quality datasets, fewer than 100 markers are typically sufficient for prediction. To select marker genes, we applied the following optional strategies:

      (1) Identifying highly variable genes (HVGs).

      (2) Calculating Moran’s I scores for all genes to assess spatial autocorrelation.

      (3) Generating anchor genes based on the integration of the reference atlas and scRNA-seq data using Seurat.
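
      As an illustration of strategy (2), Moran's I for a single gene can be sketched as follows. The binary neighbour weights and the helper names (`morans_i`, `chain_adjacency`) are assumptions for this example; the response does not state how the spatial weight matrix was built.

```python
import numpy as np

def chain_adjacency(n):
    """Binary neighbour weights for n spots arranged in a line (toy example)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return W

def morans_i(values, weights):
    """Moran's I = (n / sum(w)) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = values - values.mean()
    denom = np.sum(z ** 2)
    total_weight = weights.sum()
    if denom == 0.0 or total_weight == 0.0:
        return float("nan")  # undefined for constant expression or an empty graph
    return float((len(values) / total_weight) * (z @ weights @ z) / denom)

W = chain_adjacency(4)
clustered = morans_i([1, 1, 0, 0], W)    # neighbouring spots share expression
alternating = morans_i([1, 0, 1, 0], W)  # neighbouring spots differ
```

      Genes with a high positive score are spatially clustered and are therefore plausible marker candidates; ranking genes by this score is one possible selection rule, not necessarily the one used in the study.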

      We evaluated our method across diverse tissue types and platforms—including Slide-seq, 10x Visium, and Virtual-FISH—which represent both sequencing-based and imaging-based spatial transcriptomics technologies. Our model consistently achieved strong performance across these settings. It's worth noting that the performance of other methods, such as CellTrek [Wei et al] and novoSpaRc [Nitzan et al], also depends heavily on feature selection. In particular, performance degrades substantially when fewer features are used. For fair comparison across different methods, the same set of marker genes was used. Under this condition, our method outperformed the others based on KL divergence (Fig. 2b, Fig. 5g). 

      To assess the effect of marker gene quantity, we randomly selected subsets of 2,000, 1,500, 1,000, 700, 500, and 200 markers from the original set. As the number of markers decreases, mapping performance declines, which is expected due to the reduction in available spatial information. This result underscores the general dependence of spatial mapping accuracy on both the number and quality of informative marker genes (Supplementary Fig. 10).

      We do not believe that the observed performance is directly influenced by cell type composition. Major cell types are typically well-defined, and rare cell types comprise only a small fraction of the dataset. For these rare populations, a single misclassification can disproportionately impact metrics like KL divergence due to small sample size. However, this does not necessarily indicate a systematic cell type–specific bias in the mapping. We incorporated a high-resolution Slide-seq dataset from the mouse hippocampus to evaluate the influence of cell type composition on the algorithm’s performance [Stickels et al., 2020]. Most cell types within the CA1, CA2, CA3, and DG regions were accurately mapped to their original anatomical locations (Fig. 5e, f, g).

      (3) Application 3 (spatial communication) in the graphical abstract appears relatively underdeveloped. While it is clear that the model infers spatial proximities, further explanation of how these mappings translate into insights into cell-cell communication networks would enhance the biological relevance of the findings.

      Thank you for this valuable feedback. We agree that further elaboration on the connection between spatial proximity and cell–cell communication would enhance the biological interpretation of our results. Our current model focuses on inferring spatial relationships; we may add cell–cell communication analyses in the future.

      (4) What is the final resolution of the model outputs? I am assuming this is dictated by the granularity of the reference atlas and the imposed sparsity via the L1 norm, but if there are clear examples that would be good. In figures (or maybe in practice too), cells seem to be assigned to small, contiguous patches rather than pinpoint single-cell locations, which is a pragmatic compromise given the inherent limitations of current spatial transcriptomics technologies. Clarification on the precise spatial scale (e.g., pixel or micrometer resolution) and any post-mapping refinement steps would be beneficial for the users to make informed decisions on the right bioinformatic tools to use.

      Thank you for the comment. For each cell, our algorithm generates a probability vector that indicates its likely spatial assignment along with coordinate information. In our framework, each cell is mapped to one or more spatial spots with associated probabilities. Depending on the amount of regularization through L1 and L2 norms, a cell may be localized to a small patch or distributed over a broader domain (Supplementary Fig. 5 & 7). For the 10x Visium data, we applied a repelling algorithm to enhance visualization [Wei et al]. If a cell’s original location is already occupied, it is reassigned to a nearby neighborhood to avoid overlap. The users can also see the entire regularization path by varying the penalty terms. 

      Nitzan M, Karaiskos N, Friedman N, Rajewsky N. Gene expression cartography. Nature. 2019;576(7785):132-137. doi:10.1038/s41586-019-1773-3

      Wei R, et al. Spatial charting of single-cell transcriptomes in tissues. Nature Biotechnology. 2022;40(8):1190-1199. doi:10.1038/s41587-022-01233-1

      Stickels RR, et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nature Biotechnology. 2020;39(3):313-319. doi:10.1038/s41587-020-0739-1

      Reviewer #2 (Public review):

      Summary:

      The author proposes a novel method for mapping single-cell data to specific locations with higher resolution than several existing tools.

      Strengths:

      The spatial mapping tests were conducted on various tissues, including the mouse cortex, human PDAC, and intestinal villus.

      Weakness:

      (1) Although the researchers claim that glmSMA seamlessly accommodates both sequencing-based and image-based spatial transcriptomics (ST) data, their testing primarily focused on sequencing-based ST data, such as Visium and Slide-seq. To demonstrate its versatility for spatial analysis, the authors should extend their evaluation to imaging-based spatial data.

      Thank you for the comment. We have tested our algorithm on the virtual FISH dataset from the fly embryo, which serves as an example of image-based spatial omics data (Fig. 4c). However, such datasets often contain a limited number of available genes. To address this, we will conduct additional testing on image-based data if needed. The Allen Brain Atlas provides high-quality ISH data, and we can select specific brain regions from this resource to further evaluate our algorithm if necessary [Lein et al]. Currently, we plan to focus more on the 10x Visium platform, as it supports whole-transcriptome profiling and offers a wide range of tissue samples for analysis.

      (2) The definition of "ground truth" for spatial distribution is unclear. A more detailed explanation is needed on how the "ground truth" was established for each spatial dataset and how it was utilized for comparison with the predicted distribution generated by various spatial mapping tools.

      Thank you for the comment. To clarify how ground truth is defined across different tissues, we provided the following details. Direct ground truth for cell locations is often unavailable in scRNA-seq data due to experimental constraints. To address this, we adopted alternative strategies for estimating ground truth in each dataset:

      10x Visium Data: We used the cell type distribution derived from spatial transcriptomics (ST) data as a proxy for ground truth. We then computed the KL divergence between this distribution and our model's predictions for performance assessment.

      Slide-seq Data: We validated predictions by comparing the expression of marker genes between the reconstructed and original spatial data.

      Fly Embryo Data: We used predicted cell locations from novoSpaRc as a reference for evaluating our algorithm.

      These strategies allowed us to evaluate model performance even in the absence of direct cell location data. In addition, we can apply multiple evaluation strategies within a single dataset.
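
      The KL-divergence comparison used for the 10x Visium data can be sketched as below. The direction of the divergence (reference relative to prediction), the smoothing constant `eps`, and the toy proportions are assumptions for illustration and are not taken from the manuscript.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, e.g. cell-type proportions in a region.

    A small eps is added before renormalising so that empty categories
    do not produce log(0) or division by zero.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical cell-type proportions for one anatomical region.
reference = [0.60, 0.30, 0.10]    # proxy ground truth from the ST data
good_map = [0.58, 0.31, 0.11]     # prediction close to the reference
poor_map = [0.20, 0.20, 0.60]     # prediction far from the reference

kl_good = kl_divergence(reference, good_map)
kl_poor = kl_divergence(reference, poor_map)
```

      A lower value indicates that the predicted distribution is closer to the reference. Because the measure is sensitive to small categories, a single misassigned cell in a rare population can shift it noticeably.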

      (3) In the analysis of spatial mapping results using intestinal villus tissue, only Figure 3d supports their findings. The researchers should consider adding supplemental figures illustrating the spatial distribution of single cells in comparison to the ground truth distribution to enhance the clarity and robustness of their investigation.

      Thank you for the comment. In the intestinal dataset, only six large domains were defined. As a result, the task for this dataset is relatively simple—each cell only needs to be assigned to one of the six domains. As the intestinal villus is a relatively simple tissue, most existing algorithms performed well on it. For this reason, we did not initially provide extensive details in the main text.

      (4) The spatial mapping tests were conducted on various tissues, including the mouse cortex, human PDAC, and intestinal villus. However, the original anatomical regions are not displayed, making it difficult to directly compare them with the predicted mapping results. Providing ground truth distributions for each tested tissue would enhance clarity and facilitate interpretation. For instance, in Figure 2a and Supplementary Figures 1 and 2, only the predicted mapping results are shown without the corresponding original spatial distribution of regions in the mouse cortex. Additionally, in Figure 3c, four anatomical regions are displayed, but it is unclear whether the figure represents the original spatial regions or those predicted by glmSMA. The authors are encouraged to clarify this by incorporating ground truth distributions for each tissue.

      Thank you for the comment. To improve visualization, we have included anatomical structures alongside the mapping results in the revised version, wherever such structures are available (e.g., mouse brain cortex, human PDAC sample, etc.). Major cell type assignments for the PDAC samples, along with anatomical structures, are shown in Supplementary Figure 9. Most of these cell types were correctly mapped to their corresponding anatomical regions.

      (5) The cell assignment results from the mouse hippocampus (Supplementary Figure 6) lack a corresponding ground truth distribution for comparison. DG and CA cells were evaluated solely based on the gene expression of specific marker genes. Additional analyses are needed to further validate the robustness of glmSMA's mapping performance on Slide-seq data from the mouse hippocampus.

      Thank you for the comment. The ground truth for DG and CA cells was not available. To better evaluate the model's performance, we computed the KL divergence between the original and predicted cell type distributions, following the same approach used for the 10x Visium dataset. We identified a higher-quality dataset for the mouse hippocampus and used it to evaluate our algorithm. Additionally, we employed KL divergence as an alternative strategy to validate and benchmark our results (Fig. 5e, f, g). Most CA cells, including CA1, CA2, and CA3 principal cells, were correctly assigned back to the CA region. Dentate principal cells were accurately mapped to the DG region (Fig. 5e, f).

      (6) The tested spatial datasets primarily consist of highly structured tissues with well-defined anatomical regions, such as the brain and intestinal villus. In other tissues, such as the liver, anatomical regions are not distinctly separated. Further evaluation of such tissues would help determine the method's broader applicability.

      Thank you for the insightful comment. We agree that many spatial datasets used in our study are from tissues with well-defined anatomical regions. To address the applicability of glmSMA in tissues without clearly separated anatomical structures, we applied glmSMA to the Drosophila embryo, which represents a tissue with relatively continuous spatial patterns and lacks well-demarcated anatomical boundaries compared to organs like the brain or intestinal villus.

      Despite this less structured spatial organization, glmSMA demonstrated robust performance in the fly embryo, accurately mapping cells to their correct spatial spots based on gene expression profiles. This result indicates that glmSMA is not strictly limited to highly structured tissues and can generalize to tissues with more continuous or gradient-like spatial architectures. These results suggest that glmSMA has broader applicability beyond highly compartmentalized tissues.

      Lein, E., Hawrylycz, M., Ao, N. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176 (2007). https://doi.org/10.1038/nature05453

      Reviewer #3 (Public review):

      The authors aim to develop glmSMA, a network-regularized linear model that accurately infers spatial gene expression patterns by integrating single-cell RNA sequencing data with spatial transcriptomics reference atlases. Their goal is to reconstruct the spatial organization of individual cells within tissues, overcoming the limitations of existing methods that either lack spatial resolution or sensitivity.

      Strengths:

      (1) Comprehensive Benchmarking:

      Compared against CellTrek and novoSpaRc, glmSMA consistently achieved lower Kullback-Leibler divergence (KL divergence) scores, indicating better cell assignment accuracy.

      Outperformed CellTrek in mouse cortex mapping (90% accuracy vs. CellTrek's 60%) and provided more spatially coherent distributions.

      (2) Experimental Validation with Multiple Real-World Datasets:

      The study used multiple biological systems (mouse brain, Drosophila embryo, human PDAC, intestinal villus) to demonstrate generalizability.

      Validation through correlation analyses, Pearson's coefficient, and KL divergence support the accuracy of glmSMA's predictions.

      We thank reviewer #3 for their positive feedback and thoughtful recommendations.

      Weaknesses:

      (1) The accuracy of glmSMA depends on the selection of marker genes, which might be limited by current FISH-based reference atlases.

      We agree that the accuracy of glmSMA is influenced by the selection of marker genes, and that current FISH-based reference atlases may offer a limited gene set. To address this, we incorporate multiple feature selection strategies, including highly variable genes and spatially informative genes (e.g., via Moran’s I), to optimize performance within the available gene space. As more comprehensive reference atlases become available, we expect the model’s accuracy to improve further.

      (2) glmSMA operates under the assumption that cells with similar gene expression profiles are likely to be physically close to each other in space, which may not be true in various heterogeneous environments.

      Thank you for raising this important point. We agree that glmSMA operates under the assumption that cells with similar gene expression profiles tend to be spatially proximal, and this assumption may not strictly hold in highly heterogeneous tissues where spatial organization is less coupled to transcriptional similarity.

      To address this concern, we specifically tested glmSMA on human PDAC samples, which represent moderately heterogeneous environments characterized by complex tumor microenvironments, including a mixture of ductal cells, cancer cells, stromal cells, and other components. Despite this heterogeneity, glmSMA successfully mapped major cell types to their expected anatomical regions, demonstrating that the method is robust even in the presence of substantial cellular diversity and spatial complexity.

      This result suggests that while glmSMA relies on the assumption of spatial-transcriptomic correlation, the method can tolerate a reasonable degree of spatial heterogeneity without a significant loss of performance. Nevertheless, we acknowledge that in extremely disorganized or highly mixed tissues where transcriptional similarity is decoupled from spatial proximity, the performance may be affected.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) Their first major claim is that fluid flows alone must be quite strong in order to fragment the cyanobacterial aggregates they have studied. With their rheological chamber, they explicitly show that energy dissipation rates must exceed "natural" conditions by multiple orders of magnitude in order to fragment lab strain colonies, and even higher to disrupt natural strains sampled from a nearby freshwater lake. This claim is well-supported by their experiments and data.

      We thank the reviewer for this positive comment. We fully agree, as our fragmentation experiments on division-formed colonies clearly demonstrate their strong mechanical resistance in naturally occurring flows.

      (2) The authors then claim that the fragmentation of aggregates due to fluid flows occurs through erosion of small pieces. Because their experimental setup does not allow them to explicitly observe this process (for example, by watching one aggregate break into pieces), they implement an idealized model to show that the nature of the changes to the size histogram agrees with an erosion process. However, in Figure 2C there is a noticeable gap between their experiment and the prediction of their model. Additionally, in a similar experiment shown in Figure S6, the experiment cannot distinguish between an idealized erosion model and an alternative, an idealized binary fission model where aggregates split into equal halves. For these reasons, this claim is weakened.

      The two idealized models of colony fragmentation, namely erosion of single cells and fragmentation into equal sizes (or binary fission), lead to distinguishable final size distributions. We believe that our experiments for division-formed colonies support the hypothesis of the erosion mechanism. Specifically, Figure 2E shows that colony fragmentation resulted in a decrease of large colonies and a strong increase of single cells and dimers (two cells). In our view, the strong increase of single cells and dimers provides quite convincing (but indirect) evidence supporting the erosion mechanism. This is described on lines 112-121. To further address the reviewer’s concern, we have included in the revised version of Figure 2 (panels B and D) a direct comparison between these two fragmentation models for large division-formed colonies fragmented at a high dissipation rate of ε = 5.8 m<sup>2</sup>/s<sup>3</sup>. Furthermore, we have included the new Supplementary Figure S9, which details the model predictions for the colony size distribution at various time points.

      The ideal equal fragments model (i.e., where every fracture event produces two identical fragments with half the original biovolume) does not capture the biovolume transfer from large colonies to single cells, as observed for the experimental results in panel D of Figure 2 and panel E of Figure S9. In contrast, the erosion model, in panel D of Figure 2 and panel D of Figure S9, provides a good prediction of the experimental results within the experimental uncertainty. The different fragmentation models are discussed in lines 226-228 of the revised manuscript and lines 865-873 of the SI.
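
      The qualitative difference between the two idealized models can be seen in a toy simulation. This sketch is an assumption-laden simplification (one fracture event per colony per step, sizes counted in cells, no kinetics or size dependence) and is not the model used in the manuscript; the function names are hypothetical.

```python
def erosion_step(sizes):
    """Each colony larger than one cell sheds a single cell."""
    out = []
    for s in sizes:
        if s > 1:
            out.extend([s - 1, 1])
        else:
            out.append(s)
    return out

def equal_fragments_step(sizes):
    """Each colony larger than one cell splits into two (near-)equal halves."""
    out = []
    for s in sizes:
        if s > 1:
            out.extend([s - s // 2, s // 2])
        else:
            out.append(s)
    return out

def fragment(sizes, step_fn, n_steps):
    """Apply a fragmentation rule n_steps times to a list of colony sizes."""
    for _ in range(n_steps):
        sizes = step_fn(sizes)
    return sizes

def single_cell_fraction(sizes):
    """Fraction of all cells that are present as single cells."""
    return sizes.count(1) / sum(sizes)

start = [64]  # one colony of 64 cells
eroded = fragment(start, erosion_step, 3)
halved = fragment(start, equal_fragments_step, 3)
```

      After the same number of steps, erosion leaves one large colony plus a growing pool of single cells, whereas equal fragmentation yields only mid-sized colonies and no single cells. This is the signature that distinguishes the two mechanisms in a size histogram.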

      (3) Their third major claim is that fluid flows only weakly cause cells to collide and adhere in a "coming together" process of aggregate formation. They test this claim in Figure 3, where they suspend single cells in their test chamber and stir them at moderate intensity, monitoring their size histogram. They show that the size histogram changes only slightly, indicating that aggregation is, by and large, not occurring at a high rate. Therefore, they lend support to the idea that cell aggregation likely does not initiate group formation in toxic cyanobacterial blooms. Additionally, they show that the median size of large colonies also does not change at moderate turbulent intensities. These results agree with previous studies (their own citation 25) indicating that aggregates in toxic blooms are clonal in nature. This is an important result and well-supported by their data, but only for this specific particle concentration and stirring intensity. Later, in Figure 5 they show a much broader range of particle concentrations and energy dissipation rates that they leave untested.

      We thank the reviewer for this positive comment. We agree that our experimental results show clear evidence that aggregated colonies have a weaker structure in comparison to division-formed colonies, thus supporting the hypothesis that clonal expansion is the main mechanism for colony formation under most natural settings. The range of energy dissipation rates of our experimental setup covers almost entirely the region for which aggregated and division-formed colonies differ in their fragmentation behavior (Zone III of Figure 5). Within this zone, aggregated colonies are fragmented and only the division-formed colonies are able to withstand the hydrodynamic stresses. Furthermore, we show that this fragmentation behavior has a low sensitivity to the total biovolume fraction, as displayed in the Supplementary Figures S2 and S4 and discussed in lines 151-154 and 160-163. We agree that our cone-and-plate setup covers a limited parameter range, and we have added a detailed discussion of these limitations in the revised manuscript, under section Materials and Methods in lines 462-473.

      (4) The fourth major result of the manuscript is displayed in Equation 8 and Figure 5, where the authors derive an expression for the ratio between the rate of increase of a colony due to aggregation vs. the rate due to cell division. They then plot this line on a phase map, altering two physical parameters (concentration and fluid turbulence) to show under what conditions aggregation vs. cell division are more important for group formation. Because these results are derived from relatively simple biophysical considerations, they have the potential to be quite powerful and useful and represent a significant conceptual advance. However, there is a region of this phase map that the authors have left untested experimentally. The lowest energy dissipation rate that the authors tested in their experiment seemed to be ε ≈ 10<sup>-2</sup> m<sup>2</sup>/s<sup>3</sup>, and the highest particle concentration they tested was 5 × 10<sup>-4</sup>, which means that the authors never tested Zone II of their phase map. Since this seems to be an important zone for toxic blooms (i.e. the "scum formation" zone), it seems the authors have missed an important opportunity to investigate this regime of high particle concentrations and relatively weak turbulent mixing.

      We agree with the reviewer that Zone (II) of Figure 5 is of great importance to dense bloom formation under wind mixing and that this parameter range was not covered by our experiments using a cone-and-plate shear flow. The measuring range of our device was motivated by engineering applications such as artificial mixing of eutrophic lakes using bubble plumes, as well as preliminary experiments which demonstrated that high levels of dissipation rate were required to achieve fragmentation. The range of dissipation rates that can be achieved by the cone-and-plate setup is limited at the lower end by the accumulation of colonies near the stagnation point at the conical tip and at the upper end by the spillage of fluid out of the chamber. We now discuss this measuring range in lines 462-473 of the revised manuscript.

      Although our setup does not cover Zone (II), we now refer to recent results in the literature for evidence of aggregation dominance in Zone (II). The experimental study of Wu et al. (2024) (reference number 64 of the revised manuscript) investigated the formation of Microcystis surface scum layers in wind-mixed mesocosms. Their study identified aggregation of colonies in the scum layer, resulting in increases of colony size at rates faster than cell division. These results agree with our model, and the parameter range investigated falls within Zone II. We have included in the revised version, lines 328-337, a detailed discussion elucidating the parameter range covered in our experiments and the findings of Wu et al. (2024).

      Other items that could use more clarity:

      (5) The authors rely heavily on size distributions to make the claims of their paper. Yet, how they generated those size distributions is not clearly shown in the text. Of primary concern, the authors used a correction function (Equation S1) to estimate the counts of different size classes in their image analysis pipeline. Yet, it is unclear how well this correction function actually performs, what kinds of errors it might produce, and how well it mapped to the calibration dataset the authors used to find the fit parameters.

      We agree with the reviewer that more details of the correction function should be included. We have included in the revised version of the Supporting Information, in lines 785-796, a more detailed explanation of the correction function. Furthermore, a direct comparison of raw and corrected histograms of the size distribution and its associated uncertainty is presented in the new Supplementary Figure S8.

      (6) Second, in their models they use a fractal dimension to estimate the number of cells in the group from the group radius, but the agreement between this fractal dimension fit and the data is not shown, so it is not clear how good an approximation this fractal dimension provides. This is especially important for their later derivation of the "aggregation-to-cell division" ratio (Equation 8)

      We agree with the reviewer that more details on the estimation of fractal dimension are needed. The revised version, under Materials and Methods in lines 508-515, now includes the detailed estimation procedure, the number of colonies analysed, and the associated uncertainty.
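
      For readers unfamiliar with this kind of estimate, a common way to obtain a fractal dimension is a log-log fit of cell number against colony radius. The scaling form N = k (R / a)^D, the symbols, and the synthetic data below are assumptions for illustration; the authors' actual procedure is the one described in their Materials and Methods.

```python
import numpy as np

def fit_fractal_dimension(radii, cell_counts, cell_radius):
    """Fit N = k * (R / a)**D by linear regression in log-log space.

    radii       : colony radii R (same units as cell_radius)
    cell_counts : number of cells N in each colony
    cell_radius : single-cell radius a
    Returns (D, k): the fitted fractal dimension and prefactor.
    """
    x = np.log(np.asarray(radii, dtype=float) / cell_radius)
    y = np.log(np.asarray(cell_counts, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit: y = slope * x + intercept
    return float(slope), float(np.exp(intercept))

# Synthetic colonies generated with D = 2.5 and k = 1 (not experimental data).
a = 2.5
radii = np.array([5.0, 10.0, 20.0, 40.0])
counts = (radii / a) ** 2.5

D_fit, k_fit = fit_fractal_dimension(radii, counts, a)
```

      With real measurements, the scatter of the points around the fitted line gives a direct view of how well a single fractal dimension describes the colonies, which is the agreement the reviewer asked to see.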

      Reviewer #1 (Recommendations For The Authors):

      In light of the weak evidence for claim #2 outlined above, I believe the paper would benefit from a more explicit comparison in Figure 2C of the two models - idealized erosion, and idealized binary fission. With such a comparison, the authors would have stronger footing to claim that one process is more important than the other.

      As mentioned in our answer above to comment #2 of the public review, we have included in the revised version of Figure 2 (panels B and D) a direct comparison between the erosion and equal fragments (binary fission) models for large division-formed colonies fragmented under ε = 5.8 m<sup>2</sup>/s<sup>3</sup>. The comparison is further detailed in the new Supplementary Figure S9 for representative time points. Only the erosion model can recover the biovolume transfer from large colonies to single cells, as observed for the experimental results in Figure 2D and further detailed in Figure S9D. We believe that the revised version of Figure 2 and the new Supplementary Figure S9 provide strong evidence in support of the erosion fragmentation model.

      Would the authors comment on their chosen range of experimental dissipation rates? For instance, was their goal more to investigate industrial/engineering applications where the goal is to disrupt the cyanobacteria, but not really typical natural conditions under which the groups might form?

      The choice of experimental dissipation rates in our experiment was such that it covers engineering applications such as artificial mixing of eutrophic lakes using bubble plumes. We have now clarified in the Introduction, on lines 37-39, that artificial mixing has been successfully applied in several lakes to suppress cyanobacterial blooms. Furthermore, we have now clarified in the caption of Figure 5 that the bars on the right side indicate typical values of dissipation rates induced by natural wind-mixing, bubble plumes in artificially mixed lakes, and laboratory-scale experiments such as cone-and-plate systems and stirred tanks. The dissipation rates induced by the bubble plumes in artificially mixed lakes could potentially fragment aggregated cyanobacterial colonies and thus disrupt bloom formation. However, our preliminary experiments demonstrated that high levels of dissipation rate were required to achieve fragmentation, therefore we’ve focused on the upper range of values (0.01 to 10 m<sup>2</sup>/s<sup>3</sup>).

      The dissipation rates generated by the cone-and-plate approach are indeed higher than the dissipation rates under typical natural conditions in lakes. We have now added a detailed discussion of the range of dissipation rates generated by the cone-and-plate approach in the revised manuscript, under section Materials and Methods in lines 462-473, where we also explain that these values are higher than the natural dissipation rates generated by wind action in lakes. However, the more generic insights obtained by our study, shown in Figure 5, are relevant for dissipation rates of natural lakes (e.g., Zone II). Therefore, in our discussion of Figure 5 we have now included the recent findings of Wu et al. (2024) (reference number [64] of the revised manuscript), who studied bloom formation of Microcystis in mesocosm experiments at dissipation rates representative of natural conditions; see also our reply to the next comment.

      The authors should consider testing the space of Zone II on their phase map, for instance at very high particle concentrations and even lower rotational speeds, in order to show that their derivations match experiments.

      Good point. As mentioned in our answer above to comment #4 of the public review, Zone II lies beyond the measuring range of our experimental setup. Instead, we refer to the recent study of Wu et al. (2024) (reference number [64] of the revised manuscript) which demonstrated that dense scum layers of Microcystis colonies are aggregation-dominated. These mesocosm experiments agree with our model predictions and their parameter range falls within Zone II. We have included in the revised version, lines 328-337, a detailed discussion where we elucidate the parameter range covered in our experiments and compare our predictions for Zone II with the recent findings of Wu et al. (2024).

      The authors should show their calibration data and fit for the correction function of equation S1. Additionally, you may consider showing "raw" and "corrected" histograms of the size distribution, to demonstrate exactly what corrections are made.

      As mentioned in our answer above to comment #5 of the public review, we have included in the revised version of the Supporting Information the new Supplementary Figure S8, which shows the raw and adjusted histograms of the size distribution, including the associated uncertainties. Furthermore, the correction function is now explained in detail in the new Supporting Information Text in lines 785-796.

      The authors might consider commenting on Figure S3 a bit more in the main text. Even at very high dissipation rates, the cyanobacterial groups don't plummet to size 1, but stay in an equilibrium around 10-20x the diameter of a single cell. What might this mean for industrial applications trying to break up the groups?

      We agree with the reviewer that further discussion of Figure S3, panels E and F, is warranted. In the revised version of the manuscript, under section Fragmentation of Microcystis colonies occurs through erosion in lines 133-137, we have now included a discussion of this figure. Figure S3F shows that more than 90% of the total biovolume ends up in the category “small colonies” (mostly single cells and dimers); hence, most of the initially large colonies do fragment to single cells or dimers. Only about 5-10% of the biovolume remains as “large colonies” of 10-20 cells. Although it is challenging to draw definitive conclusions about the behavior of these remaining large colonies, as they account for only a minor fraction of the suspension, one hypothesis is that variability in mechanical properties between colonies results in a subset of colonies exhibiting exceptional resistance even to very high dissipation rates (see lines 133-137).

      Minor comments:

      Typo Caption of Figure 2: Should read [m^2/s^3] for units

      Thanks for catching this typo. The units in the caption of Figure 2 have been corrected to [m^2/s^3].

      There is no Equation 10 in Materials and Methods as indicated in the rheology section.

      We thank the reviewer for pointing out the lack of clarity in this algebraic manipulation. In fact, the yield stress has to be substituted in the current Equation 11 (previously Eq.10), from which the critical dissipation rate must be substituted in Equation 3. The result is the critical colony size (l* = 2.8) mentioned in line 243 of the revised manuscript. The correct equation numbers and algebraic substitutions are now indicated in lines 241-243 of the revised version of the manuscript.

      Reviewer #2 (Public review):

      Especially the introduction seems to imply that shear force is a very important parameter controlling colony formation. However, if one looks at the results this effect is overall rather modest, especially considering the shear forces that these bacterial colonies may experience in lakes. The main conclusion seems that not shear but bacterial adhesion is the most important factor in determining colony size. As the importance of adhesion had been described elsewhere, it is not clear what this study reveals about cyanobacterial colonies that was not known before.

      We would like to emphasize several key findings that our study reveals about the impacts of fluid flow on cyanobacterial colonies:

      (I) Quantification of mechanical strength in cyanobacterial colonies: Our results demonstrate the high mechanical strength of cyanobacterial colonies, as evidenced by the requirement of high shear rates to achieve fragmentation. This was not previously known for cyanobacterial colonies. To this end, our study highlights the resilience of these colonies against naturally occurring flows and bridges the gap between theoretical assumptions about colony strength and experimentally measured mechanical properties.

      (II) The discovery that the mechanical strength of colonies differs between colonies formed by cell division and colonies formed by aggregation. This, too, was not previously known for cyanobacterial colonies.

      (III) Validation of a hypothesis regarding colony formation: Using a fluid-mechanical approach, we confirm the findings of recent genetic studies (references 25 and 67 of the revised version of the manuscript) which indicated that colony formation occurs predominantly via cell division rather than cell aggregation under natural conditions (except in very dense blooms).

      (IV) Practical guidelines for cyanobacterial bloom control: Our findings provide valuable insights into the design of artificial mixing systems applied in several lakes. Artificial mixing of lakes is based on fundamentals of fluid flow, aiming at preventing aggregation of buoyant cyanobacteria in scum layers at the water surface. Our results show that the dissipation rates generated by bubble plumes in artificially mixed lakes can fragment cyanobacterial colonies formed by aggregation, but are not intense enough to cause fragmentation of division-formed colonies (see Figure 5 and lines 348-360).

      The agreement between model and experiments is impressive, but the role of the fit parameters in achieving this agreement needs to be further clarified.

      The influence of the fit parameters (namely the stickiness α1 and the pairs of colony strength parameters S1,q1,S2,q2) is discussed in the sections Dynamical changes in colony size modelled by a two-category distribution in lines 247-253 and Materials and Methods in lines 559-565. We kept the discussion concise to maintain readability. However, we agree with the reviewer that additional details about the importance of the fit parameters and the sensitivity of the results to these parameters could be beneficial. In the revised version of the section Materials and Methods in lines 560-563, we have included a detailed discussion of the fit parameters.

      The article may not be very accessible for readers with a biology background. Overall, the presentation of the material can be improved by better describing their new method.

      We apologize for the limited readability of the description of the experimental setup and model used. In the revised version of the manuscript and the SI, we have detailed further the new methods presented here. The modifications include a detailed description of the operating range of the cone-and-plate shear setup (subsection Cone-and-plate shear of the section Materials and Methods, in lines 462-473). Furthermore, we think that incorporation of the recent experimental results of Wu et al. (2024), on lines 331-337 of the manuscript, will appeal to readers with a biology background. Their mesocosm experiments support our model prediction that aggregation is the dominant mechanism for colony formation in region (II) of Figure 5.

      Reviewer #2 (Recommendations For The Authors):

      (1) The authors seem too modest in claiming technological advance. They should describe the technological advance of combining microscopy with rheometry, in such a way that this invites others to apply this or similar approaches on biological samples. Even though I feel that the advancement of knowledge of this system by their method is relatively modest, there may be more advances in other systems.

      We appreciate the positive view of the reviewer towards the importance of this technology and we agree that its advantages should be advertised to researchers investigating similar systems. We have now given more attention to the technological advance of combining microscopic imaging with rheometry in the final paragraph of the Conclusions (lines 386-400), where we now also briefly discuss interesting recent studies of marine snow (Song et al. 2023, Song and Rau 2022, reference numbers 70 and 71 of the revised manuscript), which used a similar combination of microscopy and rheometry as in our study. Furthermore, in the Methods section, we now briefly explain how the rheometry can be adjusted to investigate other systems (lines 474-480).

      (2) It seems reasonable -also based on what we already know about these aggregates - to assume that the main difference in shear sensitivity between field samples and cultures lies in the production of extracellular polysaccharide substance (EPS). To go beyond what is already known, the study could try to provide more direct and quantitative evidence for EPS involvement. For example, using a chemical quantification of EPS levels, or perturbing EPS levels using digestive enzymes.

      We agree with the reviewer that further characterization of the EPS is highly relevant to understand the mechanical strength of colonies. However, we believe that chemical quantification and/or degradation of EPS lies beyond the scope of our article and should be addressed by future studies.

      (3) Assuming EPS is indeed the reason for the differences in shear resistance: the authors speculate the reason why the field samples have more EPS lies in chemical composition (Calcium/nitrogen levels). In addition, there could be grazing that is known to promote aggregation (possibly increasing EPS), or just inherent genetic differences between strains. I am not necessarily expecting the authors to explore this direction experimentally, but it seems certainly feasible and would make the final result less speculative.

      We agree with the reviewer that there are more biotic and abiotic factors that can influence EPS amount and composition. The influence of grazing and other relevant factors on cell adhesion is discussed in references [26-29], cited in our introduction in lines 50-53. As discussed in our answer to recommendation #2, we believe that a quantitative investigation of these various factors is beyond the scope of this work and should be addressed in future studies.

      (4) A cool finding seems to be the critical relative diameter (Fig 2E), a colony size that seems invariant under shear. I was slightly surprised that the authors seem to take little effort to understand this critical diameter mechanistically (for example by predicting it, or experimentally perturbing it). Again, not a necessary requirement, but this is where the study could harness its technological advantage to provide a more quantitative understanding of something that goes beyond the existing knowledge of the system.

      We apologize to the reviewer if our descriptions and discussions of Figure 2 were unclear. One of the key conclusions from our experiments is that the critical relative diameter depends on the dissipation rate, as shown in Figure 2F. This dependence is also incorporated into the model through the constitutive equation (2). Furthermore, we expect the mechanical resistance of colonies, quantified by the critical relative diameter, to be affected by other biotic and abiotic factors that influence EPS amount and composition.

      (5) The jump from 0.019 to 1.1 m²/s³ seems large. What was the reason for not exploring intermediate values? The authors should also define low, modest and intense dissipation rates more clearly. Currently, they seem somewhat arbitrarily defined, i.e. 0.019 m²/s³ is described as low (methods) and moderate (results). In Fig 2, the authors further talk about low dissipation rates without a quantitative description.

      We thank the reviewer for pointing out the lack of clarity in the choice of parameter range and the nomenclature. Regarding the former, the suspension of division-formed colonies of Microcystis strain V163 displayed negligible fragmentation for dissipation rates between 0.019 and 1.1 m<sup>2</sup>/s<sup>3</sup>, as seen in Figures S2A and S3A. Due to the low sensitivity of the fragmentation results in this region, we do not expect a change in behavior for intermediate values. Regarding the nomenclature, we have corrected the inconsistencies throughout the text. We have chosen to name the dissipation rate values as: low for values typical of wind-mixing, moderate for values typical of the core of bubble plumes, and intense for values typical of propellers. Whenever mentioned in the text, the numerical value of dissipation rate is also included to avoid doubt.

      (6) The structure and narrative of the paper can be improved. The article first describes all lab culture experiments and then the model, while the first figure already shows model fits. Perhaps it would be better to first describe the aggregation experiments, to constrain the appropriate terms of the model, and then move to fragmentation.

      We appreciate the recommendation of the reviewer regarding the structure. We have chosen to describe first the fragmentation experiments (Fig. 2), as these can be understood without introducing the aggregation effects. In contrast, the steady state results in the aggregation experiments (Fig. 3) come from the balance between aggregation and fragmentation. Therefore, we judged the current order to be more appropriate. The model fits are combined with the experimental results in Figures 2 and 3 to have a concise display. We have ensured that all the concepts required to understand each figure panel are explained prior to their discussion.

      (7) The number of data points that go into the histogram needs to be indicated. The main reason is that the authors report the distribution in terms of the biovolume fraction, suggesting the numerical counts are converted into volume. This to me seems like the most sensible parameter, but I could not find how this conversion is calculated (my apologies if I missed it). This seems especially relevant because a single large colony can impact this histogram quite considerably.

      We apologize for the lack of clarity in the calibration and conversion steps of the size distribution. As discussed above in the answer to comment #5 of the reviewer #1, more details of the calibration process have been added to the revised version of the Supporting Information Text in lines 785-796. Furthermore, the new Supplementary Figure S8 presents examples of the raw and adjusted size distribution, including the total number of counted colonies per histogram and the associated uncertainties in the concentration and biovolume distributions.
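
The point raised in comment (7), that a single large colony can dominate a biovolume-weighted histogram, can be illustrated with a minimal sketch. This is not the authors' calibration or correction function (Equation S1). It assumes spherical colonies, so that volume scales with the cube of the diameter, and the function names, bin edges, and sample values are hypothetical.

```python
import math
from bisect import bisect_right


def sphere_volume(diameter):
    """Volume of a sphere of the given diameter (assumption: colonies are spherical)."""
    return math.pi * diameter ** 3 / 6.0


def size_distributions(diameters, bin_edges):
    """Return per-bin (number_fraction, biovolume_fraction) lists.

    diameters: colony diameters in any consistent unit (here: relative to one cell)
    bin_edges: ascending edges; bin i covers the half-open range [edge_i, edge_i+1)
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    volumes = [0.0] * n_bins
    for d in diameters:
        i = bisect_right(bin_edges, d) - 1
        if 0 <= i < n_bins:  # colonies outside the binned range are ignored
            counts[i] += 1
            volumes[i] += sphere_volume(d)
    total_count = sum(counts)
    total_volume = sum(volumes)
    number_fraction = [c / total_count for c in counts]
    biovolume_fraction = [v / total_volume for v in volumes]
    return number_fraction, biovolume_fraction


# Hypothetical sample: 99 single cells (relative diameter 1) and one colony
# with a relative diameter of 10. Bin edges separate "small" from "large".
diameters = [1.0] * 99 + [10.0]
number_fraction, biovolume_fraction = size_distributions(diameters, [0.0, 2.0, 20.0])
print(number_fraction)     # share of colonies per bin
print(biovolume_fraction)  # share of total volume per bin
```

In this hypothetical sample the large colony is 1% of the counted objects but holds 1000/1099, roughly 91%, of the total volume, which is why reporting the number of counted colonies per histogram matters.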

      (8) Over the timescales measured here, colonies could start sinking (or floating), possibly in a size-dependent manner, that could lead to a bias due to boundary effects. Did the authors consider this potential artifact?

      The sinking or floating of colonies is a relevant process which was taken into account in the choice of our parameter range for the dissipation rate. The minimum dissipation rate used in our experiments ensures that the upward inertial velocity near stagnation is sufficient to counteract the sedimentation of colonies. A detailed discussion of the choice of the parameter range is now included in the revised version of the Materials and Methods in lines 462-473.

      (9) "On the one hand, sequencing of the genetic diversity within Microcystis colonies supports the hypothesis that colony formation under natural conditions is primarily driven by cell division [25]. On the other hand, cell aggregation can occur on a shorter time scale and may offer improved protection against high grazing pressure [26]." This appears somewhat constructed, as what is described as "on the other hand" is not evidence against the genetic diversity.

      We agree that the suggested dichotomy in this text appeared somewhat constructed, and we have now removed the wording “on the one hand” and “on the other hand”. The studies from reference [25] demonstrated that the genetic diversity between independent Microcystis colonies is much greater than the diversity within colonies. If cell aggregation was the dominant mechanism, a similar genetic diversity would be observed between and within colonies, which contrasts the findings from reference [25]. We have adjusted the text in the revised manuscript, in lines 46-54, to clarify this point.

      (10) The phase diagram seems largely based on extrapolations that are made outside of the measurement regime (e.g. dark red bars indicating the dissipation rate, Fig 5 - by the way 1 this color scheme could use some better contrast, by the way 2 Fig S7 suggests a wider dissipation rate range than indicated in Fig 5, why?). Hence there seems to be the need to more clearly delineate experimental results, simulations, and extrapolations in the phase diagram.

      We agree with the reviewer that further clarifications should be given about the parameter range covered in our experiments and apologize for the lack of readability in the color scheme of Fig 5. In lines 329-337, 346-347, 353-355, we have highlighted the parameter range covered by our experiments as well as the range covered by previous studies of wind-mixed mesocosms (namely reference [64] of the revised manuscript). Regarding the color scheme of Figure 5, we have modified the legend of the figure to improve readability. The color contrast was increased and leader lines were added to connect the colored bars with the respective label.

      (11) Unfortunately, the manuscript did not contain line numbers.

      We apologize to the reviewer for the lack of line numbers in our initial version. The revised version of the manuscript now contains line numbers, both in the main text and the supporting information.

      (12) Fig 2D. Caption is too minimal. Y-axis could better be named "Fraction of colonies" as both small and large colonies are plotted.

      The caption for Figure 2D was extended to better describe the plot. We have kept the y-axis label as “Fraction of small colonies”, since this is the quantity displayed by the three curves in the plot.

      (13) An inset should have axis labels.

      All the insets in our plots display the same variables as their respective plots. In order to keep the plots light and preserve readability, we therefore prefer to present the axis labels only along the x-axis and y-axis of the main plots, which implies by convention that the same axis labels also apply to the insets. To the best of our knowledge, this is a common approach.

      (14) Page 5, first words. Likely Fig 3A, not 2A was meant.

      We thank the reviewer for pointing out this readability issue. We intend to compare both Figures 2A and 3A. The text of the revised manuscript, in lines 146-148, has been adjusted with the correct figure numbers.

      (15) Introduction, second last paragraph, third last line. "suspension leaded to a broad distribution" I assume you meant "... led to a ..."

      We thank the reviewer for pointing out this typo. It has been corrected (line 122).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      In this study, the authors offer a theoretical explanation for the emergence of nematic bundles in the actin cortex, carrying implications for the assembly of actomyosin stress fibers. As such, the study is a valuable contribution to the field of actomyosin organization in the actin cortex. While the theoretical work is solid, experimental evidence in support of the model assumptions remains incomplete. The presentation could be improved to enhance accessibility for readers without a strong background in hydrodynamic and nematic theories.

      To address the weaknesses identified in this assessment, we have expanded the motivation and description of the theoretical model, specifically insisting on the experimental evidence supporting its rationale and assumptions. These changes in the revised manuscript are implemented in the two first paragraphs of Section “Theoretical model” and in a more detailed description and justification of the different mathematical terms that appear in that section. We have made an effort to map in our narrative different terms to mechanistic processes in the actomyosin network. Even if the nature of the manuscript is inevitably theoretical, we think that the revised manuscript will be more accessible to a broader spectrum of readers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this article, Mirza et al. developed a continuum active gel model of the actomyosin cytoskeleton that accounts for nematic order and density variations in actomyosin. Using this model, they identify the requirements for the formation of dense nematic structures. In particular, they show that self-organization into nematic bundles requires both flow-induced alignment and active tension anisotropy in the system. By varying model parameters that control active tension and nematic alignment, the authors show that their model reproduces a rich variety of actomyosin structures, including tactoids, fibres, asters as well as crystalline networks. Additionally, discrete simulations are employed to calculate the activity parameters in the continuum model, providing a microscopic perspective on the conditions driving the formation of fibrillar patterns.

      Strengths:

      The strength of the work lies in its delineation of the parameter ranges that generate distinct types of nematic organization within actomyosin networks. The authors pinpoint the physical mechanisms behind the formation of fibrillar patterns, which may offer valuable insights into stress fiber assembly. Another strength of the work is connecting activity parameters in the continuum theory with microscopic simulations.

      We thank the referee for these comments.

      Weaknesses:

      (A) This paper is a very difficult read for nonspecialists, especially if you are not well-versed in continuum hydrodynamic theories. Efforts should be made to connect various elements of theory with biological mechanisms, which is mostly lacking in this paper. The comparison with experiments is predominantly qualitative.

      We understand the point of the referee. While it is unavoidable to present the continuum hydrodynamic theory behind our results, we have made an effort in the revised manuscript to (1) motivate the essential features required from a theoretical model of the actomyosin cytoskeleton capable of describing its nematic self-organization (first two paragraphs of Section “Theoretical model”), and to (2) explicitly explain the physical meaning of each of the mathematical terms in the theory, and when appropriate, relate them to molecular mechanisms in the cytoskeleton. We hope that the revised manuscript addresses the concern of the referee.

      Regarding the comparison with experiments, they are indeed qualitative because the main point of the paper is to establish a physical basis for the self-organization of dense nematic structures in actomyosin gels. Somewhat surprisingly, we argue that a compelling mechanism explaining the tendency of actomyosin gels to form patterns of dense nematic bundles has been lacking. As we review in the introduction, these patterns are qualitatively diverse across cell types and organisms in terms of geometry and dynamics, and for this reason, our goal is to show that the same material in different parameter regimes can exhibit such qualitative diversity. A quantitative comparison is difficult for several reasons. First, many of the parameters in our theory have not been measured and are expected to vary wildly between cell types. In fact, estimates in the literature often rely on comparison with hydrodynamic models such as ours. For this reason, we chose to delineate regimes leading to qualitatively different emerging architectures and dynamics. Second, the patterns of nematic bundles found across cell types depend on the interaction between (1) the intrinsic tendency of actomyosin gels to form such structures studied here and (2) other elements of the cellular context. For instance, polymerization and retrograde flow from the lamellipodium, the physical barrier of the nucleus, and the interaction with the focal adhesion machinery are essential to understand the emergence of stress fibers in adherent cells. Cell shape and curvature anisotropy control the orientation of actin bundles in parallel patterns in the wings and trachea of insects. Nuclear positions guide the actin bundles organizing the cellularization of Sphaeroforma arctica [11]. Here, we focus on establishing that actomyosin gels have an intrinsic ability to self-organize into dense nematic bundles, and leave how this property enables the morphogenesis of specific structures for future work. We have emphasized this point in the revised Conclusions section.

      (B) It is unclear if the theory is suited for in vitro or in vivo actomyosin systems. The justification for various model assumptions, especially concerning their applicability to actomyosin networks, requires a more thorough examination.

      We thank the referee for this comment. Our theory is applicable to actomyosin gels originating from living cells. To our knowledge, the ability of reconstituted actomyosin gels from purified proteins to sustain the kind of contractile dynamical steady-states observed in living cells is very limited. In the revised manuscript, we cite a very recent preprint presenting very exciting but partial results in this direction [49]. Instead, reconstituted in vitro systems encapsulating actomyosin cell extracts robustly recapitulate contractile steady-states. This point has been clarified in the first paragraph of Section “Theoretical model”.

      (C) The classification of different structures demands further justification. For example, the rationale behind categorizing structures as sarcomeric remains unclear when nematic order is perpendicular to the axis of the bands. Sarcomeres traditionally exhibit a specific ordering of actin filaments with alternating polarity patterns.

      We agree with the referee and in the revised manuscript we have avoided the term “sarcomeric” because it refers to very specific organizations in cells. What we previously called “sarcomeric patterns”, where bands of high density exhibit nematic order perpendicular to the axis of the bands, is not a structure observed to our knowledge in cells. It is introduced to delimit the relevant region in parameter space. In the revised manuscript, we refer to this pattern as “banded pattern with perpendicular nematic organization” or “banded pattern” in short.

      (D) Similarly, the criteria for distinguishing between contractile and extensile structures need clarification, as one would expect extensile structures to be under tension contrary to the authors' claim.

      We thank the referee for raising this point, which was not sufficiently clarified in the original manuscript. We first note that in incompressible active nematic models, active tension is deviatoric (traceless and anisotropic) because an isotropic component would simply get absorbed by the pressure field enforcing incompressibility. Being compressible, our model admits an active tension tensor with deviatoric and isotropic components. We always consider a contractile (positive) isotropic component of active tension, but the deviatoric component can be either contractile (𝜅 > 0) or extensile (𝜅 < 0), where we follow the common terminology according to which in contractile/extensile active nematics the active stress is proportional to q with a positive/negative proportionality constant [see e.g. https://doi.org/10.1038/s41467-018-05666-8]. Furthermore, as clarified in the revised manuscript, total active stresses accounting for the deviatoric and isotropic components are always contractile (positive) in all directions, as enforced by the condition |𝜅| < 1.

      For fibrillar patterns, we need 𝜅 < 0, and therefore active stresses are larger perpendicular to the nematic direction. This means that the anisotropic component of the active tension is extensile, although, accounting for the isotropic component, total active tension is contractile (see Fig. 1c). This is now clarified in the text following Eq. 7 and in Fig. 1.

      However, following fibrillar pattern formation and as a result of the interplay between active and viscous stresses, the total stress can be larger along the emergent dense nematic structures (“contractile structures”) or perpendicular to them (“extensile structures”). To clarify this point, in the revised Fig. 4 and the text referring to it, we have expanded our explanation and plotted the difference between the total stress component parallel to the nematic direction (𝜎∥) and the component perpendicular to the nematic direction (𝜎⊥), with contractile structures satisfying 𝜎∥ − 𝜎⊥ > 0 and extensile structures satisfying 𝜎∥ − 𝜎⊥ < 0. See lines 280 to 303. This is consistent with the common notion of contractile/extensile systems in incompressible nematic systems [see e.g. https://doi.org/10.1038/s41467-018-05666-8].
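
The sign conventions for the active tension in this reply can be made concrete with a short worked form. The expression below is an assumption made for illustration, not a quotation of Eq. 7: it takes the active tension to be proportional to the density 𝜌 and a positive activity 𝜆, and it normalizes the traceless nematic tensor q so that its eigenvalues are +S along the nematic direction and −S perpendicular to it, with 0 ≤ S ≤ 1.

```latex
\begin{aligned}
\boldsymbol{\sigma}^{\mathrm{act}} &= \rho\,\lambda\,\left(\mathbf{I} + \kappa\,\mathbf{q}\right),
  \qquad \lambda > 0,\\
\sigma^{\mathrm{act}}_{\parallel} &= \rho\,\lambda\,(1 + \kappa S),
  \qquad
  \sigma^{\mathrm{act}}_{\perp} = \rho\,\lambda\,(1 - \kappa S),\\
|\kappa| < 1 \ \text{and}\ 0 \le S \le 1
  &\;\Longrightarrow\;
  \sigma^{\mathrm{act}}_{\parallel} > 0 \ \text{and}\ \sigma^{\mathrm{act}}_{\perp} > 0,\\
\sigma^{\mathrm{act}}_{\parallel} - \sigma^{\mathrm{act}}_{\perp}
  &= 2\,\rho\,\lambda\,\kappa\,S \;<\; 0
  \quad \text{for } \kappa < 0,\ S > 0.
\end{aligned}
```

Under these assumptions, 𝜅 < 0 gives a larger active tension perpendicular to the nematic direction while both components remain contractile. This statement concerns the active tension only; the difference 𝜎∥ − 𝜎⊥ used to classify contractile and extensile structures is the total stress, which also includes viscous contributions.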

      (E) Additionally, it's unclear if the model's predictions for fiber dynamics align with observations in cells, as stress fibers exhibit a high degree of dynamism and tend to coalesce with neighboring fibers during their assembly phase.

      In the present work, we focus on the self-organization of a periodic patch of actomyosin gel. However, in adherent cells boundary conditions play an essential role, as discussed in our response to comment (A) by this referee. In ongoing work, we are using the present model to study the dynamics of assembly and reconfiguration of dense nematic structures in domains with boundary conditions mimicking those in adherent cells, possibly interacting with the adhesion machinery, and we find dynamical interactions such as those suggested by the referee. As an example, we show a video of a simulation where at the edge of the circular domain, there is an actin influx modeling the lamellipodium, and in four small regions friction is higher, simulating focal adhesions. Under these boundary conditions, the model presented in the paper exhibits the kind of dynamical reorganizations alluded to by the referee.

      Author response video 1.

      We would like to note, however, that the prominent stress fibers in cells adhered to stiff substrates, so abundantly reported in the literature, are not the only instance of dense nematic actin bundles. In the present manuscript, we emphasize the relation of the predicted organizations with those found in different in vivo contexts not related to stress fibers, such as the aligned patterns of bundles in insects (trachea, scales in butterfly wings), in hydra, or in reproductive organs of C. elegans; the highly dynamical network of bundles observed in C. elegans early embryos; or the labyrinth patterns of micro-ridges in the apical surface of epidermal cells in fish.

      (F) Finally, it seems that the microscopic model is unable to recapitulate the density patterns predicted by the continuum theory, raising questions about the suitability of the simulation model.

      We thank the referee for raising this question, which needs further clarification. The goal of the microscopic model is not to reproduce the self-organized patterns predicted by the active gel theory. The microscopic model lacks essential ingredients, notably a realistic description of hydrodynamics and turnover. Our goal with the agent-based simulations is to extract the relation between nematic order and active stresses for a small homogeneous sample of the network. This small domain is meant to represent the homogeneous active gel prior to pattern formation, and it allows us to substantiate key assumptions of the continuum model leading to pattern formation, notably the dependence of isotropic and deviatoric components of the active stress on density and nematic order (Eq. 7) and the active generalized stress promoting ordering.

      We should mention that reproducing the range of out-of-equilibrium mesoscale architectures predicted by our active gel model with agent-based simulations seems at present not possible, or at least significantly beyond the state-of-the-art. To our knowledge, these models have not been able to reproduce the heterogeneous nonequilibrium contractile states involving sustained self-reinforcing flows underlying the pattern formation mechanism studied in our work. The scope of the discrete network simulations has been clarified in lines 340 to 349 in the revised manuscript.

      While agent-based cytoskeletal simulations are very attractive because they directly connect with molecular mechanisms, active gel continuum models are better suited to describe out-of-equilibrium emergent hydrodynamics at a mesoscale. We believe that these two complementary modeling frameworks are rather disconnected in the literature, and for this reason, we have attempted to substantiate some aspects of our continuum modeling with discrete simulations. We have emphasized the complementarity of the two approaches in the conclusions.

      Reviewer #1 (Recommendations For The Authors):

      Questions on the theory:

      Does rho describe the density of actin or myosin? The authors say that they are modeling actomyosin material as a whole, but the actin and myosin should be modeled separately. Along similar lines, does Q define the ordering of actin or myosin?

      Active gel models of the actomyosin cytoskeleton have been formulated with independent densities for actin and for myosin or using a single density field, implicitly assuming a fixed stoichiometry. Super-resolution imaging of the actomyosin cytoskeleton also suggests that in principle it makes sense to consider different nematic fields for actin and for myosin filaments. In the revised manuscript, we now explicitly mention that our density and nematic field are effective descriptions of the entire actomyosin gel (lines 82-84).

      A more detailed model would entail additional material parameters, not available experimentally, which may help reproduce specific experiments but would make the systematic study of the different behaviors much more difficult. Our approach has been to keep the model minimal while meeting the fundamental requirements outlined in the first paragraphs of Section “Theoretical model”.

      Should the active stress depend on material density? It seems strange (from Eq. 3) that active stress could be non-zero even where density is zero, since sigma_act does not depend on rho.

      Yes, active stress is assumed to be proportional to density. Eq. 3 in the original manuscript was misleading (it was multiplied by rho in Eq. 2). In the revised manuscript, we have explained with a bit more detail the theoretical model, clarifying this point.

      The authors should clearly explain their rationale for retaining certain types of nonlinear terms while ignoring others in theory. For instance, the nonlinearities in the equations of motion are sometimes quadratic in the fields, while there are also some cubic terms. Please remark up to what order in the fields the various interactions are modeled.

      We thank the referee for raising this point. The nonlinearities in the theory are easily explained on the basis of a small number of choices. We have added a new paragraph towards the end of Section “Theoretical model” (lines 145 to 152) providing a rationale for the origin and underlying assumptions leading to different nonlinearities.

      To connect with experiments and the biological context, please explain the biological origin of various terms in the model: (1) L-dependent terms in Eq. 2 and 4, (2) Flow-alignment of nematic order and experimental evidence in support of it, (3) density-dependent susceptibility terms in Eq. 4

      (1) Unfortunately, the L-dependent terms are very bulky, but are very standard in nematic theories. The best way to understand their physical significance is through the expression of the nematic free-energy, which is now given and explained in the revised manuscript (Eq. 3). The resulting complicated expression for the molecular field and the nematic stress (Eqs. 4 and 5) are mathematical consequences of the choice of nematic free energy. In the revised manuscript, we also attempt to provide a basis for these terms in the context of the actin cytoskeleton. (2) To our knowledge, the best reference supporting this term from experiments is Reymann et al, eLife (2016). In the revised manuscript, we have provided a physical interpretation. (3) We have expanded the motivation and plausible microscopic justification of this term.

      There are different 'activity' terms in the model. Their biophysical origin is not made clear. For example, the authors should make clear if these activities arise from filament or motor activity. Relatedly, the authors should provide a comprehensive discussion of the signs of the different active parameters and their physical interpretations.

      In an active gel model, activity parameters are phenomenological and how they map to molecular mechanisms is not precisely known, although conventionally contractile active tension is ascribed to the mechanical transduction of chemical power by myosin motors. The fact is that, besides myosin activity, there are many nonequilibrium processes in the actomyosin cytoskeleton that may lead to active stresses including (de)polymerization of filaments or (un)binding of crosslinkers. In the revised manuscript, we have added sentences illustrating how different terms may result from microscopic mechanisms, but providing a precise mapping between our model and nonequilibrium dynamics of proteins is beyond the scope of our work, although our discrete network simulations address this issue to a certain degree.

      Following the suggestion of the referee, our description of the theory now discusses much more extensively the signs of activity parameters and their physical interpretations, e.g. the text following Eq. 7.

      Throughout the paper, various activity terms are varied independently of each other. Is that a reasonable assumption given that activities should depend on ATP and are thus not independent of one another?

      We agree that, ultimately, all active processes depend on the conversion of chemical energy into mechanical energy. However, recent work has highlighted how active tension also depends on the microscopic architecture of the network controlled by multiple regulators of the actomyosin cytoskeleton (e.g. Chugh et al, Nat Cell Biol, 2017). It is reasonable to expect that, for a given rate of ATP consumption, chemical power will be converted into mechanical power in different ways depending on the micro-architecture of the cytoskeleton, e.g. the stoichiometry of filaments, crosslinkers, myosins, or the length distribution of filaments (very long filaments crosslinked by myosins may be difficult to reorient but may contract efficiently).

      We have added a paragraph in Section “Theoretical model” with a discussion, lines 153 to 156.

      Sarcomeres are muscle fibers that exhibit alternating polarity pattern. Such patterning is not evident in what the authors call 'sarcomeres' in Fig. 2. I believe the authors should revise their terminology and not loosely interpret existing classifications in the field.

      We thank the referee for raising this point. We have changed the terminology.

      Fig 2a: Is the cartoon for filament alignment incorrect for kappa>0?

      The cartoon is correct. In the revised manuscript we have explained more clearly the physical meaning of kappa in the text following Eq. 7. In the caption of Fig. 1 and of Fig. 2a, we have also clarified that when the absolute value of kappa is <1, then active tension is positive in all directions.

      Within the section "Requirements for fibrillar and banded patterns", it will be useful to show the figures for varying the different active parameters in the main figures.

      We have followed the referee’s suggestion and moved Supp. Fig. 1 of the original manuscript to the main figures.

      How do the authors decide if bundles are contractile or extensile? Why are contractile bundles under tension while extensile bundles are under compression? I would expect the opposite.

      We agree that this point deserves a more detailed explanation. In the revised manuscript and in the new Figure 4, we further develop this point. The fibrillar pattern forms when kappa<0. We further assume that -1<kappa<0, so that active tension is positive in all directions. In this regime, the deviatoric (anisotropic) part of active tension is extensile. However, following pattern formation and because of the interplay between active and viscous stresses, the total stress in the emerging bundles may become extensile or contractile, depending on whether the largest component of stress is perpendicular or along the bundle axis. This is now presented in the updated figure, with new panels presenting maps of the total tension. The text discussing this point has been rewritten and we hope that the new version is much clearer (lines 280 to 303).
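
      The regime −1 < kappa < 0 can be illustrated with a short sketch. It assumes, for illustration only, that the active tension has the form rho·lambda·(I + kappa·q), where q has eigenvalue +S along the nematic direction and −S across it, with 0 ≤ S ≤ 1. This normalization is an assumption chosen to be consistent with the statements above and may differ from the exact form of Eq. 7; the function name and parameter values are likewise hypothetical.

```python
def active_tension(rho, lam, kappa, S):
    """Active tension along and across the nematic direction.

    Assumed form (not copied from the manuscript):
        sigma_act = rho * lam * (I + kappa * q),
    with q having eigenvalue +S along the director and -S across it,
    and 0 <= S <= 1.
    """
    along = rho * lam * (1.0 + kappa * S)
    across = rho * lam * (1.0 - kappa * S)
    return along, across

# Scan a few illustrative values of kappa at maximal order S = 1.
for kappa in (-0.5, 0.5, -1.5):
    along, across = active_tension(rho=1.0, lam=1.0, kappa=kappa, S=1.0)
    anisotropy = "extensile" if along < across else "contractile"
    all_positive = along > 0 and across > 0
    print(f"kappa={kappa:+.1f}: along={along:+.2f}, across={across:+.2f}, "
          f"anisotropic part {anisotropy}, tension positive everywhere: {all_positive}")
```

      Under this assumed form, kappa < 0 makes the tension across the nematic direction the larger one (an extensile anisotropic part), while |kappa| < 1 keeps both components positive, so the gel remains contractile in every direction.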

      A contractile bundle tends to shorten, but it cannot do so because of boundary conditions or the interaction with other bundles. As a result, it is in tension. Conversely, an extensile bundle tries to elongate, but being constrained, it becomes compressed. As an analogy, consider the cortex of a suspended cell. The cortex is contractile, but it cannot contract because of volume regulation in the cell, which is typically pressurized. As a result, tension in the cortex is positive, as shown by Laplace’s law [10.1016/j.tcb.2020.03.005]. We have tried to clarify this point in the revised manuscript.
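
      The cortex analogy can be made concrete with Laplace's law for a spherical shell, T = ΔP·R/2. The sketch below is only an illustration: the function name is ours, and the pressure difference and radius are hypothetical round numbers rather than measured values.

```python
import math

def cortical_tension(delta_p, radius):
    """Surface tension of a spherical shell from Laplace's law.

    delta_p: excess internal pressure in Pa (must be non-negative here).
    radius:  cell radius in m (must be positive).
    Returns the tension in N/m, using T = delta_p * radius / 2.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    if delta_p < 0:
        raise ValueError("delta_p must be non-negative")
    return 0.5 * delta_p * radius

# Hypothetical numbers: 100 Pa excess pressure, 10 micrometre radius.
tension = cortical_tension(delta_p=100.0, radius=10e-6)
print(f"cortical tension = {tension * 1e3:.2f} mN/m")
```

      A pressurized cell (ΔP > 0) therefore always carries positive cortical tension, even though the contractile cortex cannot actually shorten.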

      Can the authors reproduce alternating density patterns using the cytosim simulations? This is an important step in establishing the correspondence between the continuum theory and the agent-based model.

      We have addressed this point in our response to public comment (F) of this referee.

      The authors do not provide code or data.

      The finite element code, together with an input file required to run a representative simulation from the paper, is now available; see Ref. [74].

      The customizations of Cytosim needed to account for nematic order in our discrete network simulations are available, see Ref. [98].

      Reviewer #2 (Public Review):

      Summary:

      The article by Waleed et al discusses the self organization of actin cytoskeleton using the theory of active nematics. Linear stability analysis of the governing equations and computer simulations show that the system is unstable to density fluctuations and self organized structures can emerge. While the context is interesting, I am not sure whether the physics is new. Hence I have reservations about recommending this article.

      We thank the referee for these comments. In the revised manuscript, we have highlighted the novelty, particularly in the last paragraph of the introduction, the first two paragraphs of Section “Theoretical model”, and in the conclusions. Despite a very large literature on theoretical models of stress fibers, actin rings, and active nematics, we argue that the active self-organization of dense nematic structures from an isotropic and low-density gel has not been compellingly explained so far. Many models assume from the outset the presence of actin bundles, or explain their formation using localized activity gradients. The literature of active nematics has extensively studied symmetry breaking and the self-organization. However, most of the works assume initial orientational order. Only a few works study the emergence of nematic order from a uniform isotropic state, but consider dry systems lacking hydrodynamic interactions or incompressible and density-independent systems [37,38]. Yet, pattern formation in actomyosin gels is characterized by large density variations, and by highly compressible flows, which coordinate in a mechanism relying on an advective instability and self-reinforcing flows.

      Our theoretical model is not particularly novel, and as we mention in the manuscript, it can be particularized to different models used in the literature. However, we argue that it has the right minimal features to capture nematic self-organization in actomyosin gels. To our knowledge, no previous study explains the emergence of dense and nematic structures from a low-density isotropic gel as a result of activity and involving the advective instability typical of symmetry-breaking and patterning in the actomyosin cytoskeleton. These are important qualitative features of our results that resonate with a large experimental record, and as such, we believe that our work provides a new and compelling mechanism relying on self-organization to explain the prominence and diversity of patterns involving dense nematic bundles in the actomyosin cytoskeleton across species.

      Strengths:

      (i) Analytical calculations complemented with simulations (ii) Theory for cytoskeletal network

      Weaknesses:

      Not placed in the context or literature on active nematics.

      We agree with the referee that this was a weakness of the original manuscript. In the revised manuscript, within reasonable space constraints given the size and dynamism of the field of active nematics, we have placed our work in the context of this field (end of introduction and first two paragraphs of Section “Theoretical model”). The published version of our companion manuscript [45] also contributes to providing a clear context to our theoretical model within the field.

      Reviewer #2 (Recommendations For The Authors):

      The article by Waleed et al discusses the self organization of actin cytoskeleton using the theory of active nematics. Linear stability analysis of the governing equations and computer simulations show that the system is unstable to density fluctuations and self organized structures can emerge. While the context is interesting, I am not sure whether the physics is new. Hence I have reservations about recommending this article. I explain my questions comments below.

      We have responded to this comment above.

      (i) Active nematics including density variations have been dealt with quite extensively in the literature. For example, the works of Sriram Ramaswamy have dealt with this system including linear stability analysis, simulations etc. In what way is the present work different from the system that they have considered?

      (ii) Active flows leading to self organization has been a topic of discussion in many works. For example: (i) Annual Review of Fluid Mechanics, Vol. 43:637-659, 2010, https://doi.org/10.1146/annurev-fluid-121108-145434 (ii) S Santhosh, MR Nejad, A Doostmohammadi, JM Yeomans, SP Thampi, Journal of Statistical Physics 180, 699-709 (iii) M. G. Giordano, F. Bonelli, L. N. Carenza, G. Gonnella and G. Negro, Europhysics Letters, Volume 133, Number 5. In what way is this work different from any of these?

      (iii) I am confused about the models used in the paper. There is significant literature from Prof. Mike Cates group, Prof. Julia Yeomans group, Prof. Marchetti's group who all use similar governing equations. In the present paper, I find it hard to understand whether the model used is similar to the existing ones in literature or are there significant differences. It should be clarified.

      Response to (i), (ii) and (iii).

      We completely agree with this referee (and also the previous referee) that the contextualization of our work in the field of active nematics was very insufficient. In the revised manuscript, the last paragraph of the introduction and the first two paragraphs of Section “Theoretical model” now address this point. In short, previous active nematic models predicting patterns with density variations have been either for dry active matter (disregarding hydrodynamic interactions), or for suspensions of active particles moving in an incompressible flow. None of these previous works predict nematic pattern formation as a result of activity relying on the advective instability and self-reinforcing compressible flows, leading to high density and high order bundles surrounded by an isotropic low density phase. Yet, these are fundamental features observed in actomyosin gels. Many works deal with symmetry-breaking of a system with pre-existing order, but very few address how order emerges actively from an isotropic state. We thank the referee for pointing to the paper by Santhosh et al, which nicely makes this argument and is now cited. Our mechanism is fundamentally different from that in Santhosh et al, whose model is incompressible and ignores density variations.

      We hope that the revised manuscript addresses this important concern.

      (iv) Below Eqn 6, it starts by saying that the “...origin..is clear...” It's not. I don't understand the physical origin of the instability, and this should be clarified, maybe with some illustrations.

      We apologize for this unfortunate sentence, which we have rewritten in the revised manuscript (lines 181 to 185).

      Reviewer #3 (Public Review):

      The manuscript "Theory of active self-organization of dense nematic structures in the actin cytoskeleton" analyzes self-organized pattern formation within a two-dimensional nematic liquid crystal theory and uses microscopic simulations to test the plausibility of some of the conclusions drawn from that analysis. After performing an analytic linear stability analysis that indicates the possibility of patterning instabilities, the authors perform fully non-linear numerical simulations and identify the emergence of stripe-like patterning when anisotropic active stresses are present. Following a range of qualitative numerical observations on how parameter changes affect these patterns, the authors identify, besides isotropic and nematic stress, also active self-alignment as an important ingredient to form the observed patterns. Finally, microscopic simulations are used to test the plausibility of some of the conclusions drawn from continuum simulations.

      The paper is well written, figures are mostly clear and the theoretical analysis presented in both, main text and supplement, is rigorous. Mechano-chemical coupling has emerged in recent years as a crucial element of cell cortex and tissue organization and it is plausible to think that both, isotropic and anisotropic active stresses, are present within such effectively compressible structures. Even though not yet stated this way by the authors, I would argue that combining these two is one of the key ingredients that distinguishes this theoretical paper from similar ones. The diversity of patterning processes experimentally observed is nicely elaborated on in the introduction of the paper, though other closely related previous work could also have been included in these references (see below for examples).

      We thank the referee for these comments and for the suggestion to emphasize the interplay of isotropic and anisotropic active tension, which is possible only in a compressible gel. We have emphasized this point in different places in the revised manuscript. We also thank the referee for the suggestions to better connect with existing literature.

      To introduce the continuum model, the authors exclusively cite their own, unpublished pre-print, even though the final equations take the same form as previously derived and used by other groups working in the field of active hydrodynamics (a certainly incomplete list: Marenduzzo et al (PRL, 2007), Salbreux et al (PRL, 2009, cited elsewhere in the paper), Jülicher et al (Rep Prog Phys, 2018), Giomi (PRX, 2015),...). To make better contact with the broad active liquid crystal community and to delineate the present work more compellingly from existing results, it would be helpful to include a more comprehensive discussion of the background of the existing theoretical understanding on active nematics. In fact, I found it often agrees nicely with the observations made in the present work, an opportunity to consolidate the results that is sometimes currently missed out on. For example, it is known that self-organised active isotropic fluids form in 2D hexagonal and pulsatory patterns (Kumar et al, PRL, 2014), as well as contractile patches (Mietke et al, PRL 2019), just as shown and discussed in Fig. 2. It is also known that extensile nematics, \kappa<0 here, draw in material laterally of the nematic axis and expel it along the nematic axis (the other way around for \kappa>0, see e.g. Doostmohammadi et al, Nat Comm, 2018 "Active Nematics" for a review that makes this point), consistent with all relative nematic director/flow orientations shown in Figs. 2 and 3 of the present work.

      We thank the referee for these suggestions. Indeed, in the original submission we had outsourced much of the justification of the model and the relevant literature to a related pre-print, but this is not reasonable. The companion publication has now been accepted in the New Journal of Physics, with significant changes to better connect the work to the field of active nematics. A preprint reflecting those changes is available in Ref. [64], but we hope to reference the published paper that will come out soon.

      In the revised manuscript, we have significantly rewritten the Section “Theoretical model” to frame the continuum model in the context of the field of active nematics. While our model and results have commonalities with previous work, there are also important differences. We have highlighted the novelty of the present work along with the relation with previous studies and theoretical models in the last paragraph of the introduction and the first two paragraphs of Section “Theoretical model”. Furthermore, as suggested by the referee, we have made an effort to connect our results with previous work by Kumar, Mietke, Doostmohammadi and others.

      Regarding the last point alluded to by the referee (“extensile nematics, \kappa<0 here, draw in material laterally of the nematic axis and expel it along the nematic axis”), the picture raised by the referee would be nuanced for our compressible system as compared to the incompressible systems discussed in that reference. As we have elaborated in our response to point (D) of Referee #1, our systems are overall contractile (with positive active tension in all directions), but the deviatoric component of the active tension can be either extensile or contractile. In our “extensile” models (left in Fig. 2c), material is drawn in laterally to the nematic axis but it is not expelled along this axis. Instead, it is “expelled” by turnover. In the revised manuscript, we have added a comment about this.

      The results of numerical simulations are well-presented. Large parts of the discussion of numerical observations - specifically around Fig. 3 - are qualitative and it is not clear why the analysis is restricted to \kappa<0. Some of the observations resonate with recent discussions in the field, for example the observation of effectively extensile dynamics in a contractile system is interesting and reminiscent of ambiguities about extensile/contractile properties discussed in recent preprints (https://arxiv.org/abs/2309.04224). It is convincingly concluded that, besides nematic stress on top of isotropic one, active self-alignment is a key ingredient to produce the observed patterns.

      We thank the referee for these comments. We are reluctant to extend the detailed analysis of emergent architectures and dynamics to the case \kappa > 0 as it leads to architectures not observed, to our knowledge, in actin networks. In the revised manuscript, we have expanded and clarified the characterization of emergent contractile/extensile networks by reporting the relative magnitude of stress along and perpendicular to the nematic direction. Our revised manuscript clearly shows that even though all of our simulations describe locally contractile systems with extensile anisotropic active tension, the emergent meso-structures can be either extensile or contractile, with the extensile ones exhibiting the usual bend-type instability (a secondary instability in our system) described classically for extensile active nematic systems. We have rewritten the text discussing this (lines 280 to 303), where we have placed these results in the context of recent work reporting the nontrivial relation between the contractility/extensibility of the local units vs the nematic pattern.

      I compliment the authors for trying to gain further mechanistic insights into this conclusion with microscopic filament simulations that are diligently performed. It is rightfully stated that these simulations only provide plausibility tests and, within this scope, I would say the authors are successful. At the same time, it leaves open questions that could have been discussed more carefully. For example, I wonder what can be said about the regime \kappa>0 (which is dropped ad-hoc from Fig. 3 onward) microscopically, in which the continuum theory does also predict the formation of stripe patterns - besides the short comment at the very end? How does the spatial inhomogeneous organization the continuum theory predicts fit in the presented, microscopic picture and vice versa?

      We thank the referee for this compliment. We think that the point raised by the referee is very interesting. It is reasonable to expect that the sign of \kappa may not be a constant but rather depend on S and \rho. Indeed, for a sparse network with low order, the progressive bundling by crosslinkers acting on nearby filaments is likely to produce a large active stress perpendicular to the nematic direction, whereas in a dense and highly ordered region, myosin motors are more likely to effectively contract along the nematic direction whereas there is little room for additional lateral contraction by additional bundling. As discussed in our response to referee #1, we believe that studying the formation of patterns using the discrete network simulations is far beyond the scope of our work. We discuss in lines 332 to 341, as well as in the last paragraph of the conclusions, the scope and limitations of our discrete network simulations.

      Overall, the paper represents a valuable contribution to the field of active matter and, if strengthened further, might provide a fruitful basis to develop new hypothesis about the dynamic self-organisation of dense filamentous bundles in biological systems.

      Reviewer #3 (Recommendations For The Authors):

      • The statement "the porous actin cytoskeleton is not a nematic liquid-crystal because it can adopt extended isotropic/low-order phases" is difficult to understand and should be clarified, as the next paragraph starts formulating a nematic active liquid crystal theory. Do the authors mean a crystal that "Tends to be in a disordered phase?", according to its equilibrium properties? It would still be a "nematic liquid crystal", only its ground state is not a nematic phase.

      We agree with the referee, and we hope that changes in the introduction and in Section “Theoretical model” address this comment.

      • I could not find what Frank energy is precisely used, that would be helpful information.

      In the revised manuscript, we have provided the expression for the nematic free energy in Eq. 3.

      • The significance of the green/purple arrows in the Fig 2a sketch is unclear; there are green arrows also in b,c, do they represent the same quantity? From the simulation images it is overall very difficult to see how the flows are oriented near the high-density regions (i.e. if they are towards / away from the strip).

      We thank the referee for bringing this up. The color codings of the sketches were confusing. The modified figures (Fig. 1(c) and Fig. 2(a)) now present a clearer and unified representation of anisotropic tension. The green arrows in Fig. 2(c) represent the out-of-equilibrium flows in the steady state. We agree that the zoom is insufficient to resolve the flow structure. For this reason, in the revised Fig. 2, we have added additional panels showing the flow with higher resolution.

      • It is currently unclear how the linear stability results - beyond identification of the parameter \delta - inform any of the remaining manuscript. Quantitative comparisons of the various length scales seen in simulated patterns (e.g. Fig. 2b, 3c etc) with linear predictions and known characteristic length scales would be instructive mechanistically, would make the overall presentation more compelling and probes limitations of linear results.

      In the revised manuscript, we have provided further information so that the readers can appreciate the predictions and limitations of the linear stability results. We have added a sentence and a Figure to show that, in addition to the critical activity, the linear theory provides a good prediction of the wavelength of the pattern. See lines 199 to 201.
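
      To illustrate how a linear stability analysis yields both an instability threshold and a pattern wavelength, the sketch below uses a generic toy dispersion relation, ω(q) = −k + a·q² − b·q⁴. This is not the dispersion relation of the manuscript; the coefficients k (uniform damping), a (net destabilizing effect) and b (short-wavelength stabilization) are hypothetical placeholders chosen only to show the procedure.

```python
import math

def growth_rate(q, k, a, b):
    """Toy growth rate omega(q) = -k + a*q**2 - b*q**4 (illustrative only)."""
    return -k + a * q**2 - b * q**4

def fastest_mode(k, a, b):
    """Return (q_star, wavelength, omega_star) of the fastest-growing mode,
    or None if no mode is linearly unstable.

    For this toy relation the maximum of omega(q) sits at
    q_star = sqrt(a / (2*b)), where omega_star = a**2 / (4*b) - k.
    """
    if a <= 0 or b <= 0:
        return None  # omega(q) has no maximum at finite q > 0
    q_star = math.sqrt(a / (2.0 * b))
    omega_star = growth_rate(q_star, k, a, b)
    if omega_star <= 0:
        return None  # below threshold: all modes decay
    return q_star, 2.0 * math.pi / q_star, omega_star

# Above threshold (a**2 / (4*b) > k): a finite wavelength is selected.
print(fastest_mode(k=1.0, a=4.0, b=1.0))
# Below threshold: the uniform state is linearly stable.
print(fastest_mode(k=5.0, a=4.0, b=1.0))
```

      In this toy setting the threshold a²/(4b) = k plays the role of a critical activity, and 2π/q* is the wavelength that would be compared against the spacing measured in nonlinear simulations.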

      • It is not clear what is meant by "[bundle-formation] requires that active tension perpendicular to nematic orientation is larger than along this direction", and therefore also not why that would be "counter-intuitive". If interpreted naively, I would say that a large tension brings in more filaments into the bundle, so that may well be an obviously helpful feature for bundle formation and maintenance. In any case, it would be helpful if clarity is improved throughout when arguments about "directions of tensions" are made.

      We have significantly rewritten the first paragraphs of section “Microscopic origin…” to clarify this point (lines 330 to 339). This paragraph, along with other changes in the manuscript such as the explanation of Eq. 7 or the discussion about the stress anisotropy in the new version of Fig. 4 (see lines 280 to 303), provide a better explanation of this important point.

      • All density color bars: Shouldn't they rather be labelled \rho/\rho_0?

      Yes! We have corrected this typo.

      • Scalar product missing in caption definition of order parameter Fig. 2

      We have corrected this typo.

      • Fig. 3a: I suggest to put the expression for q0 in the caption

      We have replaced q_0 with S_0 and clarified its meaning in the caption of what is now Fig. 4.

      • Paragraph on bottom right of page 6 should several times probably refer to Fig. 3c(...), instead of Fig. 3b

      We have corrected this typo.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Strengths: 

      Overall, this manuscript is well-written and contains a large amount of high-quality data and analyses. At its core, it helps to shed light on the overlapping roles of Edc3 and Scd6 in sculpting the yeast transcriptome. 

      Weaknesses: 

      (1) While the data presented makes conclusions about mRNA stability based on corresponding ChIP-Seq analyses and analyzing other mutants (e.g. Dcp2 knockout), at no point is mRNA stability actually ever directly assessed. This direct assessment, even for select transcripts, would further strengthen their conclusions. 

      We appreciate the reviewer’s concern but wish to emphasize that we conducted ChIP-Seq analysis of RNA Polymerase II occupancies in the CDSs of all genes, known to be a reliable indicator of transcription rate, and found only small increases in Pol II occupancies that cannot account for the increased transcript levels of the cohort of mRNAs up-regulated in the scd6∆edc3∆ double mutant (Fig. 3E). This provides strong evidence that increased transcription is not the main driver of increased mRNA abundance in this mutant. Bolstering this conclusion, we showed that the Hap2/Hap3/Hap4/Hap5 complex of transcription factors responsible for induction of Ox. Phos. genes was not activated in scd6∆edc3∆ cells in glucose medium (Fig. 6F(ii)); nor was the Adr1 activator of CCR genes activated (Fig. S9C(i)), ruling out transcriptional induction of their target genes in glucose-replete scd6∆edc3∆ cells and instead favoring reduced degradation as the mechanism underlying derepression of Ox. Phos. and CCR gene transcripts in this mutant. In Fig. 3B, we further showed that the majority of mRNAs up-regulated in the scd6∆edc3∆ double mutant are also derepressed by dcp2∆, and in Fig. 3D that the mRNAs up-regulated in scd6∆edc3∆ cells exhibit a higher than average codon protection index (CPI), indicating a heightened involvement of decapping and co-translational degradation by Xrn1 in their decay.

      To provide additional support for our conclusion, we have conducted new experiments to measure the abundance of capped mRNAs genome-wide by CAGE sequencing of total mRNA in both WT and scd6∆edc3∆ cells. As established previously, normalizing CAGE TPMs to total mRNA TPMs determined by RNA-Seq, dubbed the C/T ratio, provides a reliable measure of the capped proportion of each transcript. The new data presented in Fig. 3C indicate that the mRNAs up-regulated in the scd6∆edc3∆ mutant have significantly lower than average C/T ratios in WT cells, whereas the C/T ratios for the down-regulated transcripts are higher than average, and that these differences between the two groups and all expressed mRNAs are diminished in the scd6∆edc3∆ double mutant. These are the results expected if the up-regulated mRNAs are selectively targeted for decapping in WT cells dependent on Edc3/Scd6, whereas the down-regulated mRNAs are targeted by Edc3/Scd6 less than the average transcript. In the original version of the paper, we came to the same conclusion by analyzing our previous CAGE data for the dhh1∆ mutant for the same transcripts dysregulated in scd6∆edc3∆ cells, now presented as supportive data in Fig. S3F. Finally, we added the fact that, of the four Dhh1 target mRNAs examined in the previous study of He et al. (2022) and found here to be up-regulated selectively in the scd6∆edc3∆ double mutant (Fig. S10), two (SDS23 and HXT6) were shown directly by He et al. (2018) to have longer half-lives in dhh1∆ vs. WT cells.

      Hence, the combined evidence is compelling that selective up-regulation of particular mRNAs in the scd6∆edc3∆ mutant results from diminished decapping/decay rather than enhanced transcription, and we feel that the additional supporting evidence that would be provided by measuring half-lives of a small group of up-regulated transcripts would not justify the considerable effort required to do so. Moreover, the standard approach for such experiments of impairing transcription with an inhibitor of Pol II or a Pol II Ts⁻ mutation has been criticized because of the known buffering (suppression) of mRNA decay rates in response to impaired transcription.
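
      The C/T ratio described in this response (CAGE TPM divided by RNA-Seq TPM for each transcript) can be illustrated with a minimal sketch. The function name, the `min_tpm` expression filter, and the toy values are assumptions made for illustration only; they are not taken from the authors' pipeline or data.

```python
def ct_ratio(cage_tpm, rnaseq_tpm, min_tpm=1.0):
    """Return CAGE TPM / RNA-Seq TPM for each transcript.

    Transcripts below `min_tpm` in RNA-Seq are skipped to avoid unstable
    ratios; this threshold is an illustrative assumption.
    """
    ratios = {}
    for gene, total in rnaseq_tpm.items():
        if total >= min_tpm and gene in cage_tpm:
            ratios[gene] = cage_tpm[gene] / total
    return ratios


# Hypothetical toy values, not data from the study.
cage = {"geneA": 20.0, "geneB": 90.0}
total = {"geneA": 100.0, "geneB": 100.0, "geneC": 0.2}

ratios = ct_ratio(cage, total)
for gene, ratio in sorted(ratios.items()):
    print(f"{gene}: C/T = {ratio:.2f}")
```

      In this toy case, geneA has the lower C/T ratio, which under the reasoning above would correspond to a smaller capped proportion.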

      (2) Scd6 and Edc3 show a high level of functional redundancy, as demonstrated by the double mutant. As these proteins form complexes with other decapping factors/activators, I'm curious if depleting both proteins in the double mutant destabilizes any of these other factors. Have the authors ever assessed the levels of other key decapping factors in the double mutants (i.e. Dhh1, Pat1, Dcp2...etc)? I wonder if depleting both proteins leads to a general destabilization of key complexes. It would also be interesting to see if depleting Edc3 or Scd6 leads to a concomitant increase in the other protein as a compensatory mechanism. 

      We thank the reviewer for this insight. Examining our Ribo-Seq and TMT-MS data revealed that Dhh1 expression and steady-state abundance are increased ~2-fold in the scd6∆edc3∆ strain, indicating that the up-regulation of many of the same mRNAs by scd6∆edc3∆ and dhh1∆ does not result indirectly from reduced levels of Dhh1 in the scd6∆edc3∆ mutant. The predicted increase in Dhh1 expression might signify a compensatory response to the absence of Scd6/Edc3. We also observed an ~40% reduction in Dcp2 translation (RPFs) and mRNA abundance in the scd6∆edc3∆ strain, which might contribute to the up-regulation of mRNAs dysregulated in this mutant. However, our new immunoblot analyses revealed no significant reduction in steady-state Dcp2 levels in scd6∆edc3∆ cells (Input lanes in Figs. 3F and S4C(i)-(ii)). Moreover, our previous finding that the majority of mRNAs subject to NMD, up-regulated by both upf1∆ and dcp2∆, are not up-regulated by scd6∆edc3∆ implies that Dcp2 abundance in scd6∆edc3∆ cells is adequate for normal levels of NMD and favors a direct role for Scd6/Edc3 in accelerating degradation of most transcripts up-regulated in this mutant. We have added these points to the DISCUSSION.

      (3) While not essential, it would be interesting if the authors carried out add-back experiments to determine which domain within Scd6/Edc3 plays a critical role in enforcing the regulation that they see. Their double mutant now puts them in a perfect position to carry out such experiments. 

      We agree with the reviewer that our scd6∆edc3∆ strain provides an opportunity to dissect the Scd6 and Edc3 proteins to determine which domains and motifs of each protein are most critically required for their functions in activating mRNA decay. However, if conducted thoroughly, this would entail an extensive analysis requiring a combination of genetics, biochemistry and genomics.  Considering the large amount of data already presented in 43 and 34 panels of main and supplementary figures, respectively, we feel that these additional experiments would be conducted more appropriately as a stand-alone follow-up study.

      Reviewer #2 (Public review): 

      Weaknesses: 

      The authors show very nicely in Figure S1A that growth phenotypes from scd6Δedc3∆ can be rescued by transformation of EDC3 (pLfz614-7) or SCD6 (pLfz615-5). The manuscript might benefit from using these rescue strategies in the analysis performed (e.g. RNA-seq, ribosome occupancies, and translational efficiencies). Also, these rescue assays could provide a good platform to further characterise the protein-protein interactions between Edc3, Scd6, and Dhh1. 

      We responded to this point immediately above in responding to Rev. #1.

      Reviewer #3 (Public review): 

      Weaknesses: 

      The limitations of the study include the use of indirect evidence to support claims that Edc3 and Scd6 recruit Dhh1 to the Dcp2 complex, which is inferred from correlations in mRNA abundance and ribosome profiling data rather than direct biochemical evidence. 

      While the reviewer makes a valid point, it is important to note that the greater correlations between effects of scd6∆edc3∆ with those conferred by dhh1∆ vs. pat1∆ also extended to changes in metabolites (Fig. 7A-C). To provide more direct evidence that Edc3 and Scd6 recruit Dhh1 to the Dcp2 complex, we have now conducted co-immunoprecipitation experiments (presented in new Figs. 3F and S5) demonstrating that association of Dhh1 with Dcp2 is diminished in the scd6∆edc3∆ double mutant but not in either scd6∆ or edc3∆ single mutant, thus providing biochemical support for our proposal.

      Also, there is limited exploration of other signals as the study is focused on glucose availability, and it is unclear whether the findings would apply broadly across different environmental stresses or metabolic pathways. Nonetheless, the study provides new insights into how mRNA decapping and degradation are tightly linked to metabolic regulation and nutrient responses in yeast. The RNA-seq and ribosome profiling datasets are valuable resources for the scientific community, providing quantitative information on the role of decapping activators in mRNA stability and translation control. 

      While not disputing the facts of this comment, we think it is unjustified to label as a weakness that our study focused on glucose-grown cells considering the large amount of new data and insights made possible by our multi-omics approach, presented in >70 separate figure panels and nine supplementary datafiles, which the reviewer has characterized as being valuable to the scientific community.  Parallel studies in non-preferred carbon or nitrogen sources are underway and represent large-scale investigations in their own right, for which the current dataset in glucose-replete cells provides the critical reference condition.

      Reviewer #1 (Recommendations for the authors): 

      The authors made a note that a set of 37 mRNAs is repressed exclusively by Edc3 with little contribution by Scd6, a list that includes the RPS28B mRNA. Edc3 has been previously reported to promote the decay of this mRNA in a deadenylation-independent fashion by binding to an element in its 3'UTR (PMIDs 15225544, 24492965). Can the authors comment on whether Edc3 may be binding to similar elements in the 3'UTRs of these transcripts in their shortlist? This could be an interesting topic matter for discussion as well. 

      While an interesting idea, this seems unlikely because the 3’UTR sequence in RPS28B mRNA was shown to bind Rps28 protein itself to confer heightened decapping and decay dependent on Edc3 in a negative autoregulatory loop that exerts tight control over Rps28 protein levels. It would be surprising if Edc3-mediated repression of the other 36 mRNAs involved Rps28, as none of them encode cytoplasmic ribosomal proteins. Nevertheless, we searched for a conserved motif among the 3’UTRs of the 37 mRNAs using the MEME suite and found enrichment for motifs identified for the RNA binding proteins Hrp1 and Nab2 and for two novel motifs, but none of these motifs could be recognized within the Rps28 autoregulatory loop. We have chosen not to comment on these findings in the revised manuscript to avoid lengthening it unnecessarily with inconclusive observations.

      Reviewer #2 (Recommendations for the authors): 

      The authors show very nicely in Figure S1A that growth phenotypes from scd6Δedc3∆ can be rescued by the transformation of EDC3 (pLfz614-7) or SCD6 (pLfz615-5). The manuscript might benefit from using these rescue strategies on the analysis performed (e.g. RNA-seq, ribosome occupancies, and translational efficiencies); or expressing truncated mutants of EDC3 (pLfz614-7) or SCD6 (pLfz615-5), to show that they can act as dominant negative competitors, either on the binding to Dhh1 and Dcp2. 

      We addressed this comment above in our response to this Reviewer.

      Reviewer #3 (Recommendations for the authors): 

      (1) Labels such as "mRNA_up_s6,e3" are not defined in figures or the text. I suggest clearer sample labeling throughout. 

      The labels had been defined at first mention in the RESULTS but are now indicated there more explicitly, as well as in the legend to Fig. 1.

      (2) In Figure 1D it is surprising that the mRNA profile has a peak in the 5' UTR. I would expect to see such a peak in ribosome footprinting data. Is it possible these are incorrectly labeled?

      The figure is correctly labeled. Generally, one does not expect to see RPFs in the 5’UTR region unless there is an efficiently translated uORF, which appears not to be the case for MDH2.

      In general, the information in this panel and C is inadequate. None of the numbers are clearly explained in the figure legend or in the figure. 

      We had cited the legend to Fig. S3C for details of all such gene browser images but have now inserted this information into the Fig. 1D legend, at the first occurrence of such data in the regular figures. 

      (3) Figures 1C and 1D are in the wrong order.

      Corrected.

      (4) Figure 2D is a very complicated Venn Diagram. I suggest using UpSet plots as an alternative to Venn diagrams to more clearly convey overlaps between sets.  

      We provided additional explanatory text in the Fig. 2D legend to facilitate understanding.

      (5) The use of the same color scheme to represent different sets in panels of the same figure is a source of confusion. E.g. the cyan in Figures 2A, 2D, and 2E indicates unrelated categories, but one would think they are related.

      The use of the same cyan color in these three figure panels actually does designate results for the same set of 591 mRNAs up-regulated in the three mutants.  The application of the color schemes is now mentioned explicitly in Figs. 1, 2, and S3.

      (6) Reporting of p-values = 0 in figures is not useful.

      Corrected.

      (7) The whole manuscript is extremely long which reduces the overall impact. For example, the introduction is six pages long. I suggest reducing redundant text and being more concise to enhance readability. 

      We tried to streamline the text wherever possible, in particular shortening the Introduction by two pages.

      (8) Many abbreviations are used throughout the text that are not introduced the first time they are used. 

      Corrected throughout.

      (9) The ERCC normalization is unclear. Were the spike-ins added before cell lysis to allow estimation of per-cell RNA counts or to the extracted RNA? If added to extracted RNA rather than cells it is not clear to me how the claim can be made regarding increased mRNA abundance in the mutants. 

      We thank the reviewer for this comment. As we explained in the Methods, 2.4 µl of 1:100 diluted ERCC RNA Spike-In Control Mix 1 was added to 1.2 µg of each total RNA sample prior to cDNA library preparation. Because the majority of total RNA consists of rRNA, this normalization yields the abundance of each mRNA relative to rRNA. Owing to repression of rESR mRNAs encoding ribosomal proteins and biogenesis factors in the scd6∆edc3∆ strain (Fig. S3D), the ribosome content per cell is expected to be reduced in this mutant vs. WT. We showed previously that the isogenic dcp2∆ mutant, which elicits an ESR response of similar magnitude, showed a 30% reduction in bulk ribosomal subunits per cell compared to the same WT strain examined here (Vijjamarri et al., 2023). Assuming a similar reduction in ribosome abundance in the scd6∆edc3∆ mutant, the changes in mRNA per cell conferred by the scd6∆edc3∆ mutation are expected to be 0.7-fold of the ERCC-normalized values given in Fig. 3E, yielding fold-changes of 2.00 and 0.62 for the mRNA_up and mRNA_dn groups, respectively, which still differ substantially from the corresponding changes in normalized Rpb1 occupancies of 1.2 and 0.93, respectively. We have added this new analysis to the text of RESULTS.
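
      The correction described above is a single multiplication, which the following minimal sketch makes explicit. The 0.7 factor, the corrected fold-changes (2.00 and 0.62), and the Rpb1 occupancy changes (1.2 and 0.93) are the values quoted in this response; the ERCC-normalized values printed here are back-calculated from them for illustration and are not read from Fig. 3E. Function names are hypothetical.

```python
RIBOSOME_FACTOR = 0.7  # assumed ~30% fewer ribosomes per cell, by analogy with dcp2∆


def per_cell_fold_change(ercc_normalized_fc, factor=RIBOSOME_FACTOR):
    """Scale an rRNA-relative fold-change to an estimated per-cell fold-change."""
    return ercc_normalized_fc * factor


def implied_ercc_fold_change(per_cell_fc, factor=RIBOSOME_FACTOR):
    """Invert the correction: which ERCC-normalized value gives this per-cell value?"""
    return per_cell_fc / factor


corrected = {"mRNA_up": 2.00, "mRNA_dn": 0.62}  # values quoted in the response
rpb1 = {"mRNA_up": 1.2, "mRNA_dn": 0.93}        # Rpb1 occupancy changes quoted in the response

for group, fc in corrected.items():
    implied = implied_ercc_fold_change(fc)
    print(f"{group}: implied ERCC-normalized {implied:.2f}, "
          f"per-cell {fc:.2f}, Rpb1 occupancy {rpb1[group]:.2f}")
```

      Even after the correction, the per-cell mRNA fold-changes remain further from 1 than the Rpb1 occupancy changes in both groups, which is the point of the comparison.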

      (10) The use of the terms "up-regulated" and "derepressed" throughout is confusing. Both refer to observed increased abundance of mRNAs, but they imply different causes which are never clearly defined. 

      We changed all occurrences of “derepressed” to “up-regulated”.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public review):

      (1)  The sharpening model of expectation can predict surround suppression. The authors could further clarify how the cancellation model predicts a monotonic profile of expectation (Figure 1C) with the highest response at the expected orientation, while the cancellation model suggests a suppression of neurons tuned toward the expected stimulus.

      We thank the reviewer for the comment. We would like to emphasize that as the expected signal is suppressed, the relative weight or salience of unexpected inputs increases. We have clarified this interpretation in the manuscript as follows:

      “Here, given these two mechanisms making opposite predictions about how expectation changes the neural responses of unexpected stimuli, thereby displaying different profiles of expectation, we speculated that if expectation operates by the sharpening model with suppressing unexpected information, we should observe an inhibitory zone surrounding the focus of expectation, and its profile then should display as a center-surround inhibition (Fig. 1c, left). If, however, expectation operates as suggested by the cancelation model with highlighting unexpected information, the inhibitory zone surrounding the focus of expectation should be eliminated, and the profile should instead display a monotonic gradient (Fig. 1c, right).”

      (2) I'm a bit concerned about whether the profile solely arises from modulation of expectation. The two auditory cues are each associated with a fixed orientation, which may be confounded by other cognitive processes like visual working memory or attention (which I think the authors also discussed). Although the authors tried to use SFD task to render orientation task-irrelevant, luminance edges (i.e., orientation) and spatial frequency in gratings are highly intertwined and orientation of the gratings may help recall the first grating's SF (fixed at 0.9 c/°), especially given the first and second grating's orientations are not very different (4.8°).

      We agree that dissociating expectation from attention and other top-down processes remains a key challenge in visual expectation research (see Summerfield & Egner, 2009; Summerfield & de Lange, 2014; de Lange et al., 2018). As is generally acknowledged, expectation reflects the probability of a sensory event, while selective attention relates to its behavioral relevance. To minimize attentional influences, our task design ensured that grating orientation was not task-relevant: on each trial, participants discriminated either orientation or spatial frequency difference, such that orientation itself did not require attentional allocation, a point already discussed in the manuscript.

      Regarding visual working memory, we argue that even if participants recalled the first grating’s spatial frequency in the SFD task, they were not required to retain its precise spatial frequency (or orientation), as their task was simply to judge whether the second grating appeared denser or sparser. In other words, orientation (or spatial frequency) itself was not task-relevant. Moreover, although not included in the manuscript, we conducted a post-experiment debriefing in which participants were asked whether they noticed any association between the auditory tone and the grating orientation. None of the participants reported this relationship correctly, suggesting that the tone-orientation mapping remained implicit and was unlikely to be driven by strategic attention or memory.

      However, we acknowledge that certain confounding processes such as statistical learning or implicit mapping acquisition cannot be fully ruled out given the current paradigm. Future studies using methods with higher temporal resolution (e.g., EEG/MEG) may help to dissociate these mechanisms more precisely.

      (3) For each of the expected orientations (20° or 70°), the unexpected ones are linearly separable (i.e., all unexpected ones lie on one side of the expected angle). This might further encourage people to shift their attended or expected orientation, according to the optimal tuning hypothesis. Would this provide an alternative explanation to the tuning shift that the authors found?

      We thank the reviewer for pointing out the relevance of the optimal tuning hypothesis. We acknowledge that the optimal tuning theory (Navalpakkam & Itti, 2007) is an important framework, particularly in visual search paradigms, where attentional templates may shift away from non-target features to enhance discriminability.

      In our task, this hypothesis would predict a shift of expectation toward <20° in E20° trials and >70° in E70° trials, given that all unexpected orientations lie on one side of the expected angle. Importantly, the optimal tuning hypothesis predicts such shifts not only in Δ20°, Δ25°, and Δ30° trials but also in the Δ0° trials. In this regard, the observed shift in Δ20° and Δ30° (Experiment 2) and Δ25° (Experiment 3) trials is broadly consistent with the predictions of the optimal tuning account. However, we did not observe a corresponding shift away from nontarget features in the Δ0° condition, suggesting limited behavioral evidence for optimal tuning effects under our current task settings.

      It is important to note that most previous studies supporting optimal tuning (e.g., Navalpakkam & Itti, 2007; Scolari & Serences, 2009; Geng, DiQuattro, & Helm, 2017; Yu & Geng, 2019) have used visual search paradigms that differ from our design in several critical ways, including the number of stimuli presented, their spatial arrangement (eccentricity), task demands, and so on. Therefore, it is difficult to determine whether the optimal tuning hypothesis could serve as an alternative explanation within the context of our current study. We agree that future studies could further examine how such task parameters influence the presence or absence of optimal tuning.

      (4) It is great that the authors conducted computational modeling to elucidate the potential neuronal mechanisms of expectation. But I think the sharpening hypothesis (e.g., reviewed in de Lange, Heilbron & Kok, 2018) focuses on the neural population level, i.e., narrowing of population tuning profile, while the authors conducted the sharpening at the neuronal tuning level. However, the sharpening of population does not necessarily rely on the sharpening of individual neuronal tuning. For example, neuronal gain modulation can also account for such population sharpening. I think similar logic applies to the orientation adjustment experiment. The behavioral level shift does not necessarily suggest a similar shift at the neuronal level. I would recommend that the authors comment on this.

      We thank the reviewer for this to-the-point comment. As de Lange et al. (2018) noted, “there is not always a direct correspondence between neural-level and voxel-level selectivity patterns.” That is, neuronal tuning, population-level tuning, voxel-level selectivity, and behavioral adaptive outcomes may reflect different underlying mechanisms and do not necessarily align in a one-to-one fashion. We fully acknowledge that population-level tuning effects may also result from various neuronal mechanisms such as gain modulation (for review, see Salinas & Thier, 2000), shifts in preferred orientation (Ringach et al., 1997; Jeyabalaratnam et al., 2013), asymmetric broadening of tuning curves (Schumacher et al., 2022), or tuning curve sharpening (Ringach et al., 1997; Schoups et al., 2001).

      In our modeling, we implemented sharpening and shifts of neuronal tuning curves as a conceptual model simplification, intended to explore potential mechanisms underlying expectation-related center-surround suppression effects. While sharpening-based accounts (e.g., Kok et al. 2012) have often been emphasized, we stress that other mechanisms, such as gain modulation or tuning shifts, may also contribute. Our goal is not to provide a definitive account, but to highlight such plausible mechanisms and encourage future investigation. We have revised the Discussion to emphasize that multiple mechanisms may underlie the observed effects.

      “We note that our implementation of sharpening and shifts at the neuronal level serves as a conceptual model simplification, as population-level tuning, voxel-level selectivity, and behavioral adaptive outcomes may reflect different underlying neuronal mechanisms and do not necessarily align in a one-to-one fashion. Here, we stress that other potential mechanisms beyond sharpening, such as tuning shifts, may also contribute to visual expectation.” 
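
      The three candidate mechanisms named in this reply (gain change, sharpening, and shift) can be pictured as changes to the amplitude, width, and center of a single tuning curve. The sketch below assumes a plain Gaussian tuning function with arbitrary parameter values and ignores the circularity of orientation space; it is an illustration of the concepts, not the authors' computational model.

```python
import math


def tuning_curve(theta, amplitude=1.0, center=0.0, width=15.0):
    """Gaussian orientation tuning (degrees); the functional form is assumed."""
    return amplitude * math.exp(-0.5 * ((theta - center) / width) ** 2)


# Hypothetical parameter sets: baseline, sharpened (narrower), shifted (moved center),
# and gain-modulated (taller). None of these values come from the study.
conditions = {
    "baseline":  dict(amplitude=1.0, center=0.0, width=15.0),
    "sharpened": dict(amplitude=1.0, center=0.0, width=10.0),
    "shifted":   dict(amplitude=1.0, center=5.0, width=15.0),
    "gain":      dict(amplitude=1.2, center=0.0, width=15.0),
}

for name, params in conditions.items():
    responses = [round(tuning_curve(theta, **params), 3) for theta in (0, 10, 20)]
    print(f"{name:9s} response at 0/10/20 deg: {responses}")
```

      Sharpening lowers the response to flanking orientations while leaving the peak unchanged, whereas a shift moves the peak itself, so the two can in principle be distinguished by fitting amplitude, width, and location separately.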

      (5) If the orientation adjustment experiment suggests that both sharpening and shifting are present at the same time, have the authors tried combining both in their computational model?

      We agree with the reviewer that it is necessary to consider the combined model. Accordingly, we implemented a computational model incorporating sharpening of the expected orientation channel together with shifting of the unexpected orientation channels. This model successfully captured the sharpening of the expected-orientation channel and the shift of the unexpected-orientation channels (Supplementary Fig. 3). For the expected orientation (Δ0°), results showed that the amplitude change was significantly higher than zero on both OD (t(23) = 2.582, p = 0.017, Cohen’s d = 0.527) and SFD (t(23) = 2.078, p = 0.049, Cohen’s d = 0.424) tasks (Supplementary Fig. 3e, vertical stripes); the width change was significantly lower than zero on both OD (t(23) = -2.438, p = 0.023, Cohen’s d = 0.498) and SFD (t(23) = -2.578, p = 0.017, Cohen’s d = 0.526) tasks (Supplementary Fig. 3e, diagonal stripes). For unexpected orientations (Δ10°-Δ40°), however, the amplitude and width changes did not differ significantly from zero on either OD (amplitude change: t(23) = 0.443, p = 0.662, Cohen’s d = 0.091; width change: t(23) = -1.819, p = 0.082, Cohen’s d = 0.371) or SFD (amplitude change: t(23) = 1.130, p = 0.270, Cohen’s d = 0.231; width change: t(23) = -1.710, p = 0.101, Cohen’s d = 0.349) tasks (Supplementary Fig. 3f). Meanwhile, the location shift was significantly different from zero for unexpected orientations (Δ10°-Δ40°; OD task: t(23) = 3.611, p = 0.001, Cohen’s d = 0.737; SFD task: t(23) = 2.418, p = 0.024, Cohen’s d = 0.493; Supplementary Fig. 3g). These results provided further evidence that tuning sharpening and tuning shift jointly contribute to center-surround inhibition in expectation.
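
      The statistics above are tests of a mean against zero with 23 degrees of freedom, i.e. n = 24. For readers who want to check how the effect sizes relate to the t values, the sketch below assumes a one-sample t-test and the one-sample Cohen's d (mean divided by sample standard deviation), for which |d| = |t| / √n. The reported magnitudes appear consistent with this relation, but the authors' exact procedure is not stated here, so treat the function names and the toy data as illustrative assumptions.

```python
import math
import statistics


def one_sample_t(values, mu=0.0):
    """Return (t, Cohen's d, degrees of freedom) for a test of mean(values) against mu."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    t = (mean - mu) / (sd / math.sqrt(n))
    d = (mean - mu) / sd
    return t, d, n - 1


def cohens_d_from_t(t, n):
    """One-sample Cohen's d recovered from a t statistic and sample size."""
    return t / math.sqrt(n)


# Effect sizes implied by t values quoted in the response, assuming n = 24.
for label, t in [("amplitude, OD", 2.582), ("width, OD", -2.438), ("shift, OD", 3.611)]:
    print(f"{label}: |d| = {abs(cohens_d_from_t(t, 24)):.3f}")

# Hypothetical toy data showing that the two routes to d agree.
t_toy, d_toy, df_toy = one_sample_t([1.0, 2.0, 3.0, 4.0])
```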

      Reviewer #1 (Recommendations for the authors):

      (1) A direct comparison between tasks (baseline vs. expectation conditions) would have strengthened the findings. Specifically, contrasting performance in the orientation discrimination task with the spatial frequency discrimination task could have provided clearer evidence that participants actually used the auditory cues to attend to the expected orientation. This comparison would be particularly important for validating cue manipulation in the orientation discrimination task.

      We agree that a direct comparison between the orientation discrimination (OD) and spatial frequency discrimination (SFD) tasks could further clarify how expectation (auditory cues) differentially modulates orientation relevance. However, the primary goal of the current study was to examine expectation effects within each task separately and to demonstrate that such effects are independent of attentional modulation driven by the task-relevance of orientation.

      In addition, the OD and SFD tasks differ not only in the relevant task feature (orientation vs. spatial frequency discrimination) but also in stimulus properties and difficulty, for example, the arbitrary use of 20–70° as the orientation range and ~0.9 cycles/° as the spatial frequency setting. A direct comparison could therefore introduce confounding factors unrelated to expectation.

      Importantly, previous studies (e.g., Kok et al., 2012, 2017; Aitken et al., 2020) and our current results show that participants performed significantly better when the auditory cue matched the expected orientation, supporting the validity of our expectation manipulation.

      (2) An interesting consideration is why the center-surround inhibition profile of expectation was independent of the task-relevance of orientation. Previous studies (e.g., Kok et al., 2012) have found that orientation discrimination patterns differ depending on whether orientation is task-relevant or irrelevant. This could be useful to discuss the possible discrepancies.

      We thank the reviewer for this inspiring comment. Kok et al. (2012) showed that both orientation and contrast tasks elicited similar fMRI decoding results, regardless of task relevance, suggesting that the neural mechanisms of expectation operate independently of whether orientation is task-relevant. Behaviorally, they reported better performance for expected versus unexpected trials in the orientation task (3.4° vs. 3.8°, t(17) = 2.8, p = 0.013), and a marginal, non-significant trend in the contrast task (4.3% vs. 5.0%, t(17) = 1.9, p = 0.075). If any differences between the two tasks exist, they may lie in the correlation between behavioral and fMRI effects, a question that goes beyond the scope of the current study. Therefore, it is hard to conclude from their paper that orientation discrimination patterns differ depending on whether orientation is task-relevant or irrelevant.

      Our study differs from theirs in at least two important ways, which may account for the clearer expectation facilitatory effect we observed in the expectation (Δ0°) condition. First, in our study, the orientation-irrelevant task involved spatial frequency discrimination (SFD) rather than contrast discrimination. Compared to contrast, spatial frequency has been shown to exhibit a clear cueing effect, as reported in Fang & Liu (2019). Second, our design included a baseline condition, which was absent in their study. We computed discrimination sensitivity (DS) to quantify how much the discrimination threshold (DT) changed relative to baseline. By using this baseline-referenced approach, we observed a significant facilitatory expectation effect in the Δ0° condition, an effect that shifted from marginal significance in their orientation-irrelevant task to clear significance in our study.

      (3) The authors might consider briefly explaining how the orientation adjustment paradigm used in this study is particularly effective for examining the potential co-existence of tuning sharpening and tuning shift computations, and how this approach complements traditional orientation discrimination tasks in characterizing expectation-related mechanisms.

      We thank the reviewer for this valuable suggestion. We agree that further clarification is needed to better connect the two experiments. To explain this, we have elaborated further in the manuscript.

      “To further explore the co-existence of both Tuning sharpening and Tuning shift computations in center-surround inhibition profile of expectation, participants were asked to perform a classic orientation adjustment experiment. Unlike profile experiment (discrimination tasks), the adjustment experiment provides a direct, trial-by-trial measure of participants’ perceived orientation, capturing the full distribution of responses. This enables the construction of orientation-specific tuning curves, allowing us to detect both tuning sharpening and tuning shifts, thereby offering a more nuanced understanding of the computational mechanisms underlying expectation.”

      (4) These interesting findings raise important questions about their relationship to existing hybrid models of attentional modulation. Could the authors discuss how their results might align with or extend previous work demonstrating combined feature-similarity gain and surround suppression effects for orientation (e.g., Fang & Liu, 2019)? Could a hybrid model potentially provide a better account of these data than the pure surround suppression model?

      We thank the reviewer for this valuable comment. We agree that the hybrid model should be mentioned in the manuscript and we have elaborated further in the Discussion.

      “For example, within the orientation space, the inhibitory zone was about 20°, 45°, and 54° for expectation evident here, feature-based attention [21], and visual perceptual learning [35], respectively; within the feature-based attention, it was about 30° and 45° in color [77] and motion direction [53] spaces, respectively. These variations hint at the exciting possibility that the width of the inhibitory surround may flexibly adapt to stimulus context and task demands, ultimately facilitating our perception and behavior in a changing environment. This principle is consistent with the hybrid model of feature-based attention [53,54,75], where attention is deployed adaptively to prioritize task-relevant information through feature-similarity gain which filters out the most distinctive distractors, and surround suppression which inhibits similar and confusable ones, thereby jointly shaping the attentional tuning profile.”

      (5) On page 19, there appears to be a missing symbol in the description of the Tuning Sharpening model. The text states: 'the tuning width of each channel's tuning function is parameterized by ??', where the question marks seem to indicate a missing parameter symbol.

      We appreciate the reviewer’s careful attention. Yes, the "σ" is missing, which was likely caused by a formatting issue. We have corrected it.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This work investigated how the sense of control influences perceptions of stress. In a novel "Wheel Stopping" task, the authors used task variations in difficulty and controllability to measure and manipulate perceived control in two large cohorts of online participants. The authors first show that their behavioral task has good internal consistency and external validity, showing that perceived control during the task was linked to relevant measures of anxiety, depression, and locus of control. Most importantly, manipulating controllability in the task led to reduced subjective stress, showing a direct impact of control on stress perception. However, this work has minor limitations due to the design of the stressor manipulations/measurements and the necessary logistics associated with online versus in-person stress studies.

      Nevertheless, this research adds to our understanding of when and how control can influence the effects of stress and is particularly relevant to mental health interventions.

      We thank the reviewer for their clear and accurate summary of the findings. 

      Strengths:

      The primary strength of this research is the development of a unique and clever task design that can reliably and validly elicit variations in beliefs about control. Impressively, higher subjective control in the task was associated with decreased psychopathology measures such as anxiety and depression in a non-clinical sample of participants. In addition, the authors found that lower control and higher difficulty in the task led to higher perceived stress, suggesting that the task can reliably manipulate perceptions of stress. Prior tasks have not included both controllability and difficulty in this manner and have not directly tested the direct influence of these factors on incidental stress, making this work both novel and important for the field.

      We thank the reviewer for their positive comments.

      Weaknesses:

      One minor weakness of this research is the validity of the online stress measurements and manipulations. In this study, the authors measure subjective stress via self-report both during the task and also after either a Trier Social Stress Test (high-stress condition) or a memory test (low-stress condition). One concern is that these stress manipulations were really "threats" of stress, where participants never had to complete the stress tasks (i.e., recording a speech for judgment). While this is not unusual for an in-lab study and can reliably elicit substantial stress/anxiety, in an online study, there is a possibility for communication between participants (via online forums dedicated to such communication), which could weaken the stress effects. That said, the authors did find sensible increases and decreases of perceived stress between relevant time points, but future work could improve upon this design by including more complete stress manipulations and measuring implicit physiological signs of stress.

      We thank the reviewer for urging us to expand on this point. The reviewer is right that stress was merely anticipatory and is in that sense different to the canonical TSST. However, there are ample demonstrations that such anticipatory stress inductions are effective at reliably eliciting physiological and psychological stress responses (e.g. Nasso et al., 2019; Schlatter et al., 2021; Steinbeis et al., 2015). Further, there is evidence that online versions of the TSST are also effective (DuPont et al., 2022; Meier et al., 2022), including evidence that the speech preparation phase conducted online was related to increases in heart rate and blood pressure (DuPont et al., 2022). Importantly, and as the reviewer notes in relation to our study specifically, the anticipatory TSST had a significant impact on subjective stress in the expected direction demonstrating that it was effective at eliciting subjective stress. We have elaborated further on this in our manuscript (pages 8 and 9) as follows: 

      “Prior research has found TSST anticipation to elicit both psychological and physiological stress responses [37-39], suggesting that the task anticipation would be a valid stress induction despite participants not performing the speech task. Moreover, prior research has validated the use of remote TSST in online settings [40, 41], including evidence that the speech preparation phase (online) was related to increased heart rate and blood pressure compared to controls [40].”

      Reviewer #2 (Public review):

      Summary:

      The authors have developed a behavioral paradigm to experimentally manipulate the sense of control experienced by the participants by changing the level of difficulty of a wheel-stopping task. In the first study, this manipulation is tested by administering the task in a factorial design with two levels of controllability and two levels of stressor intensity to a large number of participants online while simultaneously recording subjective ratings on perceived control, anxiety, and stress. In the second study, the authors used the wheel-stopping task to induce a high sense of controllability and test whether this manipulation buffers the response to a subsequent stress induction when compared to a neutral task, like looking at pleasant videos.

      We thank the reviewer for their accurate summary.

      Strengths:

      (1) The authors validate a method to manipulate stress.

      (2) The authors use an experimental manipulation to induce an enhanced sense of controllability to test its impact on the response to stress induction.

      (3) The studies involved big sample sizes.

      We thank the reviewer for noting these positive aspects of our study. 

      Weaknesses:

      (1) The study was not preregistered.

      This is correct.

      (2) The control manipulation is conflated with task difficulty and, therefore, the reward rate. Although the authors acknowledge this limitation at the end of the discussion, it is a very important limitation, and its implications are not properly discussed. The discussion states that this is a common limitation with previous studies of control but omits that many studies have controlled for it using yoking.

      We agree that these are very important issues to consider in the interpretation of our findings. It is important to note that, while our task design does not separate these constructs, we are able to do so in our statistical analyses. For example, our measure of perceived difficulty was included in analyses assessing the fluctuations in stress and control in which subjective control still had a unique effect on the experience of stress over and above perceived difficulty, suggesting that subjective control explains variance in stress beyond what is accounted for by perceived difficulty. Similarly, we have also included additional analyses in which we include the win rate (i.e. percentage of trials won) as a covariate when assessing the relationship between subjective control, perceived difficulty and subjective stress, in which subjective control and perceived difficulty still uniquely predict subjective stress when controlling for win rate. This suggests that there is unique variance in subjective control, separate from perceived task difficulty and win rate, that is relevant to stress. We have included these analyses (page 16 of manuscript) as follows:

      “To further isolate the relationship between subjective control and stress separate from perceived task difficulty or objective task performance, we also included the overall win rate (percentage of trials won during the WS task) in the models. In Study 1, lower feelings of control were related to higher levels of subjective stress (β= -0.12, p<.001) even when controlling for both win rate (β= -0.06, p=.220) and perceived task difficulty (β= 0.37, p<.001, Table S10). This also replicated in Study 2, where lower subjective control was associated with higher feelings of stress (β= -0.32, p<.001) when controlling for perceived task difficulty (β= 0.31, p<.001) and win rate (β= -0.11, p=.428, Table S11). This suggests that there is unique variance in subjective feelings of control, separate from task performance, relevant to subjective stress.”

      As well as expanding on this in the Discussion (pages 27 and 28) as follows:

      “While our task design does not separate control from obtained reward, we are able to do so in the statistical analyses. Like with perceived difficulty, we statistically accounted for reward rate and showed that the relationship between subjective control and stress was not accounted for by reward rate, for example. Similarly, participants received feedback after every trial, and thus feedback valence may contribute to stress perception. However, given that overall win rate (which captures the feedback received during the task) did not predict stress over and above perceived difficulty or subjective control, it suggests that feedback is unlikely to relate to stress over and above difficulty. Future work will need to disentangle this further to rule out such potential confounds.”

      Further, in terms of the wider literature on these issues, we have added more to this point in our discussion, especially in relation to previous literature that also varies control by reward rate (e.g. Dorfman & Gershman, 2019, who use a reward rate of 80% in high control conditions and 50% in low control conditions). This can be found in the manuscript on page 27 as follows: 

      “Previous research typically accounts for different outcomes (e.g. punishment) by yoking controllable and uncontrollable conditions [3], though other work has manipulated the controllability of rewards by changing the reward rate (for example [30], where a decoy stimulus is rewarded 50% of the time in the low control condition but 80% in the high control condition).”

      (3) The methods are not always clear enough, and it is difficult to know whether all the manipulations are done within-subjects or some key manipulations are done between subjects.

      We have added more information in the methods section (page 8) clarifying within-subject manipulations (WS task parameters) and between-subject manipulations (stressor intensity task, WS task version in Study 1, and WS task/video task in Study 2). Additionally, as recommended by Reviewer 1, we have provided more information in the methods section and Table S3 regarding the details of on-screen written feedback provided to participants after each trial of the WS Task.

      (4) The analysis of internal consistency is based on splitting the data into odd/even sliders. This choice of data parcellation may cause missed drifts in task performance due to learning, practice effects, or tiredness, thus potentially inflating internal consistency.

      We agree that this can indeed be an issue, though drift is likely to be present in any task, and even in mood during resting state (Jangraw et al., 2023). To respond to this specific point, we parcellated the timepoints into a 1st/2nd half split and report the ICC in the supplementary information. While values are lower, indeed likely due to systematic drifts in task performance as participants learn to perform the task (especially for Study 2, since the order of parameters was designed to get easier throughout the experiment), the ICC values are still high. Control sliders: Study 1 = 0.82, Study 2 = 0.68; Difficulty sliders: Study 1 = 0.84, Study 2 = 0.57; Stress sliders: Study 1 = 0.45, Study 2 = 0.71. As seen, the lowest ICC is for stress sliders in Study 1. This may be because the first 3 sliders (included in the 1st half split) were all related to the stress task (initial, post-stress task, post-debrief) and the final 4 sliders (in the 2nd half split) were the three sliders during the WS task and the one shortly afterwards.
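
      To illustrate the reviewer's concern, the sketch below compares an odd/even split with a first/second-half split on invented slider ratings. It is only a minimal illustration under stated assumptions: the data are hypothetical, and it uses a Spearman-Brown corrected split-half correlation as a simpler stand-in for the ICC reported in the study. The function names are ours, not taken from the study's analysis code.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def split_scores(ratings, scheme):
    """Return per-participant mean ratings for the two halves of a split."""
    half_a, half_b = [], []
    for row in ratings:
        if scheme == "odd_even":
            a, b = row[0::2], row[1::2]
        elif scheme == "first_second":
            mid = len(row) // 2
            a, b = row[:mid], row[mid:]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        half_a.append(sum(a) / len(a))
        half_b.append(sum(b) / len(b))
    return half_a, half_b


def split_half_reliability(ratings, scheme):
    """Spearman-Brown corrected split-half correlation (not an ICC)."""
    r = pearson(*split_scores(ratings, scheme))
    return 2 * r / (1 + r)


# Hypothetical ratings: one row per participant, one column per slider.
# Some participants drift upwards over time, others drift downwards.
ratings = [
    [10, 12, 14, 16, 18, 20],
    [20, 18, 16, 14, 12, 10],
    [30, 30, 30, 30, 30, 30],
    [40, 42, 44, 46, 48, 50],
    [50, 48, 46, 44, 42, 40],
]

odd_even = split_half_reliability(ratings, "odd_even")
first_second = split_half_reliability(ratings, "first_second")
print(f"odd/even: {odd_even:.3f}, first/second: {first_second:.3f}")
```

      In this toy data set the odd/even split yields the higher estimate, because interleaving the sliders averages out each participant's drift, whereas the first/second-half split exposes it.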

      (5) Study 2 manipulates the effect of domain (win versus loss WS task), but the interaction of this factor with stressor intensity is not included in the analysis.

      We agree that this would be a valuable analysis to include. We have run additional analyses (section Sensitivity and Exploratory Analyses, pages 24 and 25), testing the interaction of Domain (win or loss) with stressor intensity (and time) when predicting the stress buffering and stress relief effects. This revealed no significant main effects of domain or interactions including domain, suggesting that domain did not impact the stress induction or relief differently depending on whether it was followed by the high or low stressor intensity condition. While the control by time interaction (our main effect of interest) still held for stress induction in this more complex model, the control by time interaction did not hold for the stress relief. However, this more complex model did not provide a better fit for the data, motivating us to continue to draw conclusions from the original model specification with domain as a covariate (rather than an interaction).

      We outline these analyses on page 24 of the manuscript, as follows:

      “Third, we included the interaction of domain with stressor intensity and with time, to test whether the win or loss domain in the WS task significantly impacted stress induction or stress relief differently depending on stressor intensity. There were no significant effects or interactions of domain (Table S14) for stress induction or stress relief, and the main effect of interest (the interaction between time and control) still held for the stress induction (β= 10.20, SE=4.99, p=.041, Table S14), though was no longer significant for the stress relief (β= 6.72, SE=4.28, p=.117, Table S14). This more complex model did not significantly improve model fit (χ²(3)= 1.46, p=.691) compared to our original specification (with domain as a covariate rather than an interaction) and had slightly worse fit (higher AIC and BIC) than the original model (AIC = 5477.2 versus 5472.7, BIC = 5538.5 versus 5520.8).”
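
      The model comparison quoted above can be checked by hand. The following is a minimal sketch, assuming the reported statistic is a likelihood-ratio χ² with 3 degrees of freedom; it uses the closed-form upper-tail probability for 3 degrees of freedom so that no statistics library is needed. The function name is ours and purely illustrative.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0:
        return 1.0
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


# Values quoted in the response (complex model versus original model).
lrt_stat = 1.46
p_value = chi2_sf_df3(lrt_stat)

# Positive differences mean the more complex model has the worse (higher) value.
delta_aic = 5477.2 - 5472.7
delta_bic = 5538.5 - 5520.8

print(f"p = {p_value:.3f}, delta AIC = {delta_aic:.1f}, delta BIC = {delta_bic:.1f}")
```

      The resulting p-value is approximately 0.69, in line with the reported p=.691 once rounding of the χ² statistic is taken into account. Both information criteria are higher for the model with the added interactions, which is the basis for retaining the original specification.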

      This study will be of interest to psychologists and cognitive scientists interested in understanding how controllability and its subjective perception impact how people respond to stress exposure. Demonstrating that an increased sense of control buffers/protects against subsequent stress is important and may trigger further studies to characterize this phenomenon better. However, beyond the highlighted weaknesses, the current study only studied the effect of stress induction consecutive to the performance of the WS task on the same day and its generalizability is not warranted.

      We thank the reviewer for this assessment and agree that we cannot assume these findings would generalise to more prolonged effects on stress responses.

      Reviewer #3 (Public review):

      Summary:

      This is an interesting investigation of the benefits of perceiving control and its impact on the subjective experience of stress. To assess a subjective sense of control, the authors introduce a novel wheel-stopping (WS) task where control is manipulated via size and speed to induce low and high control conditions. The authors demonstrate that the subjective sense of control is associated with experienced subjective stress and individual differences related to mental health measures. In a second experiment, they further show that an increased sense of control buffers subjective stress induced by a Trier Social Stress manipulation, more so than a more typical stress buffering mechanism of watching neutral/calming videos.

      We agree with this accurate summary of our study. 

      Strengths:

      There are several strengths to the manuscript that can be highlighted. For instance, the paper introduces a new paradigm and a clever manipulation to test an important and significant question. Additionally, it is a well-powered investigation that allows for confidence in replicability and the ability to show both high internal consistency and high external validity with an interesting set of individual difference analyses. Finally, the results are quite interesting and support prior literature while also providing a significant contribution to the field with respect to understanding the benefits of perceiving control.

      We thank the reviewer for this positive assessment. 

      Weaknesses:

      There are also some questions that, if addressed, could help our readership.

      (1) A key manipulation was the high-intensity stressor (Anticipatory TSST signal), which was measured via subjective ratings recorded on a sliding scale at different intervals during testing. Typically, the TSST conducted in the lab is associated with increases in cortisol assessments and physiological responses (e.g., skin conductance and heart rate). The current study is limited to subjective measures of stress, given the online nature of the study. Since TSST online may also yield psychologically different results than in the lab (i.e., presumably in a comfortable environment, not facing a panel of judges), it would be helpful for the authors to briefly discuss how the subjective results compare with other examples from the literature (either online or in the lab). The question is whether the experienced stress was sufficiently stressful given that it was online and measured via subjective reports. The control condition (low intensity via reading recipes) is helpful, but the low-intensity stress does not seem to differ from baseline readings at the beginning of the experiment.

      We agree that it would be helpful to expand on this further. Similar to the comment made by Reviewer 1, we wish to point out that there are ample demonstrations that such anticipatory stress inductions are effective at reliably eliciting physiological and psychological stress responses (e.g. Nasso et al., 2019; Schlatter et al., 2021; Steinbeis et al., 2015). Further, there is evidence that online versions of the TSST are also effective (DuPont et al., 2022; Meier et al., 2022), including evidence that the speech preparation phase conducted online was related to increases in heart rate and blood pressure (DuPont et al., 2022). We have elaborated further on this in our manuscript on pages 8 and 9 as follows:

      “Prior research has found TSST anticipation to elicit both psychological and physiological stress responses [37-39], suggesting that the task anticipation would be a valid stress induction despite participants not performing the speech task. Moreover, prior research has validated the use of remote TSST in online settings [40, 41], including evidence that the speech preparation phase (online) was related to increased heart rate and blood pressure compared to controls [40].”

      (2) The neutral videos represent an important condition to contrast with WS, but it raises two questions. First, the conditions are quite different in terms of experience, and it is interesting to consider what another more active (but not controlled per se) condition would be in comparison to the WS performance. That is, there is no instrumental action during the neutral video viewing (even passive ratings about the video), and the active demands could be an important component of the ability to mitigate stress. Second, the subjective ratings of the stress of the neutral video appear equivalent to the win condition. Would it have been useful to have a high arousal video (akin to the loss condition) to test the idea that experience of control will buffer against stress? That way, the subjective stress experience of stress would start at equivalent points after WS3.

      We agree with the reviewer that this is an important issue to clarify. In our deliberations when designing this study, we considered that any task with action-outcome contingencies would have a degree of controllability. To better distinguish experiences of control (WS task) from an experience of no/neutral control (i.e., neither high nor low controllability), we decided to use a task in which no actions were required during the task itself. Importantly, however, there was an active demand and concentration was still required in order to perform the attention checks regarding the content of the videos and ratings of the videos.

      Thank you for the suggestion of having a high arousal video condition. This would indeed be interesting to test how experiencing ‘neutral’ control and high(er) stress levels preceding the stressor task influences stress buffering and stress relief, and we have included this suggestion for future research in the discussion section (page 28) as below:

      “Another avenue for future research would be to test how control buffers against stress when compared to a neutral control scenario of higher stress levels, akin to the loss domain in the WS Task, given that participants found the video condition generally relaxing. However, given that we found no differences dependent on domain for the stress induction in the WS Task conditions, it is possible that different versions of a neutral control condition would not impact the stress induction.”

      (3) For the stress relief analysis, the authors included time points 2 and 3 (after the stressor and debrief) but not a baseline reading before stress. Given the potential baseline differences across conditions, can this decision be justified in the manuscript?

      We thank the reviewer for raising this. Regarding the stress relief analyses (timepoints 2 and 3) and not including timepoint 1 (after the WS/video task) stress in the model, we have added to the manuscript that there was no significant difference in stress ratings between the high control and neutral control (collapsed across stress and domain) at timepoint 1 (hence why we do not think it’s necessary to include in the stress relief model). Nevertheless, we have now included a sensitivity analysis to test the Timepoint*Control interaction of stress relief when including timepoint 1 stress as a covariate. The timepoint by control interaction still holds, suggesting that the initial stress level prior to the stress induction does not impact our results of interest. The details of this analysis are included in the Sensitivity and Exploratory Analyses section on page 24:

      “Although there were no significant differences between control groups in subjective stress immediately after the WS/video task (t(175.6)=1.17, p=.244), we included participants’ stress level after the WS/video task as a covariate in the stress relief analyses (Table S12). The results revealed a main effect of initial stress (β= 0.643, SE=0.040, p<.001, Table S12) on the stress relief after the stressor debrief. Compared to excluding initial stress as in the original analyses (Table 4), there was now no longer a main effect of domain (β= 0.236, SE=2.60, p=.093, Table S12), but the inference of all other effects remained the same. Importantly, there was still a significant time by control interaction (β= 9.65, SE=3.74, p=.010, Table S12) showing that the decrease in stress after the debrief was greater in the highly controllable WS condition than the neutral control video condition, even when accounting for the initial stress level.”

      (4) Is the increased control experience during the losses condition more valuable in mitigating experienced stress than the win condition?

      We agree that this would be helpful to clarify. To test whether the loss domain was more valuable at mitigating experiences of stress than the win condition, we ran additional analyses with just the high control condition (WS task) to test for a Domain*Time interaction. This revealed no significant Domain*Time interaction, suggesting that the stress buffering or stress relief effect was not dependent on domain in the high control conditions. These analyses are outlined in the Sensitivity and Exploratory Analyses section on page 25:

      “Finally, to test whether the loss domain was more valuable at mitigating experiences of stress than the win condition, we ran additional analyses with just the high control condition (WS task) for the stress induction and stress relief to test for an interaction of domain and time. For the stress induction, there was no significant two-way interaction of domain and time (β= -1.45, SE=4.80, p=.763), nor a significant three-way interaction of domain by time by stressor intensity (β= -3.96, SE=6.74, p=.557, Table S15), suggesting that there were no differences in the stress induction dependent on domain. Similarly for the stress relief, there was no significant two-way interaction of domain and time (β= -5.92, SE=4.42, p=.182), nor a significant three-way interaction of domain by time by stressor intensity (β= 8.86, SE=6.21, p=.154, Table S15), suggesting that there were no differences in the stress relief dependent on the WS Task domain.”

      (5) The subjective measure of control ("how in control do you feel right now") tends to follow a successful or failed attempt at the WS task. How much is the experience of control mediated by the degree of experienced success/schedule of reinforcement? Is it an assessment of control, or an evaluation of how well they are doing and/or resolution of uncertainty? An interesting paper by Cockburn et al. 2014 highlights the potential for positive prediction errors to enhance the desire for control.

      We thank the reviewer for this comment. Similar to comments regarding reward rate, our task does not allow us to fully separate control from success/reinforcement because of the manipulation of difficulty. However, we did undertake sensitivity analyses and the inclusion of overall win rate accounted for limited variance when predicting stress over and above subjective control and difficulty (page 16). 

      “To further isolate the relationship between subjective control and stress separate from perceived task difficulty or objective task performance, we also included the overall win rate (percentage of trials won during the WS task) in the models. In Study 1, lower feelings of control were related to higher levels of subjective stress (β= -0.12, p<.001) even when controlling for both win rate (β= -0.06, p=.220) and perceived task difficulty (β= 0.37, p<.001, Table S10). This also replicated in Study 2, where lower subjective control was associated with higher feelings of stress (β= -0.32, p<.001) when controlling for perceived task difficulty (β= 0.31, p<.001) and win rate (β= -0.11, p=.428, Table S11). This suggests that there is unique variance in subjective feelings of control, separate from task performance, relevant to subjective stress.”

      (6) While the authors do a very good job in their inclusion and synthesis of the relevant literature, they could also amplify some discussion in specific areas. For example, operationalizing task controllability via task difficulty is an interesting approach. It would be useful to discuss their approach (along with any others in the literature that have used it) and compare it to other typically used paradigms measuring control via presence or absence of choice, as mentioned by the authors briefly in the introduction.

      We are delighted to expand on this particular point and have done so in the Discussion on page 27:

      “Previous research typically accounts for different outcomes (e.g. punishment) by yoking controllable and uncontrollable conditions [3], though other work has manipulated the controllability of rewards by changing the reward rate (for example [30], where a decoy stimulus is rewarded 50% of the time in the low control condition but 80% in the high control condition). While our task design does not separate control from obtained reward, we are able to do so in the statistical analyses.”

      (7) The paper is well-written. However, it would be useful to expand on Figure 1 to include a) separate figures for study 1 (currently not included) and 2, and b) a timeline that includes the measurements of subjective stress (incorporated in Figure 1). It would also be helpful to include Figure S4 in the manuscript.

      We have expanded Figure 1 to include both Studies 1 and 2 and a timeline of when subjective stress was assessed throughout the experiment as well as adding Figure S4 to the main manuscript (now top panel within Figure 4). 

      Reviewer #1 (Recommendations for the authors):

      (1) Study 2 shows a greater decrease in subjective stress after the high-control task manipulation than after the pleasant video. One possible confound is whether the amount of time to complete the WS task and the video differ. It could be helpful to look at the average completion time for the WS task and compare that to the length of the videos. Alternatively, in future studies, control for this by dynamically adjusting the video play length to each participant based on how long they took to complete the WS task.

      This is an interesting suggestion. As a result, we have included the time taken as a covariate in the stress induction and stress relief analyses to ensure that any differences in time between the WS task and video task were not accounting for any of the stress induction or relief analyses. Controlling for the total time taken did not impact the stress induction or relief results. This is included in the Sensitivity and Exploratory Analyses section on page 24:

      “Our second sensitivity analysis was conducted because the experiment took longer to complete for the video condition (mean = 54.3 minutes, SD = 12.4 minutes) than the WS task condition (mean = 39.7 minutes, SD = 12.8 minutes, t(186.19)=-9.32, p<.001). We therefore included the total time (in ms) as a covariate in the stress induction and stress relief analyses for Study 2. This showed that accounting for total time did not change the results of interest (Table S13), further highlighting that the time by control interactions were robust.”
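
      The fractional degrees of freedom reported above (186.19) are consistent with a Welch (unequal-variance) t-test, although the excerpt does not name the test. The following is a minimal sketch, under that assumption, of how such a statistic and its Welch-Satterthwaite degrees of freedom can be computed from summary statistics. The function name `welch_t` and the group sizes in the usage line are hypothetical; the excerpt does not report the per-condition sample sizes.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, standard deviations, and sample sizes."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Means and SDs are taken from the excerpt (WS task vs. video, in minutes);
# the group sizes (95 each) are placeholders chosen only for illustration.
t_stat, df = welch_t(39.7, 12.8, 95, 54.3, 12.4, 95)
```

      With equal group sizes and equal variances the degrees of freedom reduce to the familiar n1 + n2 - 2; unequal variances or sizes pull the value below that, which is why it is generally not an integer.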

      (2) Because participants received feedback about their success/failure in the WS task, a confounding factor could be that they received positive feedback on highly controllable trials and negative feedback on low control trials (and/or highly difficult trials). This would suggest that it is not controllability per se that contributes to stress perception but rather feedback valence. The authors show that this is a likely factor in their results in Study 2, which shows significant effects of the loss domain on perceived control and stress. Was a similar analysis done in Study 1? Do participants receive feedback in Study 1? It would be helpful to include this information somewhere in the manuscript. I would be curious to know whether *any* feedback at all influences controllability/stress perceptions.

      We thank the reviewer for this interesting suggestion. It is an interesting question as to whether feedback valence is related to stress in Study 1, and we have added this point to the Discussion on pages 27 and 28. To speak to this point, when we include the overall win rate (which captures the subsequent feedback received) when predicting subjective stress, win rate is not a significant predictor of stress over and above perceived difficulty and subjective control, suggesting that overall feedback valence may not be related to stress in Study 1. We take this as evidence that feedback may not be as important in terms of accounting for the relationship between stress and control. However, we unfortunately do not have any data in which there was no feedback provided to speak to this conclusively. This would be an interesting future study. The excerpt below is added to pages 27 and 28 of the discussion section:

      “Like with perceived difficulty, we statistically accounted for reward rate and showed that the relationship between subjective control and stress was not accounted for by reward rate, for example. Similarly, participants received feedback after every trial, and thus feedback valence may contribute to stress perception. However, given that overall win rate (which captures the feedback received during the task) did not predict stress over and above perceived difficulty or subjective control, it suggests that feedback is unlikely to relate to stress over and above difficulty. Future work will need to disentangle this further to rule out such potential confounds.”

      To respond specifically to the reviewer’s question about the feedback given to participants, written feedback was provided on screen to participants on a trial-by-trial basis also in Study 1 (i.e. for both studies), and we have provided more clarity about this in the manuscript on page 8 as well as providing additional details in Table S3:

      “After each trial, participants were shown written feedback on screen as to whether the segment had successfully stopped on the red zone (or not), and the associated reward (or lack of). See Table S3 for details.”

      (3) I'm not sure how to interpret the fact that in Figure S1, the BICs are all essentially the same. Does this mean that you don't really need all of these varying aspects of the task to achieve the same effects? Could the task be made simpler?

      The similarity of BIC values suggests that a simpler WS task would have produced a worse account of the data approximately in keeping with the extent to which it is a simpler model. Here, the BIC scores for the models are similar, suggesting that adding these parameters adds explanatory power in keeping with what would have been expected from adding a parameter, but not more. We do note that the BIC is a relatively strict and conservative comparison. The fact that the most complex model overall narrowly improves parsimony, combined with the interpretable parameter values and the prior expectations given the task setup, led us to focus on this most complex model.
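
      To make the trade-off described above concrete, here is a minimal sketch of the standard BIC definition, BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of observations, and L the maximized likelihood. The log-likelihoods, parameter counts, and n below are hypothetical values chosen for illustration and are not taken from the study.

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion; lower values indicate a better
    balance of fit and parsimony."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical example: the richer model gains 3 log-likelihood units
# for one extra parameter.
n_obs = 200
bic_simple = bic(log_likelihood=-520.0, n_params=3, n_obs=n_obs)
bic_complex = bic(log_likelihood=-517.0, n_params=4, n_obs=n_obs)

# Each extra parameter costs ln(n_obs) BIC units, so the richer model
# wins only if it improves 2 * log-likelihood by more than that.
penalty_per_param = math.log(n_obs)
```

      In this illustration the extra parameter improves 2 ln(L) by 6 while costing about 5.3, so the two BIC values end up close together: the added parameter earns roughly what it costs, which is the situation the response describes.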

      (4) A minor point, but the authors refer to their sample as "neurotypical." Were they assessed for prior/current psychopathology/medications? If not, I might use a different term here (perhaps "non-clinical sample"), since some prior work has shown that online samples actually have higher instances of psychopathology compared to community samples.

      We have changed the phrasing of ‘neurotypical’ to a ‘non-clinical sample’ as recommended.

      Reviewer #2 (Recommendations for the authors):

      Figure S4 is very informative and could be presented in the main text.

      We have expanded Figure 1 to include both Studies 1 and 2 and a timeline of when subjective stress was assessed throughout the experiment as well as adding Figure S4 to the main manuscript (top panel of Figure 4). 

      References:

      Dorfman, H. M., & Gershman, S. J. (2019). Controllability governs the balance between Pavlovian and instrumental action selection. Nature Communications, 10(1), 5826. https://doi.org/10.1038/s41467-019-13737-7

      DuPont, C. M., Pressman, S. D., Reed, R. G., Manuck, S. B., Marsland, A. L., & Gianaros, P. J. (2022). An online Trier social stress paradigm to evoke affective and cardiovascular responses. Psychophysiology, 59(10), e14067. https://doi.org/10.1111/psyp.14067

      Jangraw, D. C., Keren, H., Sun, H., Bedder, R. L., Rutledge, R. B., Pereira, F., Thomas, A. G., Pine, D. S., Zheng, C., Nielson, D. M., & Stringaris, A. (2023). A highly replicable decline in mood during rest and simple tasks. Nature Human Behaviour, 7(4), 596–610. https://doi.org/10.1038/s41562-023-01519-7

      Meier, M., Haub, K., Schramm, M.-L., Hamma, M., Bentele, U. U., Dimitroff, S. J., Gärtner, R., Denk, B. F., Benz, A. B. E., Unternaehrer, E., & Pruessner, J. C. (2022). Validation of an online version of the trier social stress test in adult men and women. Psychoneuroendocrinology, 142, 105818. https://doi.org/10.1016/j.psyneuen.2022.105818

      Nasso, S., Vanderhasselt, M.-A., Demeyer, I., & De Raedt, R. (2019). Autonomic regulation in response to stress: The influence of anticipatory emotion regulation strategies and trait rumination. Emotion, 19(3), 443–454. https://doi.org/10.1037/emo0000448

      Schlatter, S., Schmidt, L., Lilot, M., Guillot, A., & Debarnot, U. (2021). Implementing biofeedback as a proactive coping strategy: Psychological and physiological effects on anticipatory stress. Behaviour Research and Therapy, 140, 103834. https://doi.org/10.1016/j.brat.2021.103834

      Steinbeis, N., Engert, V., Linz, R., & Singer, T. (2015). The effects of stress and affiliation on social decision-making: Investigating the tend-and-befriend pattern. Psychoneuroendocrinology, 62, 138–148. https://doi.org/10.1016/j.psyneuen.2015.08.003

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The manuscript reports a series of experiments designed to test whether optogenetic activation of infralimbic (IL) neurons facilitates extinction retrieval and whether this depends on animals' prior experience. In Experiment 1, rats underwent fear conditioning followed by either one or two extinction sessions, with IL stimulation given during the second extinction; stimulation facilitated extinction retrieval only in rats with prior extinction experience. Experiments 2 and 3 examined whether backward conditioning (CS presented after the US) could establish inhibitory properties that allowed IL stimulation to enhance extinction, and whether this effect was specific to the same stimulus or generalized to different stimuli. Experiments 5 - 7 extended this approach to appetitive learning: rats received backward or forward appetitive conditioning followed by extinction, and then fear conditioning, to determine whether IL stimulation could enhance extinction in contexts beyond aversive learning and across conditioning sequences. Across studies, the key claim is that IL activation facilitates extinction retrieval only when animals possess a prior inhibitory memory, and that this effect generalizes across aversive and appetitive paradigms.

      Strengths:

      (1) The design attempts to dissect the role of IL activity as a function of prior learning, which is conceptually valuable.

      We thank the Reviewer for their positive assessment.

      (2) The experimental design of probing different inhibitory learning approaches to probe how IL activation facilitates extinction learning was creative and innovative.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) Non-specific manipulation.

      ChR2 was expressed in IL without distinction between glutamatergic and GABAergic populations. Without knowing the relative contribution of these cell types or the percentage of neurons affected, the circuit-level interpretation of the results is unclear.

      ChR2 was intentionally expressed in the infralimbic cortex (IL) without distinction between local neuronal populations for two reasons. First, this manuscript aimed to uncover some of the features characterizing the encoding of inhibitory memories in the IL, and this encoding likely engages interactions among various neuronal populations within the IL. Second, the hypotheses tested in the manuscript derived from findings that indiscriminately stimulated the IL using the GABA-A receptor antagonist picrotoxin, which is best mimicked by the approach taken. We agree that it is also important to determine the respective contributions of distinct IL neuronal populations to inhibitory encoding; however, the global approach implemented in the present experiments represents a necessary initial step. This rationale will be incorporated into the revised manuscript, which will also make reference to the need to identify the relative contributions of the various neuronal populations within the IL.

      (2) Extinction retrieval test conflates processes

      The retrieval test included 8 tones. Averaging across this many tone presentations conflates extinction retrieval/expression (early tones) with further extinction learning (later tones). A more appropriate analysis would focus on the first 2-4 tones to capture retrieval only. As currently presented, the data do not isolate extinction retrieval.

      It is unclear when retrieval of what has been learned across extinction ceases and additional extinction learning occurs. In fact, it is only the first stimulus presentation that unequivocally permits a distinction between retrieval and additional extinction learning, as the conditions for this additional learning have not been fulfilled at that presentation. However, confining evidence for retrieval to the first stimulus presentation introduces concerns that other factors could influence performance. For instance, processing of the stimulus present at the start of the session may differ from that present at the end of the previous session, thereby affecting what is retrieved. Such differences between the stimuli present at the start and end of an extinction session have been long recognized as a potential explanation for spontaneous recovery (Estes, 1955). More importantly, whether the test data presented confound retrieval and additional extinction learning or not, the interpretation remains the same with respect to the effects of a prior history of inhibitory learning on enabling the facilitative effects of IL stimulation. Finally, it is unclear how these facilitative effects could occur in the absence of the subjects retrieving the extinction memory formed under the stimulation. Nevertheless, the revised manuscript will provide the trial-by-trial performance during the post-extinction retrieval tests and discuss this issue.

      (3) Under-sampling and poor group matching.

      Sample sizes appear small, which may explain why groups are not well matched in several figures (e.g., 2b, 3b, 6b, 6c) and why there are several instances of unexpected interactions (protocol, virus, and period). This baseline mismatch raises concerns about the reliability of group differences.

      Efforts were made to match group performance upon completion of each training stage and before IL stimulation. Unfortunately, these efforts were not completely successful due to exclusions following post-mortem analyses. However, we acknowledge that the unexpected interactions deserve further discussion, and this will be incorporated into the revised manuscript (see also comment from Reviewer 2). Although we cannot exclude that sample sizes may have contributed to some of these interactions, we remain confident about the reliability of the main findings reported, especially given their replication across the various protocols. Overall, the manuscript provides evidence that IL stimulation does not facilitate brief extinction in the absence of prior inhibitory experience in five different experiments, replicating previous findings (Lingawi et al., 2018; Lingawi et al., 2017). It also replicates these previous findings by showing that prior experience with either fear or appetitive extinction enables IL stimulation to facilitate subsequent fear extinction. Furthermore, the facilitative effects of such stimulation following fear or appetitive backward conditioning are replicated in the present manuscript.  

      (4) Incomplete presentation of conditioning data.

      Figure 3 only shows a single conditioning session despite five days of training. Without the full dataset, it is difficult to evaluate learning dynamics or whether groups were equivalent before testing.

      We apologize, as we incorrectly labeled the X axis for the backward conditioning data set in Figures 3B, 4B, 4D and 5B. It should have indicated “Days” instead of “Trials”. This error will be corrected in the revised manuscript.

      (5) Interpretation stronger than evidence.

      The authors conclude that IL activation facilitates extinction retrieval only when an inhibitory memory has been formed. However, given the caveats above, the data are insufficient to support such a strong mechanistic claim. The results could reflect non-specific facilitation or disruption of behavior by broad prefrontal activation. Moreover, there is compelling evidence that optogenetic activation of IL during fear extinction does facilitate subsequent extinction retrieval without prior extinction training (Do-Monte et al 2015, Chen et al 2021), which the authors do not directly test in this study.

      As noted above, the revised manuscript will show that the interpretations of the main findings stand whether or not the test data confound retrieval with additional extinction learning. The revised manuscript will also clarify the plotting of the data for the backward conditioning stages. We do agree that further discussion of the unexpected interactions is necessary, and this will also be incorporated into the revised manuscript. However, the various replications of the core findings provide strong evidence for their reliability and the interpretations advanced in the original manuscript. The proposal that the results reflect non-specific facilitation or disruption of behavior seems highly unlikely. Indeed, the present experiments and previous findings (Lingawi et al., 2018; Lingawi et al., 2017) provide multiple demonstrations that IL stimulation fails to produce any facilitation in the absence of prior inhibitory experience with the target stimulus. Although these demonstrations appear inconsistent with previous studies (Do-Monte et al., 2015; Chen et al., 2021), this inconsistency is likely explained by the fact that these studies manipulated activity in specific IL neuronal populations. Previous work has already revealed differences between manipulations targeting discrete IL neuronal populations as opposed to general IL activity (Kim et al., 2016). Importantly, as previously noted, the present manuscript aimed to generally explore inhibitory encoding in the IL that, as we will acknowledge, is likely to engage several neuronal populations within the IL. Adequate statements on these matters will be included in the revised manuscript.

      Impact:

      The role of IL in extinction retrieval remains a central question in the fear learning literature. However, because the test used conflates extinction retrieval with new learning and the manipulations lack cell-type specificity, the evidence presented here does not convincingly support the main claims. The study highlights the need for more precise manipulations and more rigorous behavioral testing to resolve this issue.

      As noted in our responses, the interpretations of the data presented remain identical whether the test data conflate extinction retrieval with additional extinction learning or not. Although we agree that it is important to establish the role of specific IL neuronal populations in extinction learning, this was beyond the scope of the manuscript and the findings reported remain valuable to our understanding of inhibitory encoding within the IL.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors examine the mechanisms by which stimulation of the infralimbic cortex (IL) facilitates the retention and retrieval of inhibitory memories. Previous work has shown that optogenetic stimulation of the IL suppresses freezing during extinction but does not improve extinction recall when extinction memory is probed one day later. When stimulation occurs during a second extinction session (following a prior stimulation-free extinction session), freezing is suppressed during the second extinction as well as during the tone test the following day. The current study was designed to further explore the facilitatory role of the IL in inhibitory learning and memory recall. The authors conducted a series of experiments to determine whether recruitment of IL extends to other forms of inhibitory learning (e.g., backward conditioning) and to inhibitory learning involving appetitive conditioning. Further, they assessed whether their effects could be explained by stimulus familiarity. The results of their experiments show that backward conditioning, another form of inhibitory learning, also enabled IL stimulation to enhance fear extinction. This phenomenon was not specific to aversive learning, as backward appetitive conditioning similarly allowed IL stimulation to facilitate extinction of aversive memories. Finally, the authors ruled out the possibility that IL facilitated extinction merely because of prior experience with the stimulus (e.g., reducing the novelty of the stimulus). These findings significantly advance our understanding of the contribution of IL to inhibitory learning. Namely, they show that the IL is recruited during various forms of inhibitory learning, and its involvement is independent of the motivational value associated with the unconditioned stimulus.

      Strengths:

      (1) Transparency about the inclusion of both sexes and the representation of data from both sexes in figures.

      We thank the Reviewer for their positive assessment.

      (2) Very clear representation of groups and experimental design for each figure.

      We thank the Reviewer for their positive assessment.

      (3) The authors were very rigorous in determining the neurobehavioral basis for the effects of IL stimulation on extinction. They considered multiple interpretations and designed experiments to address these possible accounts of their data.

      We thank the Reviewer for their positive assessment.

      (4) The rationale for and the design of the experiments in this manuscript are clearly based on a wealth of knowledge about learning theory. The authors leveraged this expertise to narrow down how the IL encodes and retrieves inhibitory memories.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) In Experiment 1, although not statistically significant, it does appear as though the stimulation groups (OFF and ON) differ during Extinction 1. It seems like this may be due to a difference between these groups after the first forward conditioning. Could the authors have prevented this potential group difference in Extinction 1 by re-balancing group assignment after the first forward conditioning session to minimize the differences in fear acquisition (the authors do report a marginally significant effect between the groups that would undergo one vs. two extinction sessions in their freezing during the first conditioning session)?

      As noted (see response to Reviewer 1), efforts were made daily to match group performance across the training stages, but these efforts were ultimately hampered by the necessary exclusions following post-mortem analyses. This will be made explicit in the revised manuscript. Regarding freezing during Extinction 1, as noted by the Reviewer, the difference, which was not statistically significant, was absent across trials during the subsequent forward fear conditioning stage. Likewise, the protocol difference observed during the initial forward fear conditioning was absent in subsequent stages. We are therefore confident that these initial differences (significant or not) did not impact the main findings at test. Importantly, these findings replicate previous work using identical protocols in which no differences were present during the training stages. These considerations will be addressed in the revised manuscript.

      (2) Across all experiments (except for Experiment 1), the authors state that freezing during the initial conditioning increased across "days". The figures that correspond to this text, however, show that freezing changes across trials. In the methods, the authors report that backward conditioning occurred over 5 days. It would be helpful to understand how these data were analyzed and collated to create the final figures. Was the freezing averaged across the five days for each trial for analyses and figures?

      We apologize; as noted above, we incorrectly labeled the X axis for the backward conditioning data sets in Figures 3B, 4B, 4D and 5B. It should have indicated “Days” instead of “Trials”. The data shown in these Figures use the average of all trials on a given day. This will be clarified in the methods section of the revised manuscript. The labeling errors on the Figures will be corrected.
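
      The collation described above (one value per day, obtained by averaging all trials of that day) can be sketched as follows. This is an illustrative sketch only: the record layout `(day, trial, freezing)` and the function name `mean_by_day` are assumptions, and the numbers are made up rather than drawn from the experiments.

```python
from collections import defaultdict
from statistics import mean

def mean_by_day(records):
    """Average freezing across all trials within each day.

    `records` is an iterable of (day, trial, freezing_percent) tuples
    for a single animal or group (hypothetical layout).
    """
    by_day = defaultdict(list)
    for day, _trial, freezing in records:
        by_day[day].append(freezing)
    # One point per day, in day order, as would be plotted on a "Days" axis.
    return {day: mean(values) for day, values in sorted(by_day.items())}

# Made-up example: two trials on day 1, one trial on day 2.
example = [(1, 1, 10.0), (1, 2, 30.0), (2, 1, 50.0)]
daily = mean_by_day(example)
```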

      (3) In Experiment 3, the authors report a significant Protocol X Virus interaction. It would be useful if the authors could conduct post-hoc analyses to determine the source of this interaction. Inspection of Figure 4B suggests that freezing during the two different variants of backward conditioning differs between the virus groups. Did the authors expect to see a difference in backward conditioning depending on the stimulus used in the conditioning procedure (light vs. tone)? The authors don't really address this confounding interaction, but I do think a discussion is warranted.

      We agree with the Reviewer that further discussion of the Protocol x Virus interaction that emerged during the backward conditioning and forward conditioning stages of Experiment 3 is warranted. This will be provided in the revised manuscript. Briefly, during both stages, follow-up analyses did not reveal any differences (main effects or interactions) between the two groups trained with the light stimulus (Diff-EYFP and Diff-ChR2). By contrast, the ChR2 group trained with the tone (Back-ChR2) froze more overall than the EYFP group (Back-EYFP), but there were no other significant differences between the two groups. Based on these analyses, the Protocol x Virus interaction appears to be driven by greater freezing in the ChR2 group trained with the tone rather than a difference in the backward conditioning performance based on stimulus identity. Consistent with this, the statistical analyses did not reveal a main effect of Protocol during either the backward conditioning stage or the stimulus trials during the forward conditioning stage. Nevertheless, during this latter stage, a main effect of Protocol emerged during baseline performance, but once again, this seems to be driven by the Back-ChR2 group. Critically, it is unclear how greater stimulus freezing in the Back-ChR2 group during forward conditioning would lead to lower freezing during the post-extinction retrieval test.  

      (4) In this same experiment, the authors state that freezing decreased during extinction; however, freezing in the Diff-EYFP group at the start of extinction (first bin of trials) doesn't look appreciably different than their freezing at the end of the session. Did this group actually extinguish their fear? Freezing on the tone test day also does not look too different from freezing during the last block of extinction trials.

      We confirm that overall, there was a significant decline in freezing across the extinction session shown in Figure 4B. The Reviewer is correct to point out that this decline was modest (if not negligible) in the Diff-EYFP group, which was receiving its first inhibitory training with the target tone stimulus. It is worth noting that across all experiments, most groups that did not receive infralimbic stimulation displayed a modest decline in freezing during the extinction session since it was relatively brief, involving only 6 or 8 tone alone presentations. This was intentional, as we aimed for the brief extinction session to generate minimal inhibitory learning and thereby to detect any facilitatory effect of infralimbic stimulation. This issue will be clarified and explained in the revised version of the manuscript.

      (5) The Discussion explored the outcomes of the experiments in detail, but it would be useful for the authors to discuss the implications of their findings for our understanding of circuits in which the IL is embedded that are involved in inhibitory learning and memory. It would also be useful for the authors to acknowledge in the Discussion that although they did not have the statistical power to detect sex differences, future work is needed to explore whether IL functions similarly in both sexes.

      In line with the Reviewer’s suggestion (see also Reviewer 3), the revised manuscript will include a discussion of the broader implications of the findings regarding inhibitory brain circuitry and will acknowledge the need to further explore sex differences and IL functions.

      Reviewer #3 (Public review):

      Summary:

      This is a really nice manuscript with different lines of evidence to show that the IL encodes inhibitory memories that can then be manipulated by optogenetic stimulation of these neurons during extinction. The behavioral designs are excellent, with converging evidence using extinction/re-extinction, backwards/forwards aversive conditioning, and backwards appetitive/forwards aversive conditioning. Additional factors, such as nonassociative effects of the CS or US, are also considered, and the authors evaluate the inhibitory properties of the CS with tests of conditioned inhibition.

      Strengths:

      The experimental designs are very rigorous with an unusual level of behavioral sophistication.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) More justification for parametric choices (number of days of backwards vs forwards conditioning) could be provided.

      All experimental parameters were based on previously published experiments showing the capacity of the backward conditioning protocols to generate inhibitory learning and the forward conditioning protocols to produce excitatory learning. Although this was mentioned in the methods section, we acknowledge that further explanation is required to justify the need for multiple days of backward training. This will be provided in the revised manuscript.

      (2) The current discussion could be condensed and could focus on broader implications for the literature.

      The revised manuscript will make an effort to condense the discussion and focus on broader implications for the literature.

      References

      Chen, Y.-H., Wu, J.-L., Hu, N.-Y., Zhuang, J.-P., Li, W.-P., Zhang, S.-R., Li, X.-W., Yang, J.-M., & Gao, T.-M. (2021). Distinct projections from the infralimbic cortex exert opposing effects in modulating anxiety and fear. J Clin Invest, 131(14), e145692. https://doi.org/10.1172/JCI145692

      Do-Monte, F. H., Manzano-Nieves, G., Quiñones-Laracuente, K., Ramos-Medina, L., & Quirk, G. J. (2015). Revisiting the role of infralimbic cortex in fear extinction with optogenetics. J Neurosci, 35(8), 3607-3615. https://doi.org/10.1523/JNEUROSCI.3137-14.2015

      Estes, W. K. (1955). Statistical theory of spontaneous recovery and regression. Psychol Rev, 62(3), 145-154. https://doi.org/10.1037/h0048509

      Kim, H.-S., Cho, H.-Y., Augustine, G. J., & Han, J.-H. (2016). Selective Control of Fear Expression by Optogenetic Manipulation of Infralimbic Cortex after Extinction. Neuropsychopharmacology, 41(5), 1261-1273. https://doi.org/10.1038/npp.2015.276

      Lingawi, N. W., Holmes, N. M., Westbrook, R. F., & Laurent, V. (2018). The infralimbic cortex encodes inhibition irrespective of motivational significance. Neurobiol Learn Mem, 150, 64-74. https://doi.org/10.1016/j.nlm.2018.03.001

      Lingawi, N. W., Westbrook, R. F., & Laurent, V. (2017). Extinction and Latent Inhibition Involve a Similar Form of Inhibitory Learning that is Stored in and Retrieved from the Infralimbic Cortex. Cereb Cortex, 27(12), 5547-5556. https://doi.org/10.1093/cercor/bhw322

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript presents a study on expectation manipulation to induce placebo and nocebo effects in healthy participants. The study follows standard placebo experiment conventions with the use of TENS stimulation as the placebo manipulation. The authors were able to achieve their aims. A key finding is that placebo and nocebo effects were predicted by recent experience, which is a novel contribution to the literature. The findings provide insights into the differences between placebo and nocebo effects and the potential moderators of these effects.

      Specifically, the study aimed to:

      (1) assess the magnitude of placebo and nocebo effects immediately after induction through verbal instructions and conditioning

      (2) examine the persistence of these effects one week later, and

      (3) identify predictors of sustained placebo and nocebo responses over time.

      Strengths:

      An innovation was to use sham TENS stimulation as the expectation manipulation. This expectation manipulation was reinforced not only by the change in pain stimulus intensity, but also by delivery of non-painful electrical stimulation, labelled as TENS stimulation.

      Questionnaire-based treatment expectation ratings were collected before conditioning and after conditioning, and after the test session, which provided an explicit measure of participants' expectations about the manipulation.

      The finding that placebo and nocebo effects are influenced by recent experience provides a novel insight into a potential moderator of individual placebo effects.

      We thank the reviewer for their thorough evaluation of our manuscript and for highlighting the novelty and originality of our study.

      Weaknesses:

      There are a limited number of trials per test condition (10), which means that the trajectory of responses to the manipulation may not be adequately explored.

      We appreciate the reviewer’s comment regarding the number of trials in the test phase. The trial number was chosen to ensure comparability with previous studies addressing similar research questions with similar designs (e.g. Colloca et al., 2010). Our primary objective was to directly compare placebo and nocebo effects within a within-subject design and to examine their persistence one week after the first test session. While we did not specifically aim to investigate the trajectory of responses within a single testing session, we fully agree that a comprehensive analysis of the trajectories of expectation effects on pain would be a valuable extension of our work. We have now acknowledged this limitation and future direction in the revised manuscript.

      The paragraph reads as follows: “It is important to note that our study was designed in alignment with previous studies addressing similar questions (e.g., Colloca et al., 2010). Our primary aim was to directly compare placebo and nocebo effects in a within-subject design and assess the persistence of these effects one week following the first test session. One limitation of our approach is the relatively short duration of each session, which may have limited our ability to examine the trajectory of responses within a single session. Future studies could address this limitation by increasing the number of trials for a more comprehensive analysis.”

      On day 8, one stimulus per stimulation intensity (i.e., VAS 40, 60, and 80) was applied before the start of the test session to re-familiarise participants with the thermal stimulation. There is a potential risk of revealing the manipulation to participants during the re-familiarization process, as they were not previously briefed to expect the painful stimulus intensity to vary without the application of sham TENS stimulation.

      We thank the reviewer for the opportunity to clarify this point. Participants were informed at the beginning of the experiment that we would use different stimulation intensities to re-familiarize them with the stimuli before the second test session. We are therefore confident that participants perceived this step as part of a recalibration rather than associating it with the experimental manipulation. We have added this information to the revised version of the manuscript.

      The paragraph now reads as follows: “On day 8, one stimulus per stimulation intensity (i.e., VAS 40, 60 and 80) was applied before the start of the test session to re-familiarise participants with the thermal stimulation. Note that participants were informed that these pre-test stimuli were part of the recalibration and refamiliarization procedure conducted prior to the second test session.”

      The differences between the nocebo and control conditions in pain ratings during conditioning could be explained by the differing physiological effects of the different stimulus intensities, so it is difficult to make any claims about expectation effects here.

      We appreciate the reviewer’s comment and agree that, despite the careful calibration of the three pain stimuli, we cannot entirely rule out the possibility that temporal dynamics during the conditioning session were influenced by differential physiological effects of the varying stimulus intensities (e.g., intensity-dependent habituation or sensitization). We have addressed this in the revision of the manuscript, but we would like to emphasize that the stronger nocebo effects during the test phase are statistically controlled for any differences in the conditioning session.

      The paragraph now reads: “This asymmetry is noteworthy in and of itself because it occurred despite the equidistant stimulus calibration relative to the control condition prior to conditioning. It may be the result of different physiological effects of the stimuli over time or amplified learning in the nocebo condition, consistent with its heightened biological relevance, but it could also be a stronger effect of the verbal instructions in this condition.”

      A randomisation error meant that 25 participants received an unbalanced number of trials per condition (i.e., 10 x VAS 40, 14 x VAS 60, 12 x VAS 80).

      We agree that this is indeed unfortunate. However, we would like to point out that all analyses reported in the manuscript have been controlled for the VAS ratings in the conditioning session, i.e., potential effects of the conditioned placebo and nocebo stimuli. Moreover, we have now conducted additional analyses, presented here in our response to the reviewers, to demonstrate that this imbalance did not systematically bias the results. Importantly, the key findings observed during the test phase remain robust despite this issue.

      Specifically, when excluding these 25 participants from the analyses, the reported stronger nocebo compared to placebo effects in the test session on day 1 remain unchanged. Likewise, the comparison of placebo and nocebo effects between days 1 and 8 shows the same pattern when excluding the participants in question. The only exception is the interaction between effect (placebo vs nocebo) x session (day 1 vs day 8), which changed from a borderline significant result (p = .049) to a non-significant one (p = .24). However, post hoc tests continued to show the same pattern as originally reported: a significant reduction in the nocebo effect from day 1 to day 8 and no significant change in the placebo effect.

      Reviewer #2 (Public review):

      Summary:

      Kunkel et al aim to answer a fundamental question: Do placebo and nocebo effects differ in magnitude or longevity? To address this question, they used a powerful within-participants design, with a very large sample size (n=104), in which they compared placebo and nocebo effects - within the same individuals - across verbal expectations, conditioning, testing phase, and a 1-week follow-up. With elegant analyses, they establish that different mechanisms underlie the learning of placebo vs nocebo effects, with the latter being acquired faster and extinguished slower. This is an important finding for both the basic understanding of learning mechanisms in humans and for potential clinical applications to improve human health.

      Strengths:

      Beyond the above - the paper is well-written and very clear. It lays out nicely the need for the current investigation and what implications it holds. The design is elegant, and the analyses are rich, thoughtful, and interesting. The sample size is large which is highly appreciated, considering the longitudinal, in-lab study design. The question is super important and well-investigated, and the entire manuscript is very thoughtful with analyses closely examining the underlying mechanisms of placebo versus nocebo effects.

      We thank the reviewer for their positive evaluation of our manuscript and for acknowledging the methodological rigor and the significant implications for clinical applications and the broader research field.

      Weaknesses:

      There were two highly addressable weaknesses in my opinion:

      (1) I could not find the preregistration - this is crucial to verify what analyses the authors have committed to prior to writing the manuscript. Please provide a link leading directly to the preregistration - searching for the specified number in the suggested website yielded no results.

      We thank the reviewer for pointing this out. We included a link to the preregistration in the revised manuscript. This study was pre-registered with the German Clinical Trial Register (registration number: DRKS00029228; https://drks.de/search/de/trial/DRKS00029228).

      (2) There is a recurring issue which is easy to address: because the Methods are located after the Results, many of the constructs used, analyses conducted, and even the main placebo and nocebo inductions are unclear, making it hard to appreciate the results in full. I recommend finding a way to detail at the beginning of the results section how placebo and nocebo effects have been induced. While my background means I am familiar with these methods, other readers will lack that knowledge. Even a short paragraph or a figure (like Figure 4) could help clarify the results substantially. For example, a significant portion of the results is devoted to the conditioning part of the experiment, while it is unknown which part was involved (e.g., were temperatures lowered/increased in all trials or only in the beginning).

      We thank the reviewer for their helpful comment and agree that the Results section requires additional information that would typically be provided by the Methods section if it directly followed the Introduction. In response, we have moved the former Figure 4 from the Methods section to the beginning of the Results section as a new Figure 1, to improve clarity. Further, we have revised the Methods section to explicitly state that all trials during the conditioning phase were manipulated in the same way.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Given that the authors are claiming (correctly) that there is only limited work comparing placebo/nocebo effects, there are some papers missing from their citations:

      Nocebo responses are stronger than placebo responses after subliminal pain conditioning: Jensen, K., Kirsch, I., Odmalm, S., Kaptchuk, T. J. & Ingvar, M. Classical conditioning of analgesic and hyperalgesic pain responses without conscious awareness. Proc. Natl. Acad. Sci. USA 112, 7863-7 (2015)

      We thank the reviewer and have now included this relevant publication into the introduction of the revised manuscript.

      Hird, E.J., Charalambous, C., El-Deredy, W. et al. Boundary effects of expectation in human pain perception. Sci Rep 9, 9443 (2019). https://doi.org/10.1038/s41598-019-45811-x

      We thank the reviewer for suggesting this relevant publication. We have now included it into the discussion of the revised manuscript by adding the following paragraph:

      “Recent work using a predictive coding framework further suggests that nocebo effects may be less susceptible to prediction error than placebo effects (Hird et al., 2019), which could contribute to their greater persistence and strength in our study.”

      (2) The trial-by-trial pain ratings could have been usefully modelled with a computational model, such as a Bayesian model (this is especially pertinent given the reference to Bayesian processing in the discussion). A multilevel model could also be used to increase the power of the analysis. This is a tentative suggestion, as I appreciate it would require a significant investment of time and work - alternatively, the authors could acknowledge it in the Discussion as a useful future avenue for investigation, if this is preferred.

      We thank the reviewer for this thoughtful suggestion. While we agree that computational modelling approaches could provide valuable insights into individual learning, our study was not designed with this in mind and the relatively small number of trials per condition and the absence of trial-by-trial expectancy ratings limit the applicability of such models. We have therefore chosen not to pursue such analysis but highlight it in the discussion as a promising direction for future research.

      “Notably, the most recent experience was the most predictive in all three analyses; for instance, the placebo effect on day 8 was predicted by the placebo effect on day 1, not by the initial conditioning. This finding supports the Bayesian inference framework, where recent experiences are weighted more heavily in the process of model updating because they are more likely to reflect the current state of the environment, providing the most relevant and immediate information needed to guide future actions and predictions [24]. Interestingly, while a change in pain predicted subsequent nocebo effects, it seemed less influential than for placebo effects. This aligns with findings that longer conditioning enhanced placebo effects, while it did not affect nocebo responses [10] and the conclusion that nocebo instruction may be sufficient to trigger nocebo responses. Using Bayesian modeling, future studies could identify individual differences in the development of placebo and nocebo effects by integrating prior experiences and sensory inputs, providing a probabilistic framework for understanding the underlying mechanisms.”

      (3) The paper is missing any justification of sample size, i.e. power analysis - please include this.

      We apologize for the missing information on our a priori power analysis. As there is a lack of prior studies investigating within-subjects comparisons of placebo and nocebo effects that could inform precise effect size estimates for our research question, we based our calculation on the ability to detect small effects. Specifically, the study was powered to detect effect sizes in the range of d = 0.2 - 0.25 with α = .05 and power = .9, yielding a required sample size of N = 83-129. We have now added this information to the methods section of the revised manuscript.

      (4) "On day 8, one stimulus per stimulation intensity (i.e., VAS 40, 60 and 80) was applied before the start of the test session to re-familiarise participants with the thermal stimulation."

      What were the instructions about this? Was it before the electrode was applied? This runs the risk of unblinding participants, as they only expect to feel changes in stimulus intensity due to the TENS stimulation.

      We thank the reviewer for pointing out the potential risk of unblinding participants due to the re-familiarization process prior to the second test session. We would like to clarify that we followed specific procedures to prevent participants from associating this process with the experimental manipulation. The re-familiarisation with the thermal stimuli was conducted after the electrode had been applied and re-tested to ensure that both stimulus modalities were re-introduced in a consistent and neutral context. Participants were explicitly informed that both procedures were standard checks prior to the actual test session (“We will check both once again before we begin the actual measurement.”). For the thermal stimuli, we informed participants that they would experience three different intensities to allow the skin to acclimate (e.g., “...we will test the heat stimuli in 3 trials with different temperatures, allowing your skin to acclimate to the stimuli. …”), without implying any connection to the experimental conditions.

      Importantly, this re-familiarization procedure mirrored what participants had already experienced during the initial calibration session on day 1. We therefore assume that participants interpreted it as a routine technical step rather than part of the experimental manipulation. We have now clarified this procedure in the methods section of the revised manuscript.

      (5) "For a comparison of pain intensity ratings between time-points, an ANOVA with the within-subject factors Condition (placebo, nocebo, control) and Session (day 1, day 8) was carried out. For the comparison of placebo and nocebo effects between the two test days, an ANOVA with the with-subject factors Effect (placebo effect, nocebo effect) and Session (day 1, day 8) was used."

      It seems that one ANOVA is looking at raw pain scores and one is looking at difference scores, but this is a bit confusing - please rephrase/clarify this, and explain why it is useful to include both.

      We thank the reviewer for highlighting this point. Our primary analyses focus on placebo and nocebo effects, which we define as the difference in pain intensity ratings between the control and the placebo condition (placebo effect) and the nocebo and the control condition (nocebo effect), respectively.

      To examine whether condition effects were present at each time-point, we first conducted two separate repeated measures ANOVAs - one for day 1 and one for day 8 - with the within-subject factor CONDITION (placebo, nocebo, control).

      To compare the magnitude and persistence of placebo and nocebo effects over time, we then calculated the above-mentioned difference scores and submitted these to a second ANOVA with within-subject factors EFFECT (placebo vs. nocebo effect) and SESSION (day 1 vs. day 8). We have now clarified this approach on page 19 of the revised manuscript. To avoid confusion, the Condition x Session ANOVA has been removed from the manuscript.
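
      The difference scores described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function name `effect_scores` and the example ratings are hypothetical, and the only thing taken from the response is the sign convention (placebo effect = control minus placebo, nocebo effect = nocebo minus control), under which both effects are positive when the manipulation works as intended.

```python
from statistics import mean


def effect_scores(placebo, control, nocebo):
    """Return (placebo_effect, nocebo_effect) for one participant in one session.

    Each argument is a list of VAS pain ratings (0-100) for that condition.
    Both effects are defined so that a positive value means the expected
    direction: less pain than control for placebo, more pain for nocebo.
    """
    placebo_effect = mean(control) - mean(placebo)
    nocebo_effect = mean(nocebo) - mean(control)
    return placebo_effect, nocebo_effect


# Hypothetical ratings for a single participant (not data from the study).
day1 = effect_scores(placebo=[30, 34], control=[50, 54], nocebo=[70, 78])
print(day1)
```

      Computing one pair of scores per participant and session yields the values that would then enter the EFFECT x SESSION ANOVA.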

      (6) Please can the authors provide a figure illustrating trial-by-trial ratings during test trials as well as during conditioning trials?

      In response to the reviewer’s point, we now provide the trial-by-trial ratings of the test phases on days 1 and 8 as an additional figure in the Supplement (Figure S1) and would like to clarify that trial-by-trial pain intensity ratings of the conditioning phase are displayed in Figure 2C of the manuscript.

      (7) "Separate multiple linear regression analyses were performed to examine the influence of expectations (GEEE ratings) and experienced effects (VAS ratings) on subsequent placebo and nocebo effects. For day 1, the placebo effect was entered as the dependent variable and the following variables as potential predictors: (i) expected improvement with placebo before conditioning, (ii) placebo effect during conditioning and (iii) the expected improvement with placebo before the test session at day 1"

      The term "placebo effect during conditioning" is a bit confusing - I believe this is just the effect of varying stimulus intensities - please could the authors be more explicit on the terminology they use to describe this? NB changes in pain rating during the conditioning trials do not count as a placebo/nocebo effect, as most of the change in rating will reflect differences in stimulation intensity.

      We agree with the reviewer that the cited paragraph refers to the actual application of lower or higher pain stimuli during the conditioning session, rather than genuinely induced placebo or nocebo effects. We thank the reviewer for this helpful observation and have revised the terminology accordingly, now referring to these as “pain relief during conditioning” and “pain worsening during conditioning”.

      (8) Supplementary materials: "The three temperature levels were perceived as significantly different (VAS ratings; placebo condition: M= 32.90, SD= 16.17; nocebo condition: M= 56.62, SD= 17.09; control condition: M= 80.84, SD= 12.18"

      This suggests that the VAS rating for the control condition was higher than for the nocebo condition. Please could the authors clarify/correct this?

      We thank the reviewer for spotting this error. The values for the control and the nocebo condition had accidentally been swapped. This has now been corrected in the manuscript: control condition: M= 56.62, SD= 17.09; nocebo condition: M= 80.84, SD= 12.18.

      (9) "To predict placebo responses a week later (VAScontrol - VASplacebo at day 8), the same independent variables were entered as for day 1 but with the following additional variables (i) the placebo effect at day 1 and (ii) the expected improvement with placebo before the test session at day 8."

      Here it would be much clearer to say "pain ratings during test trials at day 1".

      We agree with the reviewer and have revised the manuscript as suggested.

      (10) For completeness, please present the pain intensity ratings during conditioning as well as calibration/test trials in the figure.

      Please see our answer to comment (6).

      (11) In Figure 1a, it looks like some participants had rated the control condition as zero by day 8. If so, it's inappropriate to include these participants in the analysis if they are not responding to the stimulus. Were these the participants who were excluded due to pain insensitivity?

      On day 8, the lowest pain intensity ratings observed were VAS 3 in the placebo condition and VAS 2 in the control condition, both from the same participant. All other participants reported minimum values of VAS 11 or higher (all on a scale from 0-100). Thus, no participant provided a pain rating of VAS 0, and all ratings indicated some level of pain perception in response to the stimulus. We did not define an exclusion criterion based on day 8 pain ratings in our preregistration, and we did not observe any technical issues with the stimulation procedure. To avoid post-hoc exclusions and maintain consistency with our preregistered analysis plan, we therefore decided to include all participants in the analysis.

      (12) "Comparison of day 1 and day 8. A direct comparison of placebo and nocebo effects on day 1 and day 8 pain intensity ratings showed a main effect of Effect with a stronger nocebo effect (F(1,97)= 53.93, p< .001, η2= .36) but no main effect of Day (F(1,97)= 2.94, p= .089, η2 = .029). The significant Effect x Session interaction indicated that the placebo effect and the nocebo effect developed differently over time (F(1,97)= 3.98, p= .049, η2 = .039)"

      This is confusing as it talks about a main effect of "day" and then interaction with "session" - are they two different models? The authors need to clarify.

      We thank the reviewer for pointing this out. In our analysis, “Session” is the correct term for the experimental factor, which has two factor levels, “day 1” and “day 8”. This has now been corrected in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) More information on how "size of the effect" in Figures 1b and 2b was calculated is needed; this can be in the legend. If these are differences between control and each condition, then they were reversed for one condition (nocebo?), which is ok - but this should be clearly explained.

      We agree with the reviewer and have now revised the figure legends to improve clarity. The legends now read:

      1b: “Figure 1. Pain intensity ratings and placebo and nocebo effects during calibration and test sessions. (A) Mean pain intensity ratings in the placebo, nocebo and control condition during calibration, and during the test sessions at day 1 and day 8. (B) Placebo effect (control condition - placebo condition, i.e., positive value of difference) and nocebo effect (nocebo condition - control condition, i.e., positive value of difference) on day 1 and day 8. Error bars indicate the standard error of the mean, circles indicate mean ratings of individual participants. ***: p < .001, **: p < .01, n.s.: non-significant.”

      2b: “Figure 2. Mean and trial-by-trial pain intensity ratings, placebo and nocebo effects during conditioning. (A) Mean pain intensity ratings of the placebo, nocebo and control condition during conditioning. (B) Placebo effect (control condition - placebo condition, i.e., positive value of difference) and nocebo effect (nocebo condition - control condition, i.e., positive value of difference) during conditioning. (C) Trial-by-trial pain intensity ratings (with confidence intervals) during conditioning. Error bars indicate the standard error of the mean, circles indicate mean ratings of individual participants. ***: p < .001.”

      (2) In the methods, I was missing a clear understanding of how many trials there were in the conditioning phase, and then how many in the other testing phases. Also, how long did the experiment last in total?

      We apologize that the exact number of trials in the testing phases was not clear in the original manuscript. We now indicate on page 18 of the revised manuscript that we used 10 trials per condition in the test sessions. We have also added information on the duration of each test day (i.e., three hours on day 1 and one hour on day 8) on page 15.

      (3) In expectancy ratings, line 186 - are improvement and worsening expectations different from expected pain relief? It is implied that these are two different constructs - it would be helpful to clarify that.

      We agree that this is indeed confusing and would like to clarify that both refer to the same construct. We used the Generic rating scale for previous treatment experiences, treatment expectations, and treatment effects (GEEE questionnaire, Rief et al. 2021) that discriminates between expected symptom improvement, expected symptom worsening, and expected side effects due to a treatment. We now use the terms “expected pain relief” and “expected pain worsening” throughout the whole manuscript.

      (4) In the last section of the Results, somatosensory amplification comes out of nowhere - and could be better introduced (see point 2 above).

      We agree with the reviewer that introducing the concept of somatosensory amplification and its potential link to placebo/nocebo effects only in the Methods is unhelpful, given that this section appears at the end of the manuscript. We therefore now introduce the relevant publication (Doering et al., 2015) before reporting our findings on this concept.

      (5) In line 169, if the authors want to specify what portion of the variance was explained by expectancy, they could conduct a hierarchical regression, where they first look at R2 without the expectancy entered, and only then enter it to obtain the R2 change.

      We fully agree that hierarchical regression can be a useful approach for isolating the contribution of variables. However, in our case, expectancy was assessed at different time points (e.g., before conditioning and before the test session on day 1), and there was no principled rationale for determining the order in which these different expectancy-related variables should be entered into a hierarchical model.

      That said, in response to the reviewer’s suggestion, we have now conducted hierarchical regression analyses in which all expectancy-related variables were entered together as a single block (see below). These analyses largely confirmed the findings reported so far and are provided here in the response to the reviewers below. Given the exploratory nature of this grouping and the lack of an a priori hierarchy, we feel that the standard multiple regression models remain the most appropriate for addressing our research question because it allows us to evaluate the total contribution of expectancy-related predictors while also examining the individual contribution of each variable within the block. We would therefore prefer to retain these as the primary analyses in the manuscript.

      Results of the hierarchical regression analyses:

      Day 1 - Placebo response: In step 1, we entered the difference in pain intensity ratings between the control and the placebo condition during conditioning as a predictor. In step 2, we added the two variables reflecting expectations (i.e., expected improvement with placebo (i) before conditioning and (ii) before the test session on day 1). This allowed us to assess whether expectation-related variables explained additional variance beyond the effect of conditioning.

      The overall regression model at step 1 was significant, F(1, 102) = 13.42, p < .001, explaining 11.6% of the variance in the dependent variable (R² = .116). Adding the expectancy-related predictors in step 2 did not lead to a significant increase in explained variance, ΔR² = .007, F(2, 100) = 0.384, p = .682. Thus, the conditioning response significantly predicted placebo-related pain reduction on day 1, but additional information on expectations did not account for further variance.

      Day 1 - Nocebo response: The equivalent analysis was run for the nocebo response on day 1. In step 1, the pain intensity difference between the nocebo and the control condition was entered as a predictor before adding the two expectancy ratings (i.e., expected worsening with nocebo (i) before conditioning and (ii) before the test session on day 1).

      In step 1, the regression model was not statistically significant, F(1, 102) = 2.63, p = .108, and explained only 2.5% of the variance in nocebo response (R² = .025). Adding the expectation-related predictors in step 2 slightly increased the explained variance by ΔR² = .027, but this change was also non-significant, F(2, 100) = 1.41, p = .250. The overall variance explained by the full model remained low (R² = .052). These results suggest that neither conditioning nor expectation-related variables reliably predicted nocebo-related pain increases on day 1.

      Day 8 - Placebo response: For the prediction of the placebo effect on day 8, the following variables reflecting perceived effects were entered as predictors in step 1: the difference in pain intensity ratings between the control and the placebo condition (i) during conditioning and (ii) on day 1. In step 2, the variables reflecting expectations were added: the expected improvement with placebo (i) before conditioning, (ii) before the test session on day 1 and (iii) before the test session on day 8.

      In step 1, the model was statistically significant, F(3, 95) = 14.86, p < .001, explaining 23.8% of the variance in the placebo response (R² = .238, Adjusted R² = .222). In step 2, the addition of the expectation-related predictors resulted in a non-significant improvement in model fit, ΔR² = .051, F(3, 92) = 2.21, p = .092. The overall variance explained by the full model increased modestly to 29.0%.

      Day 8 - Nocebo response: For the equivalent analyses of nocebo responses on day 8, the following variables were included in step 1: the difference in pain intensity ratings between the nocebo and the control condition (i) during conditioning and (ii) on day 1. In step 2, we entered the variables reflecting nocebo expectations including expected worsening with nocebo (i) before conditioning, (ii) before the test session on day 1 and (iii) before the test session on day 8. In step 1, the model significantly predicted the day 8 nocebo response, F(3, 95) = 6.04, p = .003, accounting for 11.3% of the variance (R² = .113, Adjusted R² = .094). However, the addition of expectation-related predictors in step 2 resulted in only a negligible and non-significant improvement, ΔR² = .006, F(3, 92) = 0.215, p = .886. The full model explained just 11.9% of the variance (R² = .119).
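
      For readers who want to see how the step 2 tests relate to the R² values, the standard F-change statistic for nested regression models can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `f_change` is hypothetical, and the inputs are the rounded values quoted above, so the result will only approximate the reported statistics.

```python
def f_change(r2_reduced, r2_full, n_added, df_residual):
    """F statistic for the R² increase when n_added predictors join a nested model.

    df_residual is the residual degrees of freedom of the full model,
    i.e. n - (total number of predictors) - 1.
    """
    delta_r2 = r2_full - r2_reduced
    return (delta_r2 / n_added) / ((1.0 - r2_full) / df_residual)


# Day 1 placebo model, using the rounded values from the response:
# step 1 R² = .116, ΔR² = .007 (so full R² = .123), F-change df = (2, 100).
f_day1_placebo = f_change(0.116, 0.116 + 0.007, n_added=2, df_residual=100)
print(round(f_day1_placebo, 2))
```

      With these rounded inputs the statistic comes out at roughly 0.40, close to the reported F(2, 100) = 0.384; a gap of this size is what one would expect from R² values rounded to three decimals.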

      Typos:

      (6) Abstract - 104 heathy xxx (word missing).

      (7) Line 61 - reduce or decrease - I think you meant increase.

      Thank you, we have now corrected both sentences.

      References

      Colloca L, Petrovic P, Wager TD, Ingvar M, Benedetti F. How the number of learning trials affects placebo and nocebo responses. Pain. 2010

      Doering BK, Nestoriuc Y, Barsky AJ, Glaesmer H, Brähler E, Rief W. Is somatosensory amplification a risk factor for an increased report of side effects? Reference data from the German general population. J Psychosom Res. 2015

    1. Kasirzadeh’s account of accumulative risk still relies on threat actors such as cyberattackers to a large extent, whereas our concern is simply about the current path of capitalism. And we think that such risks are unlikely to be existential, but are still extremely serious

      so not so much about a single Superintelligent AI, as society gradually drowning in AI enshittification. it may not be existential to society but it still really sucks

    1. Some design scholars have questioned whether focusing on people and activities is enough to account for what really matters, encouraging designers to consider human values (Friedman, B., & Hendry, D. G. (2019). Value sensitive design: Shaping technology with moral imagination. MIT Press). For example, instead of viewing a pizza delivery app as a way to get pizza faster and more easily, we might view it as a way of supporting the independence of elderly people who do not have the mobility to pick up a pizza on their own. Or, perhaps more darkly, instead of viewing TSA screening at an airport as a way of identifying potential terrorists, we consider it through the value of power, as the screening process had more to do with maintaining political power in times of fear than it did with actually preventing terrorism. This shift in framing can enable designers to better consider the values of design stakeholders through their design process, and identify people they may not have designed for otherwise (e.g., people who are housebound because of injury, or politicians).

      This section got me reflecting on the degree to which human values should be weighed against people and activities. As I see it, people and activities (and systems) should be the main focus whenever one is designing. Shifting the focus to something as subjective as "human values" risks diverting resources that could otherwise go toward a people- and activity-focused design. Overall, I think that encouraging the consideration of subjects like these may end up wasting resources.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study, Ledamoisel et al. examined the evolution of visual and chemical signals in closely related Morpho butterfly species to understand their role in species coexistence. Using an integrative, state-of-the-art approach combining spectrophotometry, visual modeling, and behavioral mate choice experiments, they quantified differences in wing iridescence and assessed its influence on mate preference in allopatry and sympatry. They also performed chemical analyses to determine whether sympatric species exhibit divergent chemical cues that may facilitate species recognition and mate discrimination. The authors found iridescent coloration to be similar in sympatric Morpho species. Furthermore, male mate choice experiments revealed that in sympatry, males fail to discriminate conspecific females based on coloration, reinforcing the idea that visual signal convergence is primarily driven by predation pressure. In contrast, the divergence of chemical signals among sympatric species suggests their potential role in facilitating species recognition and mate discrimination. The authors conclude that interactions between ecological pressures and signal evolution may shape species coexistence.

      Strengths:

      The study is well-designed and integrates multiple methodological approaches to provide a thorough assessment of signal evolution in the studied species. I appreciate the authors' careful consideration of multiple selective pressures and their combined influence on signal divergence and convergence. Additionally, the inclusion of both visual and chemical signals adds an interesting and valuable dimension to the study, enhancing its importance. Beyond butterflies, this research broadens our understanding of multimodal communication and signal evolution in the context of species coexistence.

      Weaknesses:

      (1) The broader significance of the findings needs to be better articulated. While the authors emphasize that comparing adaptive traits in sympatry and allopatry provides insights into selective processes shaping reproductive isolation and coexistence, it is unclear what key conceptual or theoretical questions are being addressed. Are these patterns expected under certain evolutionary scenarios? Have they been empirically demonstrated in other systems? The authors should explicitly state the overarching research question, incorporate some predictions, and better contextualize their findings within the existing literature. If the results challenge or support previous work, that should be highlighted to strengthen the study's importance in a broader context.

      We thank the reviewer for their valuable feedback. We understand that the framing of the results and the discussion may fail to convey the broader significance of our findings. In the first version of the manuscript, we framed our manuscript around the processes shaping reproductive isolation and co-existence in sympatry, but now realize that this question was too broad with regard to our results. We thus strictly focused on outlining the importance of ecological interactions in the evolution of traits in sympatric species. In the revised version of the manuscript, we rewrote the first paragraph of the introduction to introduce context regarding the effect of ecological interactions on trait evolution (lines 43-60). We then explicitly introduce the theoretical question investigated in our paper (i.e. “we investigate how ecological interactions in sympatry can constrain natural and sexual selection shaping trait evolution”, lines 62-63) and our predictions regarding the evolution of traits in sympatry vs. allopatry (lines 74-80). We also added predictions regarding our experiments on Morpho at the end of the introduction (lines 146-157). As a result, the discussion is now better aligned with the introduction, by discussing the putative effect of predation and mate choice on the evolution of wing iridescence in Morpho.

      (2) The motivation for studying visual signals and mate choice in allopatric populations (i.e., at the intraspecific level) is not well articulated, leaving their role in the broader narrative unclear. In particular, the rationale behind experiments 1, 2, and 3 is not well defined, as the authors have not made a strong case for the need for these intraspecific comparisons in the introduction. This issue is further compounded by the authors' primary focus on signal evolution in sympatry throughout both the results and the discussion. For instance, the divergence of iridescence in allopatry is a potentially interesting result. But the authors have not discussed its implications.

      We now clearly state in the introduction our motivation for studying visual signals and mate choice in allopatric populations (lines 74-80, lines 146-157). We argue that intraspecific comparisons help identify whether visual cues can be used in mate recognition between phylogenetically close subspecies, between which visual resemblance is expected to be higher than between closely-related species (tetrad experiment, and experiment 1). As M. h. bristowi and M. h. theodorus have different wing patterns, we also used this comparison to identify the traits involved in male mate preference within a species, testing the importance of iridescent color (experiment 2) or iridescent patterning (experiment 3). The results of those experiments can then be used to assess whether these traits are used in species recognition between sympatric species. See also our answers to recommendations 11 and 15 from reviewer #1.

      Overall, given that the primary conclusions are based on results and analyses in sympatry, the role of allopatric populations in shaping these conclusions needs to be better integrated and justified. Without a stronger link between the comparative framework and the study's key takeaways, the use of allopatric populations feels somewhat peripheral rather than central to the study's aim. Since the primary conclusions remain valid even without the allopatric comparisons, their inclusion requires a clearer rationale.

      To make a stronger case for the use of the allopatric population in our manuscript, we strengthened the justification behind the study of intraspecific allopatric populations vs. interspecific sympatric populations, as the iridescence measurements and the mate choice experiments in allopatric populations can serve as a baseline in studying how species interactions can shape the evolution of traits and mate recognition when compared to sympatric populations. Following your major comment #1, we rewrote the introduction to include a justification for the need to study allopatric vs. sympatric populations (lines 74-80), and also further highlighted the need to study iridescence in sympatric species to fully understand the trait evolution of sympatric species in the discussion (lines 339-343).

      (3) While the authors demonstrate that iridescence is indistinguishable to predators in sympatry, they overstate the role of predation in driving convergence. The present study does not experimentally demonstrate that iridescence in this species has a confusion effect or contributes to evasive mimicry. Alternatively, convergence could result from other selective forces, such as signal efficacy due to environmental conditions, rather than being solely driven by predation.

      We acknowledge that our study does not directly demonstrate that iridescence contributes to evasive mimicry. We toned down the interpretation of the results in the discussion and now state that predation is not the only selective pressure that could have promoted a convergent evolution of iridescence in sympatric species, as iridescence is a trait that could be involved in thermoregulation (lines 346-353) and camouflage (lines 363-369), for example. We made sure to mention that convergence in iridescent signals in sympatry provides only indirect support for the evasive mimicry hypothesis, and that further research is still needed, including direct predation experiments, to show that this convergence is indeed triggered by predation (lines 391-396).

      Reviewer #2 (Public review):

      This study presents an investigation of the visual and chemical properties and mating behaviour in Morpho butterflies, aimed at addressing the nature of divergence between closely related species in sympatry. The study species consists of three subspecies of Morpho helenor (bristowi, theodorus, and helenor), and the congeneric Morpho achilles achilles. The authors postulate that whereas the iridescent blue signals of all (sub)species should function as a predator reduction signal (similar to aposematism) and therefore exhibit convergence, the same signals should indicate divergence if used as a mating signal, particularly in sympatric populations. They also assess chemical profiles among the species to assess the potential utility of scent in mediating species/sex discrimination.

      The authors first used reflectance spectrometry to calculate hue, brightness, and chroma, plus two measures of "iridescence" (perhaps better phrased as angular dependence) in each (sub)species. This indicated the ubiquitous presence of sexual dimorphism in brightness (males brighter), which also appears to be the case for iridescence (Figure 3A-B). Analysis of these data also indicated that whereas there is evidence for divergence among subspecies in allopatry, the same evidence is lacking for species in sympatry (P = 0.084). This was supported further by visual modelling, which showed that both conspecifics and birds should be (theoretically) capable of perceiving the colour difference among allopatric populations of M. helenor, whereas the same is not true for the sympatric species.

      The authors then conducted mate choice trials, first using live individuals and second using female dummies. The live experiments indicated the presence of assortative mating among the two subspecies of M. helenor (bristowi and theodorus). The dummy presentations indicated (a) bristowi males prefer conspecific wings, whereas theodorus have no preference, (b) bristowi males prefer the con(sub)specific colour pattern, (c) theodorus prefer the con(sub)specific iridescence when the pattern is manipulated to be similar among female dummies. A fourth experiment, using sympatric M. achilles and M. helenor, indicated no preference for conspecific female dummies. Finally, chemical analysis indicated substantial differences between these two species in putative pheromone compounds, and especially so in the males.

      The authors conclude that the similarity of iridescence among species in sympatry is suggestive of convergence upon a common anti-predation signal. Despite some behavioural evidence in favour of colour (iridescence)-based mate discrimination, chemical differences between M. achilles and M. helenor are posed as more likely to function for species isolation than visual differences.

      Overall, I enjoyed reading this manuscript, which presents a valiant attempt at studying visual, chemical and behavioural divergence in this iconic group of butterflies.

      Major comments

      My only major comment concerns the authors' favoured explanation for aposematism (or evasive mimicry) for convergence among species, which is based upon the you-can't-catch-me hypothesis first presented by Young 1971. Although there is supporting work showing that iridescent-like stimuli are more difficult to precisely localize by a range of viewers, most of the evidence as applied to the Morpho system is circumstantial, and I'm not certain that there is widespread acceptance of this hypothesis. Given that the present study deals with closely-related (sub)species, one alternative explanation - a "null" hypothesis of sorts - is for a lack of divergence (from a common starting point) as opposed to evolutionary convergence per se. In other words, two subspecies are likely to retain ancestral character states unless there is selection that causes them to diverge. I feel that the manuscript would benefit from a discussion of this alternative, if not others. Signalling to predators could very well be involved in constraining the extent of convergence, but this seems a little premature to state as an up-front conclusion of this work. There is also the result of a *dorsal* wing manipulation by Vieira-Silva et al. 2024 which seems difficult to reconcile in light of this explanation. Whereas this paper is cited by the authors, a more nuanced discussion of their experimental results would seem appropriate here.

      We thank the reviewer for their constructive comments on our manuscript. We appreciate the reviewer’s concern regarding the way iridescence convergence between sympatric species is discussed in our manuscript, which aligns with similar concerns raised by Reviewer 1. Indeed, the you-can't-catch-me hypothesis has not yet been empirically tested in Morpho; it is currently a working hypothesis supported only by indirect lines of evidence.

      Among the 30 known Morpho species, iridescence is most likely the ancestral character, notably because iridescence is a trait shared by a majority of Morpho (we now mention this in the introduction lines 108-110). In this paper, we thus did not aim to identify the evolutionary forces involved in the appearance of iridescence in this group, but rather wanted to understand to what extent ecological interactions can impact the diversification (or not) of this trait. As such, the dorsal manipulations performed in Vieira-Silva et al 2024, which show that iridescence in Morpho may have an effect similar to that of crypsis, do not affect our working hypothesis. Instead, we use Vieira-Silva et al 2024 to discuss the potential anti-predator effect of iridescence, which could promote convergent evolution of iridescent patterns.

      In the main text, we now clearly mention our null hypothesis: under a scenario of neutral evolution of iridescence, we would expect the divergence in wing coloration between two M. helenor subspecies to be lower than between two different Morpho species (M. helenor and M. achilles). We show that our results sharply differ from this null expectation.

      We then improved the discussion by adding alternative hypotheses potentially explaining the convergent iridescent signal detected in sympatric species: we discussed the expected effect under neutral evolution (lines 339-343), but also added alternative hypotheses regarding the diversification of iridescence due to camouflage (lines 363-369), predator evasion (lines 373-377) and thermoregulation (lines 346-353).

      Reviewer #3 (Public review):

      The authors investigated differences in iridescence wing colouration of allopatric (geographically separated) and sympatric (coexisting) Morpho butterfly (sub)species. Their aim was to assess if iridescence wing colouration of Morpho (sub)species converged or diverged depending on coexistence and if iridescence wing colouration was involved in mating behaviour and reproductive isolation. The authors hypothesize that iridescence wing colouration of different (sub)species should converge in sympatry and diverge in allopatry. In sympatry, iridescence wing colouration can act as an effective antipredator defence with shared benefits if multiple (sub)species share the same colouration. However, shared wing colouration can have potential costs in terms of reproductive interference since wing colouration is often involved in mate recognition. If the benefits of a shared antipredator defence outweigh the costs of reproductive interference, iridescence wing colouration will show convergence and alternative mate recognition strategies might evolve, such as chemical mate recognition. In allopatry, iridescence wing colouration is expected to diverge due to adaptation to different local conditions and no alternative mate recognition is expected.

      Strengths:

      (1) Using allopatric and sympatric (sub)species that are closely related is a powerful way to test evolutionary hypotheses

      (2) By clearly defining iridescence and measuring colour spectra from a variety of angles, applying different methods, a very comprehensive dataset of iridescence wing colouration is achieved.

      (3) By experimentally manipulating wing coloration patterns, the authors show visual mate recognition for M. h. bristowi and could, in theory, separate different visual aspects of colouration (patterns VS iridescence strength).

      (4) Measurements of chemical profiles to investigate alternative mate recognition strategies in case of convergence of visual signals.

      Weaknesses:

      In my opinion, studies should be judged on the methods and data included, and not on additional measurements that could have been taken or additional treatments/species that should be included, since in most ecological and evolutionary studies, more measurements or treatments/species can always be included. However, studies do need to ensure appropriate replication and appropriate measurements to test their hypothesis AND support their conclusions. The current study failed to ensure appropriate replication, and in various cases, the results do not support the conclusions.

      First, when using allopatric and sympatric (sub)species pairs to test evolutionary hypotheses, replication is important. Ideally, multiple allopatric and sympatric (sub)species pairs are compared to avoid outlier (sub)species or pairs that lead to biased conclusions. Unfortunately, the current study compares 1 allopatric and 1 sympatric (sub)species pair, hence having poor (no) replication on the level of allopatric and sympatric (sub)species pairs.

      We would like to thank the reviewer for their constructive feedback. We agree that replication is important to test evolutionary hypotheses and that our study lacks replication for allopatric and sympatric Morpho populations. Ideally, one would require several allopatric and sympatric replicates to draw conclusions about the effect of species interaction on trait evolution. Our study is a preliminary attempt at answering this question, covering a few Morpho populations but proposing a broad assessment of iridescence and mate preference for those populations. We clearly mentioned in the discussion that investigating multiple populations is needed to test whether the trend we observed in this paper can be generalized (lines 388-392).

      Second, chemical profiles were only measured for sympatric species and not for allopatric (sub)species, which limits the interpretation of this data. The allopatric (sub)species could have been measured as non-coexistence "control". If coexistence and convergence in wing colouration drives the evolution of alternative mate recognition signals, such alternative signals should not evolve/diverge for allopatric (sub)species where wing colouration is still a reliable mate recognition cue. More importantly, no details are provided on the quantification of butterfly chemical profiles, which is essential to understand such data. It is unclear how the chemical profiles were quantified and what data (concentrations, ratios, proportions) were used to perform NMDS and generate Figure 5 and the associated statistical tests.

      We recognize that having the chemical profiles of the genitalia of the Morpho from the allopatric populations would have made a stronger case in favor of reinforcement acting on the divergence of the chemical compounds found on the genitalia of the sympatric Morpho species. Due to limited access to the biological material needed at the time of the chromatography, we could not test for lower divergence in the chemical profiles of allopatric Morpho butterflies. We made sure to mention this limitation in the discussion (lines 457-461). 

      We already stated in the methods that we compiled the area under the peak of each component found in the chromatograms of our samples and that we performed all the statistical analyses on this dataset. To make it clearer, we mention in the new version of the manuscript that the area under the peak of each component provides a measure of the concentration of that component (in the methods lines 720, 723, 733). We also added details to the legend of Figure 5.
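
      To illustrate how a table of peak areas can be prepared for an ordination such as NMDS, the sketch below converts peak areas into relative proportions and computes a pairwise dissimilarity between two samples. The compound names, the peak-area values, and the choice of Bray-Curtis dissimilarity are all assumptions made for illustration; the response does not state which transformation or distance the authors used.

```python
def relative_proportions(peak_areas):
    """Convert raw peak areas (compound -> area) into proportions summing to 1."""
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return {compound: area / total for compound, area in peak_areas.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two profiles (0 = identical, 1 = disjoint)."""
    compounds = set(profile_a) | set(profile_b)
    numerator = sum(
        abs(profile_a.get(c, 0.0) - profile_b.get(c, 0.0)) for c in compounds
    )
    denominator = sum(
        profile_a.get(c, 0.0) + profile_b.get(c, 0.0) for c in compounds
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical peak areas for two samples (not data from the study).
sample_1 = {"compound_A": 1200.0, "compound_B": 300.0, "compound_C": 500.0}
sample_2 = {"compound_A": 100.0, "compound_B": 900.0, "compound_D": 1000.0}

p1 = relative_proportions(sample_1)
p2 = relative_proportions(sample_2)
print(f"Bray-Curtis dissimilarity: {bray_curtis(p1, p2):.3f}")
```

      A matrix of such pairwise dissimilarities across all samples is the usual input to an NMDS routine. Working on proportions rather than raw areas removes differences in total extract quantity, which is one common reason reviewers ask which of the two was used.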

      Third, throughout the discussion, the authors mention that their results support natural selection by predators on iridescent wing colouration, without measuring natural selection by predators or any other measure related to predation. It is unclear which predators prey on any of the butterfly species at this point.

      We made sure to mention in the introduction (lines 132-136) and in the discussion (lines 373-377) that previous predation experiments performed on Morpho and other butterflies showed evidence that birds are likely predators for these species. These observations led us to test for the putative effect of predation on the evolution of their color pattern, without directly testing predation rates. We made sure this information is transparent in the revised manuscript, and now specify that assessing wing convergence is only an indirect way of testing the escape mimicry hypothesis (lines 393-396).

      To continue on the interpretation of the data related to selection on specific traits by specific selection agents: This study did not measure any form of selection or any selection agent. Hence, it is not known if iridescent wing colouration is actually under selection by predators and/or mates, if maybe other selection agents are involved or if these traits converge due to genetic correlations with other traits under selection. For example, iridescent colouration in ground beetles functions as an antipredator defence but also in thermo- and water regulation. None of these issues are recognized or discussed.

      The lack of discussion of alternative selective pressures involved in the evolution of iridescence was pointed out by all reviewers. We thus modified the text to account for this comment, and no longer limit our discussion to the putative effects of predation. We now specifically discuss alternative hypotheses, including crypsis (lines 362-369) and thermoregulation (lines 346-353).

      Finally, some of the results are weakly supported by statistics or questionable methodology.

      Most notably, the perception of the iridescence coloration of allopatric subspecies by bird visual systems. Although for females, means and errors (not indicated what exactly, SD, SE or CI) are clearly above the 1 JND line, for males, means are only slightly above this line and errors or CIs clearly overlap with the 1 JND line. Since there is no additional statistical support, higher means but overlap of SD, SE or CI with the baseline provides weak statistical support for differences.

      We thank the reviewer for raising interpretation issues concerning the chromatic distances of allopatric Morpho species measured with a bird vision model. We made sure to be nuanced in the description of this graph in the results section (lines 208-212). Note that this addition does not change our main conclusion stating that Morpho and predator visual models better discriminate iridescence differences between allopatric subspecies than between sympatric species.

      We now also clearly mention in the figure’s legend that the error bars represent the confidence intervals obtained from a bootstrap analysis; the nature of the error bars was already described in the methods (line 580).
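
      The reviewer's point about error bars overlapping the 1 JND line can be made concrete with a percentile bootstrap. The sketch below is a generic illustration rather than the authors' procedure: the distance values are hypothetical, and resampling the mean of per-pair distances is an assumption, since the response only says that a bootstrap analysis was performed.

```python
import random


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def clearly_discriminable(values, threshold=1.0, **kwargs):
    """True only if the whole confidence interval lies above the JND threshold."""
    lower, _ = bootstrap_ci(values, **kwargs)
    return lower > threshold


# Hypothetical chromatic distances in JND units (not data from the study).
female_distances = [1.8, 2.1, 1.6, 2.4, 1.9, 2.2, 1.7, 2.0]
male_distances = [1.1, 0.8, 1.3, 0.9, 1.2, 1.0, 1.4, 0.7]

for label, distances in [("females", female_distances), ("males", male_distances)]:
    low, high = bootstrap_ci(distances)
    verdict = "above" if clearly_discriminable(distances) else "overlapping"
    print(f"{label}: CI [{low:.2f}, {high:.2f}] is {verdict} the 1 JND threshold")
```

      Under this criterion a mean slightly above 1 JND does not count as support for discrimination unless the lower bound of the interval also clears the threshold, which is the distinction the reviewer draws between the female and male results.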

      Regarding the assortative mating experiment, the results are clearly driven by M. bristowi. For M. theodorus, females mate equally often with conspecifics (6 times) as with M. bristowi (5 times). For males, the ratio is slightly better (6 vs 3), but with such low numbers, I doubt this is statistically testable. Overall low mating for M. bristowi could indicate suboptimal experimental conditions, and hence results should be interpreted with care.

      We recognize that the tetrad experiment results are mainly driven by M. bristowi’s behavior, as already mentioned in the results (lines 231-232), but we now also mention it in the discussion (lines 401-402). This experiment would have benefited from more replicates, but the limited access to live males and virgin females for both subspecies was a limiting factor. Fisher’s exact test, used to assess assortative mating, is specifically appropriate for small sample sizes. We recognize that the sample size is not ideal; however, the data are still statistically testable.
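
      For readers unfamiliar with the test, the sketch below shows a two-sided Fisher's exact test on a 2x2 table of mating counts, written with the standard library only. The counts are hypothetical, because the full tetrad table is not given in this response, and the layout (rows as female subspecies, columns as male subspecies) is an assumption for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = table_probability(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        table_probability(x)
        for x in range(smallest, largest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )


# Hypothetical mating counts (not data from the study):
# rows = female subspecies, columns = male subspecies.
same_1, cross_1 = 9, 2   # females of subspecies 1: same-subspecies vs. other males
cross_2, same_2 = 4, 7   # females of subspecies 2: other vs. same-subspecies males

p_value = fisher_exact_two_sided(same_1, cross_1, cross_2, same_2)
print(f"two-sided p = {p_value:.4f}")
```

      Because the p-value is computed from the exact hypergeometric distribution rather than a large-sample approximation, the test remains valid with counts this small, although its power to detect a real preference is correspondingly limited.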

      Regarding the wing manipulation experiment, M. theodorus does not show a preference when dummies with non-modified wings are presented and prefers non-modified dummies over modified dummies. This is acknowledged by the authors but not further discussed. Certainly, some control treatment for wing modification could have been added.

      The use of controls to account for the effect of wing modification and the odor of the permanent marker was already mentioned in the methods (lines 636-639). Following your recommendation and comments from the other reviewers, we now mention the use of this control in the results (lines 278-283). We also address a potential issue that could have resulted in the rejection of these modified dummies by live males: we cannot be sure whether butterflies perceive these modifications as equivalent to natural coloration (lines 281-282). An additional control could have been used, adding black ink on the black dorsal parts of the pattern to assess its potential visual effect. The constraints on sampling unfortunately did not allow us to add another treatment.

      Overall, the fact that certain measurements only provide evidence for 1 of the 2 (sub)species (assortative mating, wing manipulation) or one sex of one of the species (bird visual systems) means overall interpretation and overgeneralization of the results to both allopatric or sympatric species should be done with care, and such nuances should ideally be discussed.

      The aim of the authors, "to investigate the antagonistic effects of selective pressures generated by mate recognition and shared predation" has not been achieved, and the conclusions regarding this aim are not supported by the results. Nevertheless, the iridescence colour measurements are solid, and some of the behavioural experiments and chemical profile measurements seem to yield interesting results. The study would benefit from less overinterpretation of the results in the framework of predation and more careful consideration of methodological difficulties, statistical insecurities, and nuances in the results.

      Overall, we would like to thank all reviewers for their thorough assessment of our work. We understand that the imbalance between mate choice data, visual model data and chemical data only gives us a partial assessment of species recognition in Morpho butterflies, thus requiring more precision in the interpretation and the discussion of our results. We made sure to add balanced interpretations in our discussion, by mentioning the lack of replicates for allopatric and sympatric populations (lines 391-392), and the lack of chemical characterization of allopatric species (lines 458-461, see previous comments) and by being more transparent on methodological limitations that we failed to convey in the first version of our manuscript. We brought nuance to our discussion and also discussed alternative hypotheses to predation to explain the convergence of iridescence found in sympatry.

      Reviewing Editor Comments:

      While all reviewers acknowledge the value of your data, they converge in their recommendations to tone down the evolutionary interpretations. Ideally, to test your main hypothesis, you would need several species pairs, or if only one, as in your case, replicated sympatric and allopatric sites for both species. Furthermore, your more specific hypotheses about convergence (vs. nondivergence), response to predators (vs. other environmental variables), and avoiding interspecific mating in sympatry (vs. not avoiding it in allopatry) would require appropriate alternative treatments/controls. We therefore recommend that you focus on those statements that you can support with your experiments and data, and introduce these statements in the introduction with reference to the appropriate literature.

      Reviewer #1 (Recommendations for the authors):

      (1) Line 25: This stated aim seems a bit off. The authors did not sensu stricto quantify 'how shared adaptive traits may shape genetic divergence' in this study. I suggest rewriting or deleting this whole sentence altogether. The study's aim is already clear in lines 29-34.

      We deleted the mention of the characterization of genetic divergence, since this study did not focus on any genetic analysis.

      (2) Line 34: The authors here state that they compared allopatric vs sympatric populations. This is strictly not true for M. achilles. Further, the results after this sentence focus solely on divergence/convergence in sympatry, with nothing at the intraspecific level or on the implications of the findings.

      We now mention that we tested allopatric vs. sympatric species of M. helenor only (lines 28-29). We also mention that the behavioral experiments were based on intraspecific comparisons, and discuss the implications of this result in the discussion.

      (3) Line 35: 'convergence driven by predation': this is a strong statement and cannot be directly inferred from the present set of experiments. Consider toning it down.

      We added nuance to this statement by rephrasing it as “suggesting that predation may favor local resemblance” (lines 32-33).

      (4) Line 36: Replace 'behavioral results' with 'behavioral experiments' or something similar.

      Corrected

      (5) Line 45-49: These opening statements need some citations.

      We provided references for the first few lines, by citing terHorst et al 2018 (line 44) underlining the importance of species interactions in trait evolution, and Blomberg et al 2003 (line 45) showing that closely-related species tend to resemble each other by quantifying the phylogenetic signal of various traits.

      (6) Line 83, 165: 'visual effect', not sure what the authors are referring to. Please rewrite.

      We defined “visual effect” as the way wing color patterns could be perceived by predators or mates. We removed mentions of “visual effect” and directly used its definition instead.

      (7) Line 105 onwards: This section of the introduction could benefit from more concise writing. The authors might consider reducing the number of specific examples and instead offering broader general statements, supported by citations from multiple studies.

      We reduced the number of examples given in this paragraph and used general statements supported by multiple citations as examples. (lines 102-119).

      (8) Line 108-110: This sentence seems to be redundant with the previous one.

      We merged this sentence with the previous one to improve clarity. (lines 103-105)

      (9) Line 140: 'with chemical defenses': include citations here.

      We added citations of Joron et al 1999 and Merrill et al 2014, which document the evolution of convergent wing patterns (mimicry) in butterfly species with chemical defenses.

      (10) Line 149: This is a bit of a stretch. Note that genetic divergence could be influenced by many other things, not only the processes that the authors examined.

      We agree with the reviewer that the study of the convergent vs. divergent evolution of visual cues is not enough to fully understand the mechanisms allowing genetic divergence between species. Because this paper does not focus on characterizing genetic divergence, we removed it from the manuscript to avoid oversimplification.

      (11) Line 151: Again. Here, the authors' primary focus seems to be at an interspecific level. One is left to wonder about the need for comparisons at the intraspecific level in M. helenor and the implications. Please clarify.

      At the end of the introduction (lines 146-157), we specifically highlighted the importance of intraspecific comparisons. While studying the effect of sympatry on the evolution of the iridescent color pattern, we use this intraspecific comparison as a baseline to account for convergence or divergence of iridescence in a sympatric interspecific pair of Morpho, because under neutral evolution two subspecies are expected to be more similar than two different species (this assumption has been clarified at lines 147-148). We also used intraspecific mate choice to test for the use of visual cues in mate recognition (experiment 1) and to test what type of signal could be perceived by Morphos (the iridescent coloration or the iridescent pattern, experiments 2 and 3). These results help contextualize the interspecific mate choice, focused on determining whether visual cues could also be used in species recognition. Since we show that iridescent coloration is important in mate recognition at the intraspecific scale, it helps explain why species recognition is low at the interspecific scale, given the wing color convergence between M. helenor and M. achilles.

      (12) Line 154: 'signals on mate preferences'.

      Corrected.

      (13) Line 189: 'At the intraspecific level', maybe in the brackets include 'allopatric populations' just so the results are in a similar format as in the color contrast section below.

      We added details to make clearer that the intraspecific level is studied between allopatric Morpho populations (line 189).

      (14) Line 189-192: Please rearrange the figure (current B as A and vice versa) or present the results in order as in the figure (interspecific first and then intraspecific level).

      We rearranged Figure 3 so that the intraspecific comparison (allopatric population) appears as A and the interspecific level (sympatric population) appears as B, to follow the order of presentation in the main text.

      (15) Line 232: The motivation behind experiments 1, 2, and 3 is unclear. The authors have not made a strong point in the introduction about the need for these comparisons at an intraspecific level. Given that the authors are focused on divergence/convergence at an interspecific level, this set of experiments seems to be irrelevant to the present study. The implications of these findings are also not discussed.

      We added motivation to the use of experiment 1, 2, and 3 in the introduction (lines 151-154) by stating that those experiments were used to assess whether blue color could indeed be used as a mating cue in Morpho helenor (experiment 1) and to try to understand what part of the visual signal is important in mate choice in Morpho helenor: the wing pattern (experiment 2) or the iridescent coloration (experiment 3). Although motivation for these experiments was not detailed in our manuscript, we already discussed the implications of the results of experiments 1, 2 and 3 in the discussion by stating that visual cues can take many forms and that considering both color AND pattern is important in understanding visual cues (lines 408-416). We carefully reworked this new version to make it more straightforward.

      (16) Line 260: Insert 'wild-type' before model to ensure similar wording as in the previous section.

      Corrected.

      (17) Line 286: Insert 'sympatric' after mimetic.

      Corrected.

      (18) Line 307: Include a reference to the figures or table where these results are presented.

      We now mention in the main text that the different proportions of beta-ocimene found between male M. helenor and M. achilles are shown in Table S2.

      (19) Line 343: These inferences are speculative. Add a line here, something like 'although this warrants further research in this species'.

      We detailed what additional experiments are needed in lines 388-396.

      (20) Line 357: The authors have not discussed their results on iridescence divergence in allopatric populations (line 190) and its implications.

      We now made clear in the beginning of the discussion that the divergence of iridescence in allopatric populations is used as a baseline to test for convergent iridescence between species (lines 339-343).

      (21) Line 361 onwards: This first paragraph is a bit confusing, as the results mainly focus on allopatry, while the title refers to sympatry.

      To avoid confusion between the title and the content of the discussion, we divided the last part of the discussion into two different parts. As the first paragraph mainly focuses on allopatry, we isolated it and titled it “Iridescent color patterns can be used as mate recognition cues in M. helenor” (line 498). The next paragraph of the discussion, focusing on the sympatric Morpho populations, has been titled “Evolution of visual and olfactory cues in mimetic sister-species living in sympatry” (line 418).

      (22) Line 383: visual cues 'as' poor species.

      Corrected.

      (23) Line 405: Why females here and not males? This is again confusing since the authors tested for male mate choice in the main experiments. Some background information on sex-specific mate choice in the methods might help.

      In this specific sentence, we discuss performing mate choice experiments to test for the discrimination of olfactory cues by females (and not males) because we found a high divergence in the chemical compounds found on male genitalia. Although female chemical compounds could also be used as a cue by males in mate recognition, olfactory mate choice is often driven by female choice in butterflies. We recognize that this perspective does not line up with the mate choice presented in our results section, which focused on male mate choice based on visual cues for ecological reasons (Morpho males, but not females, tend to be attracted to bright blue colorations) and technical reasons (in cages, females tend to hide away from the males or male dummies, and this behavior is not compatible with experiments involving flying around false males). In the discussion, we made sure to specify that the perspective we cite here is about testing the implications of divergence in male olfactory cues (line 454). We also added our motivation for investigating male (and not female) mate choice based on visual cues in the methods (lines 613-618) and in the results (lines 219-223).

      (24) Line 417: This inference is speculative. Consider toning it down.

      We rewrote the sentence: “We find evidence of converging iridescent patterns in sympatry suggesting that predation could play a major role in the evolution of iridescence. Further work is nevertheless needed to directly test this hypothesis and establish the importance of evasive mimicry in Morpho” (lines 465-468).

      (25) Line 429: 'Convergent trait evolution leads to mutualistic interactions enhancing coexistence'. Careful here. It is not very evident how convergent trait evolution (iridescence) is mutualistic in this case, as there is no experimental evidence for evasive mimicry yet. Consider rewording or toning this sentence down.

      We agree with the reviewer and removed this statement, only keeping the end of the sentence: “Altogether, this study addresses how convergence in one trait as a result of biotic interactions may alter selection on traits in other sensory modalities, resulting in a complex mosaic of biodiversity.” (lines 479-481).

      (26) Line 442: Since the samples come from a breeding farm, I have a few questions. How are the authors sure about the location where the specimens were collected? How long have they been kept in captivity? Have they been subjected to any artificial selection? More details are needed here.

      Since M. helenor bristowi and M. helenor theodorus are only found in the wild in West and East Ecuador respectively, those M. helenor subspecies can only be collected in those two allopatric populations. Their phenotype is directly linked to their geographic distribution, which is how we confirmed their collection locations. The M. h. theodorus we used in this study were caught in East Ecuador in Tena, and the M. h. bristowi were caught in West Ecuador in Pedro Vicente Maldonado. We received pupae from the breeding farm, meaning that the Morpho used for the experiments were raised in captivity since their date of emergence. Upon emergence, they were transferred into cages for 4 to 5 days to wait for sexual maturity before performing the tetrad and mate choice experiments. This information was added to the methods (lines 490-496).

      (27) Line 476: Include some citations supporting this statement.

      We now cite Bennett and Théry (2007), reviewing avian color vision, and Briscoe (2008), characterizing the sensitivity of the photoreceptors found in the eyes of butterflies. Both citations show that the 300-700nm range is seen by avian and butterfly visual systems.

      (28) Line 480 onwards: Please clarify if the analysis used only one value (mean?) per species, sex, angle of measurement, and locality or included data from multiple individuals.

      The analyses of both colorimetric variables and global iridescence were performed using iridescence data from multiple individuals (10 males and 10 females from M. h. bristowi, M. h. theodorus, M. h. helenor and M. a. achilles), for which we measured iridescence at 21 angles of illumination. Sample sizes are mentioned at lines 507, 515, and 540-542.

      (29) Line 510: Is there a specific reason that authors did not investigate achromatic contrasts? Provide some justification here. Or include the results of achromatic contrasts in the supplement.

      We added the achromatic results in the supplement and in the results (lines 200-204). For both the avian visual model and the Morpho visual model, the confidence intervals always overlapped with the JND threshold, showing that neither birds nor butterflies could theoretically discriminate the wing reflectance brightness in allopatric and sympatric populations.

      (30) Line 552 onwards: I may have missed it. It is not entirely clear why the authors focused on male mate choice rather than female preference for visual cues. The authors should explicitly justify this choice and cite previous studies demonstrating that male mate choice, rather than female preference, is important in this species. This should be stated in the results section as well.

      We added a paragraph in the method (lines 613-618) to describe the ecological and technical reasons leading to testing only male mate choice using visual cues (also see our response to recommendation #23).

      (31) Line 537 onwards: What was the criterion used to score that mating had occurred? Why first mating and not how long they were mating? Please add these details.

      We stopped the experiment as soon as a male/female pair was formed by joining their genitalia (we added this information in the methods, lines 599-600). Since the tetrad experiment involves the interaction of two males and two females from different subspecies, we considered that mate choice happens before the formation of any couple and does not necessarily depend on how long the pair mates. For instance, we witnessed avoidance behaviors from females that systematically hid their genitalia and refused to join their abdomen to some males, while being very ‘open’ to others (but we did not quantify this).

      (32) Line 571: The authors used a black permanent marker to modify wing patterns but did not validate whether butterflies perceive these modifications as equivalent to natural coloration. It is possible that the alterations introduced unintended visual cues and may explain why most males rejected the dummies (line 267). The authors should acknowledge this limitation here.

      We now acknowledge this limitation in the method (lines 638-639) and in the results section (lines 278-283).

      (33) Line 591: Insert 'above' after protocol.

      Corrected.

      (34) Line 605: If the authors included random effects in their model, then it should be generalized linear mixed model (GLMM) and not GLM as they wrote.

      We indeed included a random effect in our model accounting for male ID and trial number, we thus replaced “GLM” by “GLMM” in the manuscript.

      (35) Line 615: This set of analyses does not seem to account for pseudo-replication, as the data were recorded from the same male more than once (Line 583). Please clarify and redo the analysis with the GLMM framework

      We ran new analyses using the GLMM framework: we used a binomial GLMM to test whether individuals preferentially interacted with dummy 1 vs. dummy 2 while accounting for pseudoreplication. The previously detected tendencies hold true with these new analyses, except for the visual mate discrimination of M. achilles: we now find statistical evidence that M. achilles tends to approach its conspecifics more often during the mate choice experiment, although the signal is weak (lines 297-307). Indeed, while we previously concluded that both species in sympatry (M. helenor and M. achilles) could not discriminate their conspecific mates, we now emphasize that M. achilles is somewhat sensitive to some visual signals. However, its estimated probability of approaching a conspecific is only 0.54, which is low compared to the estimated probability of approaching (0.61) or touching (0.84) a con-subspecific for M. bristowi. We thus concluded that even though some visual cues could be relevant for mate recognition, they are less reliable for male choice in sympatric populations, where color patterns are more convergent, than in allopatric populations. We updated Figure 4 and Figures S8 and S9, which now show the probability of approaching or touching a conspecific or con-subspecific with the updated p-values retrieved from the GLMM analyses. We also updated the results (lines 297-307) and the discussion (lines 430-438) to bring nuance to our previous results.
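      As a reading aid for the probabilities quoted above, the short sketch below converts between the probability scale and the log-odds scale on which a binomial model is usually fitted. It assumes a logit link, which the response does not state, and it ignores the random effects for male ID and trial number. The function names and dictionary labels are illustrative and are not taken from the authors' analysis code.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta: float) -> float:
    """Back-transform a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Probabilities quoted in the response; 0.5 means no preference between
# the two dummies. The logit link is an assumption made for illustration.
reported = {
    "M. achilles, approach conspecific": 0.54,
    "M. bristowi, approach con-subspecific": 0.61,
    "M. bristowi, touch con-subspecific": 0.84,
}

for label, p in reported.items():
    eta = logit(p)
    # A log-odds of 0 corresponds to p = 0.5, i.e. no discrimination.
    print(f"{label}: p = {p:.2f}, log-odds = {eta:+.3f}")
```

      On this scale a value close to zero indicates a weak preference, which is one way to read why an estimated probability of 0.54 is described as a weak signal.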

      (36) Line 963: Figure 3D. Is there a particular reason for comparing allopatric populations only within Ecuador rather than between Ecuador and French Guiana for M. helenor? Please clarify.

      We aimed to compare the putative discrimination of blue coloration using visual models vs. what the butterflies actually discriminate using mate choice experiments. Since we only performed mate choice experiments involving M. h. bristowi x M. h. theodorus (allopatric populations within Ecuador) and M. h. helenor x M. a. achilles (sympatric population from Ecuador), we only looked at those comparisons using visual models. We added this clarification (lines 559-560).

      (37) Line 980: Are these predicted probabilities or just mean proportions as written in line 614? Then the label should be changed to 'Proportion of approaches' or something similar.

      Following our answer to recommendation #35, the points now represent the probability of touching a conspecific in the graph for each male, for every trial of every male tested. We corrected the legend of the figure. 

      Reviewer #2 (Recommendations for the authors):

      (1) Line 25: "...therefore facilitating co-existence in sympatry".

      Corrected.

      (2) Line 28: "contrasting" instead of contrasted.

      Corrected.

      (3) Line 33: begin a new sentence at the colon.

      Corrected.

      (4) Line 49: the phrase "habitat filtering" is unclear and should perhaps be defined or qualified.

      We replaced “habitat filtering” with its definition and cited Keddy (1992), which describes community assembly rules and defines habitat filtering (line 46).

      (5) Line 52: remove "even".

      Corrected.

      (6) Line 53: divergent suites may also result because traits are often constrained by genetic architecture (multivariate genetic covariances). This is discussed at length and specifically in relation to ornamental coloration by Kemp et al. 2023.

      We rewrote the introduction and focused on only reviewing the ecological interactions promoting trait divergence in sympatric species, and did not mention genetics in this paper.

      (7) Line 87: (and throughout) refer to "colouration" or "colour pattern" rather than "colourations".

      Corrected.

      (8) Line 151: Remove "To do so,".

      Corrected.

      (9) Line 191: I would like to see the degrees of freedom for this test.

      We added the F-statistic=2.09 and the degrees of freedom df=1 of this test, and for all the following tests.

      (10) Line 201: (and throughout) replace "on" with "of".

      Corrected.

      (11) Line 205: modelling the visual properties of the wings allows one to infer what is theoretically visible/distinguishable. The modelling is useful but not necessarily definitive of vision/behaviour per se under different conditions in the wild. I therefore think it is appropriate to phrase the wording around the modelling approach more carefully. Perhaps refer to "theoretical" or "inferred" discriminability, or state (e.g.) that species should/should not be capable of perceiving differences based on the modelling data. You do this well in your wording of lines 207-209. This need not apply in the discussion because you're then dealing with the combination of modelling results and behaviour (mating trials).

      We agree with the reviewer that visual modelling only allows one to infer what is theoretically discriminated by the butterflies, and that the wording of our sentence is confusing. We therefore modified the sentence to reflect these clarifications: “Morpho butterflies and predators can theoretically visually perceive the difference in the blue coloration between different subspecies of M. helenor… using both bird and Morpho visual models” (lines 206-209).

      (12) Line 222: Either the chi-square test or Fisher's exact test should be sufficient (why report both?)

      The Chi-square test relies on large-sample assumptions (expected counts>5), whereas Fisher’s exact test does not and is valid even with small or unbalanced sample sizes. Since the M. bristowi female/M. h. theodorus male pairing only occurred 3 times, we do not meet the primary assumptions to apply a Chi-square test, although it is significant. We used a Fisher’s test to confirm the results. Using both and finding that both tests are significant shows that the results are robust, although they may appear redundant. To simplify, we removed the results of the Chi-square test and only kept the Fisher’s test in the methodology and the results.
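      The distinction drawn above can be illustrated with a minimal sketch that computes the expected cell counts used to check the Chi-square assumption and a two-sided Fisher's exact p-value for a 2x2 table, using only the Python standard library. The counts in `example` are made up for illustration and are not the study's data; the function names are likewise hypothetical.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * col / n for col in cols] for r in rows]


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    (a, b), (c, d) = table
    r1, r2 = a + b, c + d
    c1 = a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical pairing counts (rows: female type, columns: male type).
example = [[8, 3], [2, 7]]
print("expected counts:", expected_counts(example))
print("any expected count below 5:",
      any(e < 5 for row in expected_counts(example) for e in row))
print("Fisher two-sided p:", fisher_exact_two_sided(example))
```

      Because the exact test enumerates tables directly rather than relying on an asymptotic distribution, it has no degrees of freedom to report.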

      (13) Line 224 (and throughout): Degrees of freedom should be provided for statistical tests.

      We reported the statistic value and the degrees of freedom for all mentions of the statistical tests in the main text, except for the Fisher test, which is an exact test and does not rely on an asymptotic distribution such as the Chi-squared distribution.

      (14) Lines 266-267: This sentence has interest, but it is rather vague at present. Wouldn't your controls account for the effect of manipulation? This could be explained further.

      During our mate choice experiments, all Morpho female dummies used for the experiments were painted with black markers, either on their dorsal blue band to modify their blue iridescent phenotype, or on their ventral side, thus controlling for the effect of manipulation. However, we cannot rule out that the modification of the dorsal blue iridescence could have had a “repulsive” effect for males for several reasons. For example, depending on the visual discrimination of darker colors by Morphos, the painted black band could have a slightly different color compared to the dark “brown” usually surrounding their blue iridescent patterns. We now explain this in the results (lines 278-283) and in the methodology (lines 638-639).

      (15) Line 316: I'm not certain that the similarity is best described as "striking", given a P-value of 0.084 for this contrast.

      We agree with the reviewer and removed this adjective for this line.

      (16) Lines 387-390: This sentence is puzzling because, theoretically speaking, we should expect selection on visual preference to be heightened (not relaxed) in sympatry if colouration is included among the traits used in mate selection. I'm not certain I have understood the meaning here.

      We would like to thank the reviewer for pointing out this typo. If shared predatory pressures favor convergent evolution of color pattern, then the visual signals become less reliable for species recognition. As a result, sexual selection on visual preference is heightened and becomes stronger, favoring the evolution of alternative cues used to discriminate conspecific mates. We changed the sentence and now write “the convergent evolution of iridescent wing patterns… may have negatively impacted visual discrimination and favored the evolution of divergent olfactory cues” (lines 457-458).

      (17) Line 529: Mating experiments. Given that these are quite large butterflies, I wondered whether a 3x3x2m cage would be sufficient in size to allow the expression of male courtship. A brief description of the courtship behaviour in these species or Morphos generally would be a useful addition to the paper.

      A cage this size was enough for the males to express a flight behavior similar to what can be seen in nature, while also being able to see the females (live females or dummies). We tried to perform mate experiments in a larger cage (7m x 5m x 3m), but the trials were not conclusive because males did not find the dummies, depending on where they were flying in the cage. A 3m x 3m x 2m cage is a good compromise, maximizing interactions while still allowing enough space to fly. We now describe Morpho male behavior and female behavior in the methods (lines 613-618).

      (18) Line 546: Why are both tests needed (chi-square AND Fisher's exact)?

      As in our answer to recommendation #12, we used both tests to show robustness in the statistical results. We only kept the Fisher’s test results to simplify the results.

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This valuable study investigates the role of HIF1a signalling in epicardial activation and neonatal heart regeneration in mice. Through a combination of genetic and pharmacological approaches, the authors show that stabilization of HIF1a enhances epicardial activation and extends the regenerative capacity of the heart beyond the typical neonatal window following myocardial infarction (MI). However, several aspects of the study remain incomplete and would benefit from further clarification and additional experimental support to solidify the conclusions.

      We reveal herein prolonged epicardial activation following myocardial infarction (MI) beyond post-natal days 1-7 (P1-P7) by genetic or pharmacological stabilisation of HIF-signalling. This extends the so-called “regenerative window” during an adult-like response to injury, leading to enhanced survived myocardium and functional improvement of the heart, even against a backdrop of persistent, albeit reduced, fibrosis. The epicardium is known to enhance cardiomyocyte proliferation and myocardial growth during heart development via trophic growth factor (for example, IGF-1, FGF, VEGF, TGFβ and BMP) signalling (reviewed in PMID:29592950) and epicardium-derived cell-conditioned medium reduces infarct size and improves heart function (PMID: 21505261). Further experiments, outside of the scope of the current study, are required to determine whether activated neonatal epicardium elicits similar paracrine support to sustain the myocardium and heart function after injury beyond P7 into adulthood.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Gamen et al. analyzed the functional role of HIF signaling in the epicardium, providing evidence that stabilization of the hypoxia signaling pathway might contribute to neonatal heart regeneration. By generating different conditional mouse mutants and performing pharmacological interventions, the authors demonstrate that stabilizing HIF signaling enhances cardiac regeneration after MI in P7 neonatal hearts.

      Strengths:

      The study presents convincing genetic and pharmacological approaches to the role of hypoxia signaling in enhancing the regenerative potential of the epicardium.

      Weaknesses:

      The major weakness is the lack of convincing evidence demonstrating the role of hypoxia signaling in EMT modulation in epicardial cells. Additionally, novel experimental approaches should be performed to allow for the translation of these findings to the clinical arena.

      We respectfully disagree that we have not convincingly demonstrated a role for HIF-signalling in promoting epicardial EMT. We adopt epicardial explant assays utilising a well characterised ex vivo protocol previously described for studying EMT in embryonic, neonatal and adult epicardium (PMID: 27023710, PMID: 12297106; PMID: 17108969, PMID: 19235142). These assays demonstrate in WT1<sup>CreERT2</sup>;Phd2<sup>fl/fl</sup> explants enhanced cobblestone to spindle-like change in cell morphology, increased cell migration, appearance of stress fibres and an up-regulation of the mesenchymal marker alpha-smooth muscle actin (αSMA); all parameters associated with EMT. In addition, our in vivo analyses of Wt1<sup>CreERT2</sup>;Phd2<sup>fl/fl</sup> hearts, in response to neonatal injury, reveal elevated numbers of WT1+ epicardial cells within the sub-epicardial region and underlying myocardium as is associated with active EMT and subsequent migration from the epicardium.

      Reviewer #2 (Public review):

      Summary:

      In this study, Gamen et al. investigated the roles of hypoxia and HIF1a signaling in regulating epicardial function during cardiac development and neonatal heart regeneration. They found that WT1<sup>+</sup> epicardial cells become hypoxic and begin expressing HIF1a from mid-gestation onward. During development, epicardial HIF1a signaling regulates WT1 expression and promotes coronary vasculature formation. In the postnatal heart, genetic and pharmacological upregulation of HIF1a sustained epicardial activation and improved regenerative outcomes.

      Strengths:

      HIF1a signaling was manipulated in an epicardium-specific manner using appropriate genetic tools.

      Weaknesses:

      There appears to be a discrepancy between some of the conclusions and the provided histological data. Additionally, the study does not offer mechanistic insight into the functional recovery observed.

      We respectfully disagree with the comment that our histological data does not support our conclusions and expand on this in the response to specific reviewer comments. We agree that further mechanistic experiments outside of the scope of the current study are required to identify precisely how activated neonatal epicardium results in increased healthy myocardium after injury beyond post-natal day 7 (P7).

      Reviewer #3 (Public review):

      Summary:

      The authors' research here was to understand the role of hypoxia and hypoxia-induced transcription factor Hif-1a in the epicardium. The authors noted that hypoxia was prevalent in the embryonic heart, and this persisted into neonatal stages until postnatal day 7 (P7). Hypoxic regions in the heart were noted in the outer layer of the heart, and expression of Hif-1a coincided with the epicardial gene WT1. It has been documented that at P7, the mouse heart cannot regenerate after myocardial infarction, and the authors speculated that the change in epicardial hypoxic conditions could play a role in regeneration. The authors then used genetic and pharmacological tools to increase the activity of Hif genes in the heart and noted that there was a significant improvement in cardiac function when Hif-1a was active in the epicardium. The authors speculated that the presence of Hif-1a improved cell survival.

      Strengths:

      A focus on hypoxia and its effects on the epicardium in development and after myocardial infarction. This study outlines the potential to extend the regenerative time window in neonatal mammalian hearts.

      We thank the reviewer for this positive endorsement and recognition of the importance of mechanistic insight into how to extend the window of neonatal heart regeneration.

      Weaknesses:

      While the observations of improved cardiac function are clear, the exact mechanism of how increased Hif-1a activity causes these effects is not completely revealed. The authors mention improved myocardium survival, but do not include studies to demonstrate this.

      We report an increase in healthy myocardium arising from prolonged activation of the epicardium during the neonatal window and following injury at post-natal day 7 (P7). We speculate this recapitulates the role of the epicardium during heart development which is known to be a source of trophic growth factors that can enhance myocardial growth. Further experiments are required, out-of-scope of this study, to define a mechanistic link between HIF-signalling, epicardial activation and myocardial survival in the setting of prolonged neonatal heart regeneration.

      There is an indication that fibrosis is decreased in hearts where Hif activity is prolonged, but there are no studies to link hypoxia and fibrosis.

      We believe the decreased fibrosis is a natural consequence of the increase in survived myocardium arising from the activated epicardium. There is strong precedent here following injury at post-natal day 1 (P1) in which fibrosis is evident early-on but is resolved over time with growth of the myocardium in the regenerating heart (PMID: 23248315).

      Recommendations for the authors:

      Reviewing Editor Comments:

      (1) Address issues related to image quality, colocalization, sample labeling, appropriate controls, and quantification - particularly in Figures 1, 2, 6, and Supplementary Figure 9. Increase sample size as noted by reviewers.

      The issues of co-localisation and sample labelling have been addressed under response to reviewers. We are unable to increase sample numbers but have clarified the number of regions per section and numbers of sections per heart analysed where appropriate.

      (2) Clarify the effects of epicardial HIF1a activation on neovascularization.

      We have removed reference in the abstract to an effect on neovascularisation.

      (3) Extend assessments of epicardial hypoxia and HIF1a expression to earlier embryonic stages, when epicardial EMT is more active.

      Our earliest timepoint of E12.5 marks the onset of epicardial EMT and E13.5 is the stage with the most significant mobilisation of epicardium-derived cells (EPDCs) into the sub-epicardial region and underlying myocardium (PMID: 32359445). In the same study, E11.5 lineage tracing of epicardial cells is restricted to the outer layer of the heart; thus, our timepoints are representative in capturing both the onset and progression of in vivo EMT.

      (4) Strengthen EMT assays and mechanistic modeling. Provide evidence from physiologically relevant models, as current 2D culture assays do not adequately support conclusions about EMT. Include additional EMT markers and quantification where appropriate.

      We respectfully disagree that epicardial explants are not a valid assay for assessing EMT. As noted under responses to reviewers, such primary explants have been widely described elsewhere (PMID: 27023710, PMID: 12297106, PMID: 17108969, PMID: 19235142) and enable documentation of multiple parameters that are associated with active EMT, including an assessment of the extent of cell migration, cobblestone (epithelial) to spindle-like (mesenchymal) cell morphologies, stress fibre formation and expression of alpha-smooth muscle actin as a mesenchymal marker. We support our findings in explants in two ways. First, we reveal reduced WT1+ epicardium-derived cells (EPDCs) in the sub-epicardial region and underlying myocardium of Wt1<sup>CreERT2/+</sup>;Hif1a<sup>fl/fl</sup> embryonic hearts (data in Figure 2), indicative of impaired epicardial EMT and migration of EPDCs. Second, in vivo following neonatal MI with pharmacological inhibition of PHD2, we observe the reciprocal phenotype of increased numbers of epicardium-derived cells emerging from the outer epicardial layer (data in Figure 6).

      (5) Strengthen mechanistic insights into the role of epicardial cells in the functional recovery observed in MI hearts.

      We agree that further experiments are required, out-of-scope of this study, to define a mechanistic link between HIF-signalling, epicardial activation and myocardial survival in the setting of prolonged neonatal heart regeneration.

      Reviewer #1 (Recommendations for the authors):

      The manuscript by Gamen et al. analyzed the functional role of HIF signaling in the epicardium, providing evidence that stabilization of the hypoxia signaling pathway might contribute to neonatal heart regeneration. By generating different conditional mouse mutants and performing pharmacological interventions, the authors demonstrate that stabilizing HIF signaling enhances cardiac regeneration after MI in P7 neonatal hearts. The study is potentially interesting, but it presents several major caveats.

      (1) One of the critical points reported in the early stages of this study is the early co-localization of Wt1, the hypoxic reporter (HP1), and HIF signaling pathway master regulators (i.e., HIF1a and HIF1b) during embryonic development. Figure 1 is meant to report such findings. However, unfortunately, I hardly see any co-localization at all in the Wt1+ epicardial cells for HP1, although some colocalization is seen for HIF1 and 2 alpha. None of these data are quantified. Thus, it is hard to believe such co-localization.

      We respectfully disagree with this comment. We highlight cells in Figure 1 that are co-stained for WT1 and HP1. In addition, we identify HIF-1α and HIF-2α positive cells which reside either within the epicardium, as the outer cell layer, or within the underlying sub-epicardial region, respectively.

      (2) The authors claimed that they have analyzed the expression of the hypoxic report, as well as Wt1 and the HIF signaling pathways master regulators (i.e., HIF1a and HIF1b) in the AV groove, as compared to the apex, in embryonic heart ranging from E12.5 to E18.5 (Figure 1). Unfortunately, all images provided that are tagged as AV groove are rather misleading. They do not represent the AV groove but part of the right ventricular free wall. If the authors want to refer to the AV groove, AV cushions should be visible underneath.

      We have removed specific reference to the AV groove and refer to the highlighted regions as the “Base” of the heart.

      (3) The authors analyzed the hypoxic condition of the developing heart from E12.5 to E18.5. However, it remains unclear why the authors only explored the hypoxic conditions from E12.5 onwards, since epicardial EMT mainly occurs earlier than this time point, i.e., E10.5 onwards. Therefore, it would be needed to explore it already at this earlier time point.

      We respectfully disagree with the reviewer and refer to the comment above regarding the fact that E12.5 marks the onset of epicardial EMT and E13.5 is the stage with the most significant mobilisation of epicardium-derived cells (EPDCs) into the sub-epicardial region and underlying myocardium (PMID: 32359445).

      (4) The authors reported a conditional mouse model of HIF1alpha deletion by using the Wt1CreERT2 driver. Curiously, Wt1 is dependent on hypoxia signaling (i.e., HIF1a). Therefore, it is unclear whether there is a negative feedback loop between the deletion of Hif1alpha and the activation of the Cre driver might have functional consequences. Convincing evidence should be provided that such crosstalk does not interfere with Hif1alpha inactivation, and therefore, appropriate controls should be run in parallel.

      We discount a negative feedback loop in this instance based on the fact that we have utilised heterozygous mice for the Wt1<sup>CreERT2/+</sup> line and observe a consistent and reproducible phenotype for the developing hearts on a Wt1<sup>CreERT2/+</sup>;Hif1a<sup>fl/fl</sup> background and following injury in Wt1<sup>CreERT2/+</sup>;Phd2<sup>fl/fl</sup> mice. Collectively this indicates that the WT1-CreERT2 driver is active in the context of diminishing HIF-1α and Phd2, respectively. In addition, we have carried out parallel experiments using epicardial explants derived from R26R-CreERT2;Phd2<sup>fl/fl</sup> (Figure 3) to circumvent any potential confounding issues; the results are consistent with increased epicardial EMT, in support of our overall hypothesis.

      (5) On Figure 2a-f the authors reported that epicardial cells are diminished in Wt1CreERT2Hif1alpha mice as compared to controls. I am very sorry, but I do not see any difference. Furthermore, it is unclear to me how the authors quantified such differences, i.e., what marker signal did they use and how it was performed (Figure 2c and d)?

      We respectfully disagree with the reviewer and draw attention to the single channel panels of WT1+ staining in Figure 2, which show clear differences between numbers of epicardial cells in the mutant mice compared to controls (comparing magenta cells in panels a) versus b). Quantification was carried out for numbers of WT1+ cells residing within the PDPN-positive epicardium (and underlying PDPN-negative myocardium) across multiple images from multiple sections and multiple hearts.

      (6) On Figure 2g, the authors reported differences in total vessel length. Are they referring to impaired microvasculature development? Or is this analysis also including major coronary vessels? What about the major coronary vessels and trees, is there any affection?

      This analysis refers to the microvasculature and not the major coronary arteries or coronary trees.

      (7) The authors reported that there might be some differences in EMT markers, but unfortunately, all of them are analyzed on 2D cultures, where no substrate for EMT is present, i.e., an underlying ECM bed. Thus, the authors cannot claim that EMT is altered. Additional experiments using either collagen substrate and/or Matrigel are required to fully demonstrate that EMT is impaired. Furthermore, quantitative analyses of such differences should be provided.

      The 2D cultures are epicardial explants from mutant versus wild type hearts and represent a widely adopted previously published ex-vivo assay for investigating epicardial EMT across embryonic to adult stages (PMID: 27023710, PMID: 12297106; PMID: 17108969, PMID: 19235142); including an assessment of the extent of migration and cobblestone (epithelial) to spindle-like (mesenchymal) cell morphologies, stress fibre formation and expression of alpha-smooth muscle actin as a mesenchymal marker. We do not understand the comment regarding an “underlying ECM bed” as the cells exhibit EMT routinely on tissue culture plastic and will deposit their own ECM during the culture time course and in response to EMT/cell migration. In terms of quantification this was carried out for scratch assay experiments, as a proxy for EMT and emergent mesenchymal cell migration, as presented in Figure 3i, j with significant enhanced scratch closure and cell migration following Molidustat treatment.

      (8) The description of data provided on Supplementary Figure 5 is spurious and should be removed. A note in the discussion might be sufficient.

      We respectfully disagree. The ChIP-seq data, in what is now Figure 2-figure supplement 3, highlights a HIF-1α binding site within the Wt1 locus, suggesting putative upstream regulation of WT1 by HIF-1α. This provides a potential explanation as to how HIF-1α may activate the epicardium through up-regulation of Wt1/WT1.

      (9) On Figure 3, the authors further illustrate the change of EMT markers using ex vivo cardiac explants. They reported increased expression of Snai2 that, although statistically significant, is most likely of no biological relevance (increase of only 20% at transcript level). What about Snai1, Prrx1, and other EMT promoters? Are they also induced? As previously stated, these 2D cultures do not provide supporting evidence that EMT is occurring, thus 3D gel assays should be performed in which Z-axis analyses will provide evidence on the different migratory behaviour of those cells.

      We respectfully suggest that a 20% change in Snai2 expression is biologically meaningful with respect to EMT. This in turn is supported by associated cell migration, reduced ZO-1 expression, increased stress fibres and increased alpha-SMA as a mesenchymal marker; all properties associated with active EMT. Other suggested markers have not been validated as formally required for EMT, for example Snai1 (PMID: 23097346). The migratory capacity of targeted versus epicardial cells was assessed by combined explant and scratch assay experiments.

      (10) The description of single-cell analyses is very incomplete. Which mice were used for these analyses, wildtype control, or hypoxic mice? Please provide a clearer description of the samples used. Additionally, the entire rationale of these analyses is dubious. Doing single-cell analyses to analyze a couple or three markers in a very small cell population is rather ridiculous. qPCR might be far more appropriate and convincing, or a bulk RNAseq analysis of isolated epicardial cells.

      The single-cell analyses represent an unbiased assessment of different pathways in epicardial cells (identified bioinformatically) between intact P1 and P7 stages in wild type (control) hearts, with a focus on hypoxia-related gene expression and HIF-dependent pathways. They were not designed to analyse a small number of genes, but rather global differences in the hypoxic states between P1 and P7 hearts. Selected genes (Vegfa, Pdk3, Egln1 (Phd2)) were analysed to highlight the key differences in hypoxic signalling across the regenerative window. The fact that the hearts were uninjured/intact is clarified in the text and legends for Figure 4 and what is now Figure 4-figure supplement 1.

      (11) The analyses provided in Figure 5 are very interesting and their findings are very relevant. However, I would think that the complementary experimental approach should also be done, i.e, MI followed by activation with tamoxifen, since that situation would be more realistic in the clinical setting.

      Tamoxifen causes respiratory failure in neonates with MI, so the two cannot be combined at the same time or soon after surgery. Moreover, tamoxifen takes significant time to take effect on targeted gene down-regulation which may negate sufficient activation of the epicardium following injury.

      The experiments in Figure 5 were designed to demonstrate that prolonged heart regeneration could be elicited in a cell-specific (epicardial-specific) manner via a genetic approach. The pharmacological experiments in Figure 6 are complementary in this regard by demonstrating equivalent effects with drug (Molidustat) delivery to reduce PHD2 and stabilise HIF post-MI.

      (12) In Figure 6, expression of Wt1 is highly prominent in P7 controls, mainly restricted to the epicardial lining while in the experimental setting, such Wt1 expression is broadly distributed on the subepicardial space, nicely demonstrating epicardial activation. However, it is very surprising to see such Wt1 expression in controls, something that is not expected, as compared to the data reported in Figure 4g. Could the authors please reconcile these findings?

      Figure 6 represents the injury setting and Figure 4g the intact setting (as clarified above, in the text and revised figure legends). Hence in the latter WT1 expression is significantly reduced in the P7 heart, as anticipated. With injury at P7 we anticipate activation of WT1 in control hearts, albeit restricted to the epicardial layer (as occurs in adult hearts, PMID: 21505261). In contrast, following Molidustat-treatment of P7 hearts post-MI we observe extensive epicardial expansion into the sub-epicardial region and EPDC migration into the underlying myocardium (Figure 6b).

      Reviewer #2 (Recommendations for the authors):

      The role of hypoxia and HIF1a signaling in epicardial activation is an important topic, and the genetic approaches employed in this study are appropriate. However, several aspects of the study remain unclear and would benefit from further clarification or explanation by the authors:

      (1) The authors detected hypoxic regions using an anti-pimonidazole fluorescence-conjugated monoclonal antibody (HP1). The data would become more compelling if negative and positive controls were provided.

      We believe the HP1 staining is compelling in the images shown and is consistent with hypoxic regions of the developing heart. We reveal HP1 staining at cellular resolution with neighbouring cells positive and negative for the HP1 signal in the apex of the heart and within the epicardium and sub-epicardial regions at E12.5 (Figure 1a) and diminished/altered hypoxic/HP1 regional signal through subsequent developmental stages at E14.5-18.5 (Figure 1a-d).

      (2) Many HIF1a-positive cells in the AV groove region do not appear to overlap with HP1 staining (Figure 1a). Providing a low-magnification image of HIF1α expression would be helpful to better assess the extent of overlap with HP1 staining

      HIF-1 is highly unstable and hence detection of HIF-1+ cells will likely only sample of cells compared to HP1 which is a surrogate for broader regions of hypoxia.

      (3) Although the authors conclude that epicardial HIF1a deletion results in a significant reduction of WT1⁺ cells in both the epicardium and myocardium (Figure 2a-d), the provided images are not sufficiently clear to fully support this interpretation. Providing additional evidence to support this conclusion would be helpful.

      We respectfully disagree with the reviewer and draw attention to the single channel panels of WT1+ staining which show clear differences between numbers of epicardial cells in the mutant mice compared to controls (Figure 2a versus 2b; magenta WT1+ staining).

      (4) Similar to the point raised above, the authors' conclusion regarding the increased expression of WT1 following Molidustat treatment does not appear to be fully supported by the provided images (Figure 6b-f). Immunofluorescence staining for WT1 does not clearly demonstrate epicardial expression in the remote zone of either the control or Molidustat-treated hearts. In addition, while an increase of WT1<sup>+</sup> cells is observed in the infarct zone of the Molidustat-treated heart, it is somewhat unexpected that such expansion is not evident in the corresponding region of the control heart, given that epicardial cells typically expand near the infarct area. Clarification on these points would be helpful.

      Figure 6b reveals WT1 expression in controls (upper panel set) that is reactivated proximal to the infarct region (WT1 is not otherwise expressed in adult epicardium) but remains restricted to the epicardial layer (as occurs in injured adult mouse hearts, PMID: 21505261). This contrasts with what is observed in the Molidustat-treated P7 hearts post-MI, where we observe epicardial expansion and migration of WT1+ cells into the underlying myocardium (Figure 6b, lower panel set, infarct zone).

      (5) The authors conclude that WT1<sup>+</sup> cells in the myocardial tissue exhibit endothelial identity based on the colocalization of WT1 and EMCN signals (Supplementary Figure 9c). However, this interpretation is difficult to assess, as WT1 is a nuclear marker and EMCN is a membrane protein, which makes precise colocalization challenging to confirm with confidence. Additional supporting evidence may be necessary to substantiate this conclusion.

      WT1 is known to be up regulated in endothelial cells in response to injury as shown previously in several studies (for example, PMID: 25681586). Here we show clear co-localisation of nuclear WT1 and cytoplasmic Endomucin (EMCN) in what is now Figure 6- figure supplement 1c and would encourage the reviewer and readers to magnify the image by zooming-in on the relevant co-stained panel.

      (6) The authors conclude that activation of epicardial HIF1a signaling has no effect on neovascularization in postnatal MI hearts (Figure 5c). However, the abstract states: "Finally, a combination of genetic and pharmacological stabilisation of HIF ... increased vascularisation, augmented infarct resolution and preserved function beyond the 7-day regenerative window" (Lines 38-41). Clarification regarding this apparent discrepancy would be appreciated.

      The abstract has been altered to remove the statement of increased vascularisation.

      (7) The study appears somewhat incomplete, as it lacks mechanistic insight into the functional recovery observed following epicardial Phd2 deletion and Molidustat treatment in postnatal MI hearts. Although the authors suggest a potential paracrine role of the epicardium in protecting cardiomyocytes from apoptosis, this hypothesis has not been experimentally addressed. Incorporating such analysis would help to reinforce the study's conclusions.

      Further experiments are required, which are out-of-scope of this study, to define a mechanistic link between the genetic or pharmacological stabilisation of HIF-signalling, epicardial activation and myocardial survival in the setting of prolonged neonatal heart regeneration.

      Other points:

      (1) Providing single-channel images for Figures 1a-d and 6g would be helpful for clarity and interpretation.

      We believe the combined channel views of co-staining for two markers on a background of DAPI staining to pin-point cell nuclei, are informative and support our conclusions.

      (2) Have the authors considered using AngioTool to quantify the number of vessels in Figure 5b-c?

      AngioTool™ was used to quantify the vessels, as we have used previously (PMID: 33462113), and this is now added to the methods and legend of Figure 2.

      Reviewer #3 (Recommendations for the authors):

      There are several areas where the manuscript can be improved, such that its conclusions can be solidified.

      (1) The authors highlight a point where blocking Phd2 can enhance survival of cardiac tissue, but did not report on survival markers. They surmised that apoptosis could be decreased in Phd2 mutant or Molidustat treatment but did not show this. The authors should determine if apoptosis is decreased in the myocardium and epicardium.

      We show evidence of increased levels of healthy myocardium in the genetic and pharmacological models of stabilised HIF-signalling. We exclude increased cardiac hypertrophy or increased cardiomyocyte proliferation as causative, so suggest as a reasonable alternative enhanced survival, albeit this need not necessarily be via an apoptotic pathway given the incidence of necrotic cell death during MI. We are unable to generate new surgeries and mutant/treated heart samples to analyse for apoptotic markers at this stage.

      (2) There appears to be no difference in cardiomyocyte proliferation in Molidustat-treated animals, but the experiment was only performed on 2 to 3 animals. This is too small a sample size to conclude from these results. The authors should increase the sample size to make this assertion.

      We respectfully disagree that we are unable to conclude no effect on cardiomyocyte proliferation. We analysed multiple heart regions per section, for EdU+/cTnT+ colocalised signals across several sections per heart, set against a consistency of effect on other parameters in hearts treated with Molidustat. We are unable to generate more P7 heart surgeries +/- Molidustat and +/- EdU at this stage.

      (3) It is curious as to how, after myocardial infarction, the fibrotic scar tissue is decreased in the Phd2 deletion but not as profoundly in Molidustat-treated mice at d21. Can the authors speculate why the difference exists and how this decrease arises? For example, are there decreased pro-inflammatory signals in Phd2 deleted mice? Is there decreased collagen deposition and ECM gene expression? Does macrophage recruitment into the infarct zone differ between mutant/treated vs WT?

      The representative images in Figure 6k reveal a trend towards reduced fibrosis with Molidustat treatment (Figure 6l), but across all hearts analysed this was not as significant as observed in the injured hearts with epicardial-specific deletion (Figure 5g, h). This may be due to the relatively short half-life of Molidustat (approximately 4-10 hours, PMID: 32248614), the dosing regimen for the drug and/or the fact that it was not specifically delivered/targeted to the epicardium.

      (4) The magnified images in Figure 1 do not match the boxes in the whole heart images. It is unclear what the white boxes signify.

      The white boxes have been removed from Figure 1. The magnified image panels are from serial heart sections and this is now clarified in the Figure 1 legend.

    1. However, most societies do not value creative thinking and so our skills in generating ideas rapidly atrophy, as we do not practice them, and instead actively learn to suppress them (Csikszentmihalyi, M. (2014). Society, culture, and person: A systems view of creativity. Springer Netherlands). That time you said something creative and your mother called you weird? You learned to stop being creative. That time you painted something in elementary school and your classmate called it ugly? You learned to stop taking creative risks. That time you offered an idea in a class project and everyone ignored it? You must not be creative. Add up all of these little moments and where most people end up in life is possessing a strong disbelief in their ability to generate ideas

      I agree with the idea that our society actively works to suppress creativity. This affirms my perspective that we often prioritize getting the right answers rather than thinking creatively in order to get a range of answers for a question. I think this is because we, as humans, inherently think of things in black and white. If something isn't the "right" or "correct" idea, it is simply wrong. In reality, these answers may not be wrong and may just be different. Through my own experiences at school, I've seen how people are quick to shut down the idea generation process to just skip ahead to the solution. Especially with generative AI now, we're outsourcing our thinking. This is harmful because we need to be able to think. If we can't think, we can't create.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Weaknesses: 

      (1) The authors claim that choroidal neovascular tuft phenotypes are similar in Tgfbr1 KO and Tgfbr2 KO mice. However, the phenotypes look more severe in the Tgfbr1 KO than in the Tgfbr2 KO mice. Can the authors show a quantitative comparison of the number of choroidal neovascular tufts per whole eye cross-section in both genotypes?

      Thank you for asking about this. Each VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retina exhibits multiple zones of choroidal neovascularization. The examples in Figure 1 and Figure 1 – Figure supplements 1 and 2 are mostly from retinas with loss of TGFBR1, but we could have chosen similar examples from retinas with loss of TGFBR2. The quantification in the original version of Figure 1 – Figure supplement 1 panel C had a labeling error. It actually showed the quantification of choroidal neovascularization (CNV) in the sum of both VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, not only in VE-cad-CreER;TGFBR1 CKO/- retinas as originally labeled. The point that it made is that CNV is seen with loss of TGF-beta signaling but not in control retinas or retinas with loss of Norrin signaling. We have now updated that plot by separating the data points for VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, so that they can be compared to each other. The result shows ~2.5-fold more CNV in VE-cad-CreER;TGFBR2 CKO/- retinas compared to VE-cad-CreER;TGFBR1 CKO/-. We think it likely that a more extensive sampling would show little or no difference between these two genotypes – but the data is what it is. This is now described in the Results section.

      We have also added a panel D to Figure 1- Figure supplement 1, which shows a retina flatmount analysis of CNV.  This is done by mounting the retina with the photoreceptor side up so that the outer retina can be optimally imaged. 

      (2) In the analysis of Sulfo-NHS-Biotin leakage in the retina to assess blood-retina barrier maturation, the authors claim that there is increased vascular leakage in the TgfbR1 KO mice. However, it does not seem like Sulfo-NHS-biotin is leaking outside the vessels. Therefore, it cannot be increased vascular permeability. Can the authors provide a detailed quantification of the leakage phenotype?

      Thank you for raising this point.  Your comment prompted us to look at this question in greater depth with more experiments.  We have expanded Figure 2 to show and quantify a comparison between control (i.e. phenotypically WT), NdpKO, and TGFBR1 endothelial KO and we have expanded the associated part of the Results section (Figure 2C and D).  In a nutshell, control retinas show little Sulfo-NHS-biotin accumulation in or around the vasculature or in the parenchyma; NdpKO retinas show Sulfo-NHS-biotin accumulation in the vasculature and in the parenchyma (i.e., the area between the vessels); and VEcadCreER;Tgfbr1CKO/- retinas show Sulfo-NHS-biotin accumulation in the vascular tufts with minimal accumulation in the non-tuft vasculature and minimal leakage into the parenchyma.   The conclusion is that the bulk of the retinal vasculature in TGFBR1 endothelial KO mice is minimally or not at all leaky – very different from the situation with loss of Norrin/Frizzled4 signaling.

      (3) The immune cell phenotyping by snRNAseq is premature, as the number of cells is very small. The authors should sort for CD45+ cells and perform single-cell RNA sequencing. 

      Thank you for raising this point. For the revised manuscript, we have performed additional snRNAseq analyses using the same tissue processing protocol as for our original snRNAseq data. We have opted to homogenize the tissue and prepare nuclei (our original method) rather than dissociate the tissue and FACS sort for CD45+ cells, because the nuclear isolation approach is unbiased – we assume that nuclei from all cell types are present after tissue homogenization. By contrast, we cannot be certain that CD45 FACS will capture the full range of immune cells, since some cells may not express CD45, may express CD45 at low levels, or may be tightly adherent to other cells, such as vascular endothelial cells. Additionally, by following the original protocol, we can combine the original snRNAseq dataset and the new snRNAseq dataset. In the revised manuscript we present the snRNAseq data from the combination of the original and the more recent snRNAseq datasets (revised Figure 4; N=628 immune cell nuclei). The new analysis comes to the same conclusions as the original analysis: the immune cell infiltrate in the mutant retinas is composed of a wide variety of immune cells.

      (4) The analysis of BBB leakage phenotype in TgfbR1 KO mice needs to be more detailed and include tracers as well as serum IgG leakage. 

      As described in our response to query 2, we have conducted additional experiments to look at vascular leakage in control, VE-cad-CreER;TGFBR1 CKO/-, and NdpKO retinas. We have also looked at Sulfo-NHS-biotin leakage in the VE-cad-CreER;TGFBR1 CKO/- brain, and it is indistinguishable from WT controls. Since Sulfo-NHS-biotin is a low MW tracer (<1,000 Da), this implies that loss of TGF-beta signaling does not increase non-specific diffusion of either low or high MW molecules. Therefore, the elevated levels of IgG in the brain parenchyma in young VE-cad-CreER;TGFBR1 CKO/- mice (Figure 8A) likely represent specific transport of IgG across the BBB. Such transport is known to occur via Fc receptors expressed on vascular endothelial cells, although it is normally greater in the brain-to-blood direction than in the blood-to-brain direction. For example, see Lafrance-Vanasse et al (2025) Leveraging neonatal Fc receptor (FcRn) to enhance antibody transport across the blood brain barrier. Nat Commun. 16:4143. This is now described in greater detail in the Results section.

      (5) A previous study (Zarkada et al., 2021, Developmental Cell) showed that EC-deletion of Alk5 affects the D tip cells. The phenotypes of those mice look very similar to those shown for Tgfbr1 KO mice. Are D-tip cells lost in these mutants by snRNAseq?

      Please note: Alk5 is another name for TGFBR1.  This is noted in the second sentence of paragraph 4 of the Introduction.  The reviewer is correct: there are a lot of similarities because these are exactly the same KO mice.  Also, Zarkada and we used the same VEcadCreER to recombine the CKO allele.  The proposed snRNAseq analysis would serve as an independent check on the diving (D) tip vs stalk cell analyses published in Zarkada et al (2021) Specialized endothelial tip cells guide neuroretina vascularization and blood-retina-barrier formation. Dev Cell 56:2237-2251.  We have not gone in this direction because the question of tip vs. stalk cells and of subtypes of tip cells in WT vs. mutant retinas is beyond our focus on choroidal neovascularization and the role of immune cells and vascular inflammation.  The proposed snRNAseq analysis would also require a major effort since tip cells are rare and must be harvested from large numbers of early postnatal retinas followed by FACS enrichment for vascular endothelial cells.  Finally, we have no reason to doubt the results of Zarkada et al.

      Reviewer #2 (Public review): 

      Summary:

      The authors meticulously characterized EC-specific Tgfbr1, Tgfbr2, or double knockout in the retina, demonstrating through convincing immunostaining data that loss of TGF-β signaling disrupts retinal angiogenesis and choroidal neovascularization. Compared to other genetic models (Fzd4 KO, Ndp KO, VEGF KO), the Tgfbr1/2 KO retina exhibits the most severe immune cell infiltration. The authors proposed that TGF-β signaling loss triggers vascular inflammation, attracting immune cells - a phenotype specific to CNS vasculature, as non-CNS organs remain unaffected. 

      Strengths: 

      The immunostaining results presented are clear and robust. The authors performed well-controlled analyses against relevant mouse models. snRNA-seq corroborates immune cell leakage in the retina and vascular inflammation in the brain. 

      Weaknesses: 

      The causal link between TGF-β loss, vascular inflammation, and immune infiltration remains unresolved. The authors' model posits that EC-specific TGF-β loss directly causes inflammation, which recruits immune cells. However, an alternative explanation is plausible: Tgfbr1/2 KO-induced developmental defects (e.g., leaky vessels) permit immune extravasation, subsequently triggering inflammation. The observations that vein-specific upregulation of ICAM1 staining and the lack of immune infiltration phenotypes in the non-CNS tissues support the alternative model. Late-stage induction of Tgfbr1/2 KO (avoiding developmental confounders) could clarify TGF-β's role in retinal angiogenesis versus anti-inflammation. 

      Thank you for raising this point.  Your comment prompted us to look at this question in greater depth with more experiments.  We have expanded Figure 2 to show and quantify a comparison between control (i.e. phenotypically WT), NdpKO, and TGFBR1 endothelial KO and we have expanded the associated part of the Results section (Figure 2C and D).  In a nutshell, control retinas show little Sulfo-NHS-biotin accumulation in or around the vasculature or in the parenchyma; NdpKO retinas show Sulfo-NHS-biotin accumulation in the vasculature and in the parenchyma (i.e., the area between the vessels); and VEcadCreER;Tgfbr1CKO/- retinas show Sulfo-NHS-biotin accumulation in the vascular tufts with minimal accumulation in the non-tuft vasculature and minimal leakage into the parenchyma.   The conclusion is that the bulk of the retinal vasculature in TGFBR1 endothelial KO mice is minimally or not at all leaky – very different from the situation with loss of Norrin/Frizzled4 signaling.

      In the revised manuscript, we have expanded the Discussion section to address the two alternative hypotheses raised by the reviewer.  Here are the relevant data in a nutshell: (1) vascular leakage into the parenchyma, as measured with sulfo-NHS-biotin, in TGFBR1 endothelial CKO retinas is far less than in NdpKO retinas, where nearly all ECs convert to a fenestration+ (PLVAP+) phenotype and there is leakage of sulfo-NHS-biotin, (2) ICAM1 in ECs in TGFBR1 endothelial CKO retinas increases several-fold more than in NdpKO or Frizzled4KO retinas, (3) TGFBR1 endothelial CKO retinas have more infiltrating immune cells than NdpKO or Frizzled4KO retinas, and (4) in TGFBR1 endothelial CKO retinas large numbers of immune cells are observed within and adjacent to blood vessels.  We think that the simplest explanation for these data is that loss of TGFbeta signaling in ECs causes an endothelial inflammatory state with enhanced immune cell extravasation.  That said, the case for this model is not water-tight, and there could be less direct mechanisms at play.  In particular, this model does not explain why the inflammatory phenotype is limited to CNS (and especially retinal) vasculature.

      Regarding the last sentence of the reviewer’s comment (“Late stage induction…”), we have tried activating CreER recombination at different ages and we observe a large reduction in the inflammatory phenotype when recombination is initiated after vascular development is complete.   This observation suggests that the vascular developmental/anatomic defect – and perhaps the resulting retinal hypoxia response – is required for the inflammatory phenotype.  In the revised manuscript we have expanded the Results and Discussion sections to describe this observation.

      Reviewer #1 (Recommendations for the authors): 

      Suggestions for experiments: 

      (1) The authors need to show a quantitative comparison of the number of choroidal neovascular tufts per whole eye cross-section in both genotypes (TgfbR1 and TgfbR2 KO mice). 

      Thank you for raising this point.  The quantification in the original version of Figure 1- Figure supplement 1 panel C was mis-labeled.  It quantifies choroidal neovascularization (CNV) in both VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, not VE-cad-CreER;TGFBR1 CKO/- retinas only as originally labeled.  The point it makes is that CNV is seen with loss of TGF-beta signaling but not in control retinas or retinas with loss of Norrin signaling.  We have now corrected that plot by separating the data points for VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, so that they can be compared to each other.   The result shows ~2.5-fold more CNV in VE-cad-CreER;TGFBR2 CKO/- retinas compared to VE-cad-CreER;TGFBR1 CKO/-.  This is now described in the Results section. 

      (2) In the analysis of Sulfo-NHS-Biotin leakage in the retina to assess blood-retina barrier maturation, the authors should provide a detailed quantification of the leakage phenotype outside the vessels into the CNS parenchyma, both in the retina and brain, in TgfbR1 KO mice. 

      Thank you for raising this point.  There is no detectable Sulfo-NHS-biotin leakage into the brain parenchyma in VE-cad-CreER;TGFBR1 CKO/- mice.  We have expanded Figure 2 to show and quantify the data for retinal vascular leakage (Figure 2C and D).  The data show that in VE-cad-CreER;TGFBR1 CKO/- mice there is accumulation of Sulfo-NHS-biotin in the vascular tufts but minimal accumulation elsewhere in the retinal vasculature and minimal leakage of Sulfo-NHS-biotin into the retinal parenchyma.

      (3) The immune cell phenotyping by snRNAseq is premature, as the number of cells is very small. The authors should sort for CD45+ cells and perform single-cell RNA sequencing to ascertain these preliminary data. 

      Thank you for raising this point.  We have performed additional snRNAseq analyses using the same tissue processing protocol as for our original snRNAseq data to increase the numbers of cells.  We have opted to homogenize the tissue and prepare nuclei (our original method) rather than dissociating the cells and FACS sorting for CD45+ cells because the nuclear isolation approach is unbiased – we assume that nuclei from all cell types are present.  By contrast, we cannot be certain that CD45 FACS will capture the full range of immune cells, since some cells may not express CD45, may express CD45 at low levels, or may be tightly adherent to other cells, such as vascular endothelial cells.  Additionally, by following the original protocol, we can combine the original snRNAseq dataset and the new snRNAseq dataset.  In the revised manuscript we present the snRNAseq data from the combination of the original and the more recent snRNAseq datasets (revised Figure 4; N=628 immune cell nuclei).  The new analysis comes to the same conclusion as in the original submission, namely that the immune cell infiltrate in the mutant retinas is composed of a wide variety of immune cells.  The Results section has been expanded to describe these new data and analyses.

      (4) The analysis of BBB leakage phenotype in TgfbR1 KO mice needs to be more detailed and include tracers as well as serum IgG leakage. 

      Sulfo-NHS biotin leakage in the VE-cad-CreER;TGFBR1 CKO/- brain is minimal, and it is indistinguishable from WT controls.  Since Sulfo-NHS biotin is a low MW tracer (<1 kDa), this implies that loss of TGF-beta signaling does not increase non-specific diffusion of either low or high MW molecules.  Therefore, the elevated levels of IgG in the brain parenchyma in young VE-cad-CreER;TGFBR1 CKO/- mice (Figure 8A) likely represent specific transport of IgG across the BBB.  Such transport is known to occur via Fc receptors expressed on vascular endothelial cells, although it is normally greater in the brain-to-blood direction than in the blood-to-brain direction.  For example, see Lafrance-Vanasse et al (2025) Leveraging neonatal Fc receptor (FcRn) to enhance antibody transport across the blood brain barrier.  Nat Commun. 16:4143.  This is now described in greater detail in the Results section.

      (5) The authors should perform a more detailed RNAseq analysis of tip and stalk cells in Tgfbr1 KO mice to determine whether D tip cells are lost in these mutants by snRNAseq. 

      The proposed snRNAseq analysis would serve as an independent check on the diving (D) tip vs stalk cell analyses published by Zarkada et al, who analyzed the same VE-cad-CreER;TGFBR1 CKO/- mutant mice, although they refer to the TGFBR1 gene by its alternate name ALK5 [Zarkada et al (2021) Specialized endothelial tip cells guide neuroretina vascularization and blood-retina-barrier formation. Dev Cell 56:2237-2251].  We have not gone in this direction because the question of tip vs. stalk cells and of subtypes of tip cells in WT vs. mutant retinas is beyond our focus on choroidal neovascularization and the role of immune cells and vascular inflammation.  The proposed snRNAseq analysis would also require a major effort since tip cells are rare and must be harvested from large numbers of early postnatal retinas followed by FACS enrichment for vascular endothelial cells.

      Suggestions for improving the manuscript:  

      (6) The statement that ECs acquire properties of immune cells (Page 2, Line 90) is incorrect. Endothelial cells may acquire characteristics of antigen presenting cells. 

      Thank you for that correction.  Based on the review from Amersfoort et al (2022) (Amersfoort J, Eelen G, Carmeliet P. (2022) Immunomodulation by endothelial cells - partnering up with the immune system? Nat Rev Immunol 22:576-588) and the articles cited in it, we have changed the sentence to “Although vascular endothelial cells (ECs) are not generally considered to be part of the immune system, in some locations and under some conditions they acquire properties characteristic of immune cells, including secretion of cytokines, surface display of co-stimulatory or co-inhibitory receptors, and antigen presentation in association with MHC class II proteins (Pober and Sessa, 2014; Amersfoort et al., 2022).”  

      (7) The statement in Page 3, Line 100-101 [In CNS ECs, quiescence is maintained in part by the actions of astrocyte-derived Sonic Hedgehog, with the result that few immune cells other than resident microglia are found within the CNS (Alvarez et al., 2011).] is incomplete. Wnt signaling also suppresses the expression of leukocyte adhesion molecules from endothelial cells and therefore helps with immune cell quiescence. 

      Thank you for raising that point.  We have expanded that sentence to include Wnt signaling in CNS endothelial cells, as described in the following reference: Lengfeld JE, Lutz SE, Smith JR, Diaconu C, Scott C, Kofman SB, Choi C, Walsh CM, Raine CS, Agalliu I, Agalliu D. (2017) Endothelial Wnt/beta-catenin signaling reduces immune cell infiltration in multiple sclerosis. Proc Natl Acad Sci USA 114:E1168-E1177.

      (8) It may be beneficial for the reader to separate the results of the vascular phenotypes related to choroidal neovascularization compared to retinal vascular development. 

      Thank you for this suggestion.  The two topics are partly overlapping: choroidal neovascularization is described in Figure 1, and retinal development is described in Figures 1 and 2.  The challenge is that some of the same images illustrate both phenotypes, as in Figure 1, so the topics cannot be easily separated.

      (9) In addition to comparing the phenotypes in Tgfb signaling mutant mice with Wnt signaling and VEGF-A signaling mutants, the authors should compare and contrast their data with those found in Alk5 KO mice, as there are a lot of similarities. 

      The reviewer has alerted us to a nomenclature challenge which we will try to resolve in the introduction: Alk5 is just another name for TGFBR1.  The reviewer is correct: there are a lot of similarities between the present study and that of Zarkada et al (2021) because both use the same TGFBR1(=Alk5) CKO mice.

      Reviewer #2 (Recommendations for the authors): 

      Figure 2 

      For 2B, the authors should clarify whether the two regions shown in the Tgfbr1 KO retina (P14) represent central vs. peripheral areas, as phenotype severity varies. 

      For 2C, does the uneven biotin accumulation reflect developmental gradients (e.g., central-peripheral maturation timing)? 

      Thank you for raising these points.  Regarding Figure 2B, these images are all from the mid-peripheral retina, where the phenotype is moderately severe.  This is now noted in the figure legend.

      Regarding Figure 2C, the reviewer is correct that the pattern of Sulfo-NHS-biotin is uneven in VEcadCreER;Tgfbr1CKO/- retinas – it accumulates only in the tufts.  We have expanded Figure 2C to show a comparison between control (i.e. phenotypically WT), NdpKO, and TGFBR1 endothelial KO retinas, and we have expanded the associated part of the Results section.  In a nutshell, control retinas show little Sulfo-NHS-biotin accumulation in the vasculature or in the parenchyma; NdpKO retinas show Sulfo-NHS-biotin accumulation in the vasculature and in the parenchyma (i.e., the area between the vessels); and VEcadCreER;Tgfbr1CKO/- retinas show Sulfo-NHS-biotin accumulation in the vascular tufts with minimal accumulation in the non-tuft vasculature and minimal leakage into the parenchyma.   The conclusion is that the bulk of the retinal vasculature in TGFBR1 endothelial KO mice is not leaky – very different from the situation with loss of Norrin/Frizzled4 signaling.

      Figure 6 

      The claim that PECAM1+ rings on veins reflect EC-immune cell binding is uncertain, as PECAM1 is also known to be expressed by immune cells. The complete correlation of PECAM1 and CD45 staining signals suggests that a subset of immune cells upregulates PECAM1. The VEcadCreER;Tgfbr1 flox/-; SUN1:GFP reporter would be helpful to delineate EC-immune cell proximity. Super-resolution imaging with Z-stacks could also resolve spatial relationships (luminal vs. abluminal immune cell adhesion). 

      Thank you for this comment.  The reviewer is correct that, at the resolution of these images, we cannot determine whether the PECAM1 immunostaining signal is derived from ECs, from leukocytes, or from both.  This is now stated in the Results section.  The PECAM1-rich endothelial ring structure associated with leukocyte extravasation has been characterized in various publications, for example in (1) Carman CV, Springer TA. (2004) A transmigratory cup in leukocyte diapedesis both through individual vascular endothelial cells and between them. J Cell Biol 167:377-388 and (2) Mamdouh Z, Mikhailov A, Muller WA. (2009) Transcellular migration of leukocytes is mediated by the endothelial lateral border recycling compartment. J Exp Med 206:2795-2808.  The ring structures visualized in Figure 6D by PECAM1 immunostaining conform to the ring structures described in these and other papers.  In showing these structures, our point is simply that they likely represent sites of leukocyte extravasation.  This is now clarified in the text.  We have also added some additional references on leukocyte extravasation and the ring structures.

      Figure 7 

      A time-course analysis of ICAM1 would strengthen the mechanistic model. Does ICAM1 upregulation precede immune infiltration (supporting inflammation as the primary defect)? Given that immune cells appear by P14 (per snRNA-seq), is ICAM1 elevated earlier? 

      This is an interesting idea, but based on what is known about leukocyte adhesion and extravasation we predict that there will not be a clean temporal separation between ICAM1 induction and leukocyte adhesion/infiltration.  That is, if the pro-inflammatory state causes an increase in the number of leukocytes, then as ICAM1 levels increase, leukocyte adhesion would also increase.  Similarly, if the presence of leukocytes increases the pro-inflammatory state, then as the number of leukocytes increases, the levels of ICAM1 would be predicted to increase.  Thus, we think that a time course analysis is unlikely to provide a definitive conclusion.

      Figure 8-SF1 

      In brain slices, a transient pan-IgG accumulation suggests a self-resolving defect in the BBB. However, this BBB impairment appears to be spatiotemporally distinct from ICAM1 upregulation. ICAM1 staining is restricted to the lesion site, aligning with immune cell-driven inflammation. 

      Thank you for raising these points.  The reviewer is correct that these observations don’t fit together in a clear way.  There does not appear to be a general increase in brain vascular permeability in VE-cad-CreER;TGFBR1 CKO/- mice, as shown by sulfo-NHS-biotin.  However, there is a large and transient increase in IgG in the brain parenchyma, suggestive of a general vascular alteration, and – as the reviewer correctly notes – it is not accompanied by a generalized increase in ICAM1 vascular immunostaining.  At this point, we don’t have any real insight into the mechanistic basis of the transient IgG increase.

      Thank you for handling this manuscript.

    1. Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. In an adapted version of the ultimatum game (UG) with three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants' choice preferences (i.e., rejection rate and fairness rating) were influenced by the learning phase, assessed by contrasting the transfer phase with the baseline phase. Through a series of statistical and computational modeling analyses, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting: it directly adapts the lab's previous work on the observational learning effect on disadvantageous inequality aversion to test both advantageous and disadvantageous inequality aversion. Social transmission of action, emotion, and attitude has started to be examined recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined vicarious inequality aversion in conditions where feedback was never provided, is interesting and important for strengthening the reported effects. Both studies provide proper justification for the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be used interchangeably in this work, which they should not be: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the consequences that result directly from those actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback on whether that prediction is correct or not. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. But the intro and setup are heavily centered on vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I think the authors should either tone down the focus on vicarious learning, or discuss how the two differ. Some of the references here may be helpful: Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10, 10)". I wonder what this looks like. Are the resulting offers integers only, such as 25:75, or can they have decimal points? More importantly, is it possible for either a 70:30 or a 90:10 option, after adding the noise, to generate an 80:20 split shown to the participants? If so, for the later analyses, when participants saw the 80:20 split, which condition did this trial belong to: 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.
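
      The overlap question in comment (2) can be made concrete with a small sketch. It assumes, purely for illustration, that the noise is an integer added to the proposer's share, and it compares an exclusive range with an inclusive one. The function `noisy_shares` and its parameters are hypothetical names, not taken from the manuscript.

```python
def noisy_shares(base_share, half_width=10, inclusive=False):
    """Return the set of integer proposer shares reachable from `base_share`.

    Assumption: noise is an integer drawn from (-half_width, half_width),
    or from [-half_width, half_width] when `inclusive` is True.
    """
    low, high = base_share - half_width, base_share + half_width
    if not inclusive:
        low, high = low + 1, high - 1
    return set(range(low, high + 1))


for inclusive in (False, True):
    overlap = noisy_shares(70, inclusive=inclusive) & noisy_shares(90, inclusive=inclusive)
    label = "inclusive" if inclusive else "exclusive"
    print(f"{label} bounds: shared offers = {sorted(overlap)}")
```

      Under these assumptions, the 70:30 and 90:10 conditions share an offer (80:20) only when the bounds are inclusive, so whether the ambiguity arises depends on how the noise range was implemented.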

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order, especially for the learning phase, can largely impact the preference learning of the participants.

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      (5) I do not find this analysis particularly appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, participants' behavior in the first 5 trials of the learning phase will be similar to that in the baseline, and their behavior in the last 5 trials of the learning phase will echo that in the transfer phase. I think it would be stronger to link the preference learning results to the change between the baseline and transfer phases, e.g., by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schmidt model? This way, the change in alpha/beta can also be examined between the baseline and transfer phase.
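
      For readers unfamiliar with the model named in comment (6), the following is a minimal sketch of the standard Fehr-Schmidt utility applied to a responder in the ultimatum game. It is not the authors' implementation: the logistic choice rule and the `temperature` parameter are assumptions added for illustration, and the parametrization used in the manuscript may differ.

```python
import math


def fs_utility(own, other, alpha, beta):
    """Standard Fehr-Schmidt utility of accepting a split.

    alpha weights disadvantageous inequity (other > own),
    beta weights advantageous inequity (own > other).
    """
    return own - alpha * max(other - own, 0) - beta * max(own - other, 0)


def p_reject(own, other, alpha, beta, temperature=1.0):
    """Probability of rejecting, assuming a logistic (softmax) choice rule.

    Rejection gives both players 0, so its utility is 0.
    """
    u_accept = fs_utility(own, other, alpha, beta)
    return 1.0 / (1.0 + math.exp(u_accept / temperature))


# A 70:30 offer (proposer:responder) means own=30, other=70 for the responder.
print(fs_utility(30, 70, alpha=0.75, beta=0.0))  # indifference point for alpha
print(p_reject(30, 70, alpha=0.75, beta=0.0))
print(p_reject(30, 70, alpha=2.0, beta=0.0))
```

      Fitting `alpha` and `beta` separately to baseline and transfer choices would give the phase-wise parameter comparison the comment asks about.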

      (7) I quite liked Study 2, which tests the generalization effect, and I expected to see an adapted computational model that directly reflects this idea. Indeed, the authors wrote "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly does the preference learning model assume this generalization? In the methods, the modeling seems to be only about Study 1; did the authors adapt their model to accommodate Study 2? The authors also ran simulations for the learning phase in Study 2 (Figure 6); how were the preferences updated (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/unpacking the computational modeling results for Study 2 would be very helpful for the paper.

      Comments on revisions:

      I kept my original public review, so that future readers can see the progress and development of the manuscript.

      The authors have largely addressed my original questions/concerns, and I have two outstanding comments.

      (a) Related to my original comment #6, where I suggested to apply the F-S model also to the baseline and transfer phase. The authors were inclined not to do it, but in fact later in comment #7 and in the manuscript they opted to use a more complex F-S-based model to their learning phase. I agree that the rejection rate is indeed a clear indication, but for completeness, it'd be more consistent and compelling if the paper follows a model-free (model-agnostic) and model-based approach in all phases of the experiment.

      (b) Related to my original comment #4, I appreciate that the authors have provided more details of their LMMs. But I don't think the models are accurate regardless. First, the offer levels (50:50, 30:70, 10:90) should not be coded as purely categorical levels. They have an ordinal meaning, so a single ordinal predictor with three levels should be used. This also avoids the excessive number of interactions the authors have pointed out.

      Second, running a model with only interactions without main effects is flawed. All textbooks on stats emphasize that without the presence of the main effects, the interpretation of interaction only is biased.

      So these LMMs need to be revised before the manuscript eventually gets to a version of record.

    2. Reviewer #2 (Public review):

      Summary:

      This study investigates whether individuals can learn to adopt egalitarian norms that incur a personal monetary cost, such as rejecting offers that benefit them more than the giver (advantageous inequitable offers). While these behaviors are uncommon, two experiments aim to demonstrate that individuals can learn to reject such offers by observing a "teacher" who follows these norms. The authors use computational modelling to argue that learners adopt these norms through a sophisticated process, inferring the latent structure of the teacher's preferences, akin to theory of mind.

      Strengths:

      This paper is well-written and tackles an important topic relevant to social norms, morality, and justice. The findings are promising (though further control conditions are necessary to support the conclusions). The study is well-situated in the literature, with a clever experimental design and a computational approach that may offer insights into latent cognitive processes. In the revision, the authors clarified some questions related to the initial submission.

      Weaknesses:

      Despite these strengths, I remain unconvinced that the current evidence supports the paper's central claims. Below, I outline several issues that, in my view, limit the strength of the conclusions.

      (1) Experimental Design and Missing Control Condition:

      The authors set out to test whether observing a "teacher" who is averse to advantageous inequity (Adv-I) will affect observers' own rejection of Adv-I offers. However, I think the design of the task lacks an important control condition needed to address this question. At present, participants are assigned to one of two teachers: DIS or DIS+ADV. Behavioral differences between these groups can only reveal relative differences in influence; they cannot establish whether (and how) either teacher independently affects participants' own behavior. For example, a significant difference between conditions can emerge even if participants are only affected by the DIS teacher and are not affected at all by the DIS+ADV teacher. What is crucially missing here is a no-teacher control condition, which can then be compared with each teacher condition separately. This control condition would also control for pure temporal effects unrelated to teacher influence (e.g., increasing Adv-I rejections due to guilt build-up).

      While this criticism applies to both experiments, it is especially apparent in Experiment 2. As shown in Figure 4, the interaction for 10:90 offers reflects a decrease in rejection rates following the DIS teacher, with no significant change following the DIS+ADV teacher. Ignoring temporal effects, this pattern suggests that participants may be learning NOT to reject from the DIS teacher, rather than learning to reject from the DIS+ADV teacher. On this basis, I do not see convincing evidence that participants' own choices were shaped by observing Adv-I rejections.

      In the Discussion, the authors write that "We found that participants' own Adv-I-averse preferences shifted towards the preferences of the Teacher they just observed, and the strength of these contagion effects related to the degree of behavior change participants exhibited on behalf of the Teachers, suggesting that they internalized, at least somewhat, these inequity preferences." However, there is no evidence that directly links the degree of behaviour change (on the teacher's behalf) to contagion effects (own behavioural change). I think there was a relevant analysis in the original version, but it was removed from the current version.

      (2) Modelling Efforts: The modelling approach is underdeveloped. The identification of the "best model" lacks transparency, as no model-recovery results are provided. Additionally, behavioural fits for the losing models are not shown, leaving readers in the dark about where these models fail. Readers would benefit from seeing qualitative/behavioural patterns that favour the winning model. Moreover, the reinforcement learning (RL) models used are overly simplistic, treating actions as independent when they are likely inversely related. For example, the feedback that the teacher would have rejected an offer provides evidence that rejection is "correct" but also that acceptance is "an error," and the latter is not incorporated into the modelling. In other words, offers are modelled as two-armed bandits (where separate values are learned for reject and accept actions), but the situation is effectively a one-armed bandit (if one action is correct, the other is mistaken). It is unclear to what extent this limitation affects the current RL formulations. Can the authors justify/explain their reasoning for including these specific variants? The manuscript only states Q-values for reject actions, but what are the Q-values for accept actions? This is unclear.
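
      The distinction drawn in comment (2) between independent and inversely related action values can be sketched as two update rules. This is an illustrative sketch, not the manuscript's model: the function names, the 0/1 coding of feedback (1 = the choice matched the teacher), and the learning rate are all assumptions.

```python
def update_independent(q, action, reward, learning_rate):
    """Update only the chosen action's value (actions treated as independent)."""
    q = dict(q)
    q[action] += learning_rate * (reward - q[action])
    return q


def update_coupled(q, action, reward, learning_rate):
    """Also update the unchosen action with the counterfactual outcome.

    Assumption: feedback is binary, so if the chosen action was correct
    (reward = 1) the other action would have been an error (reward = 0).
    """
    other = "accept" if action == "reject" else "reject"
    q = dict(q)
    q[action] += learning_rate * (reward - q[action])
    q[other] += learning_rate * ((1 - reward) - q[other])
    return q


q0 = {"accept": 0.5, "reject": 0.5}
print(update_independent(q0, "reject", reward=1, learning_rate=0.5))
print(update_coupled(q0, "reject", reward=1, learning_rate=0.5))
```

      In the coupled variant a single piece of feedback moves both values in opposite directions, which is the information the comment argues the reported models leave unused.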

      In Experiment 2, only the preferred model is capable of generalization, so it is perhaps unsurprising that this model "wins." However, this does not strongly support the proposed learning mechanism, lacking a comparison with simpler generalizing mechanisms (see following comments).

      (3) Conceptual Leap in Modelling Interpretation: The distinction between simple RL models and preference-inference models seems to hinge on the ability to generalize learning from one offer to another. Whereas in the RL models, learning occurs independently for each offer (hence no cross-offer generalization), preference inference allows for generalization between different offers. However, the paper does not explore "model-free" RL models that allow generalization based on the similarity of features of the offers (e.g., payment for the receiver, payment for the offer-giver, who benefits more). Such models are more parsimonious and could explain the results without invoking a theory of mind or any modelling of the teacher. In such model versions, a learner acquires a functional form that allows prediction of the teacher's feedback based on offer features (e.g., linear or quadratic weighting). Because feedback for an offer modulates the parameters of this function (feature weights), generalization occurs without necessarily evoking any sophisticated model of the other person. This leaves open the possibility that RL models could perform just as well or even outperform the preference learning model, casting doubt on the authors' conclusions.
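
      The kind of feature-based learner described in comment (3) can be sketched as a logistic predictor of the teacher's feedback whose weights are updated by a delta rule. Everything here is a hypothetical illustration: the feature set, the learning rate, and the deterministic teacher (rejecting 70:30 and 30:70, accepting 50:50) are assumptions, and no result from the manuscript is reproduced.

```python
import math


def offer_features(proposer_share):
    """Features: bias, disadvantageous gap, advantageous gap (responder's view)."""
    receiver_share = 100 - proposer_share
    gap = (proposer_share - receiver_share) / 100.0
    return [1.0, max(gap, 0.0), max(-gap, 0.0)]


def predict_reject(weights, proposer_share):
    """Predicted probability that the teacher rejects this offer."""
    z = sum(w * x for w, x in zip(weights, offer_features(proposer_share)))
    return 1.0 / (1.0 + math.exp(-z))


def train(trials, learning_rate=0.5, epochs=500):
    """Delta-rule updates of the feature weights from teacher feedback."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for proposer_share, teacher_rejected in trials:
            error = teacher_rejected - predict_reject(weights, proposer_share)
            x = offer_features(proposer_share)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights


# Feedback is given only for the moderate offers, as in a generalization test.
trials = [(70, 1), (50, 0), (30, 1)]
weights = train(trials)
for share in (90, 70, 50, 30, 10):
    print(f"{share}:{100 - share}", round(predict_reject(weights, share), 3))
```

      Because feedback adjusts shared feature weights rather than offer-specific values, the predicted rejection probability for the untrained 90:10 and 10:90 offers exceeds that for the trained 70:30 and 30:70 offers, without any explicit model of the teacher's preferences.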

      Of note: even the behaviourists knew that when Little Albert was taught to fear rats, this fear generalized to rabbits. This could occur simply because rabbits are somewhat similar to rats. But this doesn't mean Little Albert had a sophisticated model of animals that he used to infer how they behave.

      In their rebuttal letter, the authors acknowledge these possibilities, but the manuscript still does not explore or address alternative mechanisms.

      (4) Limitations of the Preference-Inference Model: The preference-inference model struggles to capture key aspects of the data, such as the increase in rejection rates for 70:30 DI offers during the learning phase (e.g., Fig. 3A, AI+DI blue group). This is puzzling. Thinking about this, I realized the model makes quite strong, unintuitive predictions which are not examined. For example, if a subject begins the learning phase rejecting the 70:30 offer more than 50% of the time (meaning the starting guilt parameter is higher than 1.5), then, over learning, the tendency to reject will decrease to below 50% (the guilt parameter will be pulled down below 1.5). This is despite the fact that the teacher rejects 75% of the offers. In other words, as learning continues, learners will diverge from the teacher. On the other hand, if a participant begins learning by tending to accept this offer (guilt < 1.5), then during learning, they can increase their rejection rate but never above 50%. Thus, one can never fully converge on the teacher. I think this relates to the model's failure in accounting for the pattern mentioned above. I wonder if individuals actually abide by these strict predictions. In any case, these issues raise questions about the validity of the model as a representation of how individuals learn to align with a teacher's preferences (given that the model doesn't really allow for such an alignment).

      In their rebuttal letter, the authors acknowledged these anomalies and stated that they were able to build a better model (where anomalies are mitigated, though not fully eliminated). But they still report the current model and do not develop/discuss alternatives. A more principled model may be a Bayesian model where participants learn a belief distribution (rather than point estimates) regarding the teacher's parameters.

      (5) Statistical Analysis: The authors state in their rebuttal letter that they used the most flexible random effect structure in mixed-effects models. But this seems not to be the case in the model reported in Table SI3 (the very same model was used for other analyses too). Indeed, here it seems only intercepts are random effects. This left me confused about which models were used.
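
      For reference, the distinction at issue can be written out as follows. This is a generic sketch, assuming a binary accept/reject outcome with a logistic link and a single within-participant predictor $x_{ij}$ for trial $i$ of participant $j$; it is not the specification reported in Table SI3.

```latex
% Random intercepts only: participants differ in baseline, but share one slope
\operatorname{logit} P(y_{ij} = 1) = \beta_0 + u_{0j} + \beta_1 x_{ij},
\qquad u_{0j} \sim \mathcal{N}(0, \sigma_0^2)

% Random intercepts and random slopes: the effect of x also varies by participant
\operatorname{logit} P(y_{ij} = 1) = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, x_{ij},
\qquad (u_{0j}, u_{1j})^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma)
```

      A "most flexible" random-effects structure would correspond to the second form, extended to every within-participant predictor in the model.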

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Most human traits and common diseases are polygenic, influenced by numerous genetic variants across the genome. These variants are typically non-coding and likely function through gene regulatory mechanisms. To identify their target genes, one strategy is to examine if these variants are also found among genetic variants with detectable effects on gene expression levels, known as eQTLs. Surprisingly, this strategy has had limited success, and most disease variants are not identified as eQTLs, a puzzling observation recently referred to as "missing regulation". 

      In this work, Jeong and Bulyk aimed to better understand the reasons behind the gap between disease-associated variants and eQTLs. They focused on immune-related diseases and used lymphoblastoid cell lines (LCLs) as a surrogate for the cell types mediating the genetic effects. Their main hypothesis is that some variants without eQTL evidence might be identifiable by studying other molecular intermediates along the path from genotype to phenotype. They specifically focused on variants that affect chromatin accessibility, known as caQTLs, as a potential marker of regulatory activity. 

      The authors present data analyses supporting this hypothesis: several disease-associated variants are explained by caQTLs but not eQTLs. They further show that although caQTLs and eQTLs likely have largely overlapping underlying genetic variants, some variants are discovered only through one of these mapping strategies. Notably, they demonstrate that eQTL mapping is underpowered for gene-distal variants with small effects on gene expression, whereas caQTL mapping is not dependent on the distance to genes. Additionally, for some disease variants with caQTLs but no corresponding eQTLs in LCLs, they identify eQTLs in other cell types. 

      Altogether, Jeong and Bulyk convincingly demonstrate that for immune-related diseases, discovering the missing disease-eQTLs requires both larger eQTL studies and a broader range of cell types in expression assays. It remains to be seen what fraction of the missing disease-eQTLs will be discovered with either strategy and whether these results can be extended to other diseases or traits. 

      We thank the reviewer for their accurate summary of our study and positive review of our findings for immune-related diseases.

      It should be noted that the problem of "missing regulation" has been investigated and discussed in several recent papers, notably Umans et al., Trends in Genetics 2021; Connally et al., eLife 2022; Mostafavi et al., Nat. Genet. 2023. The results reported by Jeong and Bulyk are not unexpected in light of this previous work (all of which they cite), but they add valuable empirical evidence that mostly aligns with the model and discussions presented in Mostafavi et al. 

      We thank the reviewer for their positive review of our results and manuscript. As Reviewer #1 noted, whether our and others' observations extend to other diseases or traits is an open question. For instance, Figure 2b in Mostafavi et al., Nat. Genet. (2023) demonstrated that there was a spectrum of depletion of eQTLs and enrichment of GWAS signals in constrained genes across various tissues and traits, respectively. Therefore, gene expression constraint may play a larger or smaller role in different diseases or traits. The extreme diversity of immune cell types and cell states (Schmiedel et al., Cell (2018) and Calderon et al., Nat. Genet. (2019), to name just two) likely adds to the complexity of gene regulation that contributes to immune-mediated disease.

      Reviewer #2 (Public Review): 

      Summary: 

      eQTLs have emerged as a method for interpreting GWAS signals. However, some GWAS signals are difficult to explain with eQTLs. In this paper, the authors demonstrated that caQTLs can explain these signals. This implies that for GWAS signals to translate into disease phenotypes, they need to be accessible within the chromatin. 

      However, fundamentally, caQTLs, like GWAS, have the limitation of not being able to determine which genes mediate the influence on disease phenotypes. This limitation is consistent with the constraints observed in this study. 

      We thank the reviewer for their accurate summary of our results.

      (1) For reproducibility, details are necessary in the method section.

      Details about adding YRI samples in ATAC-seq: For example, how many samples are there, and what is used among public data? There is LCL-derived iPSC and differentiated iPSC (cardiomyocytes) data, not LCL itself. How does this differ from LCL, and what is the rationale for including this data despite the differences?

      Banovich et al., Genome Research (2018) (PMID: 29208628), who generated data using LCL-derived iPSCs and differentiated iPSCs (cardiomyocytes), also generated ATAC-seq data from 20 YRI LCL samples. We analyzed those data to identify open chromatin regions (i.e., ATAC-seq peaks) in LCLs and merged the regions with open chromatin regions identified with 100 GBR LCL samples from two studies by Kumasaka et al. (Nature Genetics (2016) PMID: 26656845 and Nature Genetics (2019) PMID: 30478436). However, we restricted the caQTL analysis to only the 100 GBR samples because of possible ancestry effects and batch effects. We attempted caQTL analysis with the 20 YRI samples as well, but the result was noisy, likely due to the smaller sample size and lower read depth of the ATAC-seq data.

      caQTL is described as having better power than eQTL despite having fewer samples. How does the number of ATAC peaks used in caQTL compare to the number of gene expressions used in eQTL?

      The number of ATAC peaks used in caQTL (99,320) is ~6.7 times greater than the number of genes (14,872) used in the eQTL analysis. Therefore, there is a higher chance of detecting a significant caQTL signal and a significant colocalization signal than there is for eQTLs. However, we reasoned that since distal eQTLs are more easily detected as caQTLs and since increasing the sample size of eQTLs through meta-analysis uncovered additional eQTL colocalization at loci with caQTL colocalization only, colocalized caQTLs are likely capturing disease-relevant regulatory effects.

      Details about RNA expression data: In the method section, it states that raw data (ERP001942) was accessed, and in data availability, processed data (E-GEUV-1) was used. These need to be consistent.

      Thank you for pointing this out. We used the processed data from Expression Atlas (https://www.ebi.ac.uk/gxa/experiments/E-GEUV-1/Results), and that's what we meant by "We downloaded RNA expression level data of the LCL samples from the Expression Atlas." We have revised the “RNA expression data preparation” section in our manuscript to make the text clearer.

      How many samples were used (the text states 373, but how was it reduced from the original 465, and the total genotype is said to be 493 samples while ATAC has n=100; what are the 20 others?), and it mentions European samples, but does this exclude YRI?

      We thank the reviewer for pointing out these points of confusion. Our reported count of 493 samples included YRI samples with RNA-seq data or ATAC-seq data that we ultimately did not use for QTL analyses. There were 373 European samples with RNA-seq data that we used for eQTL analysis, and 100 GBR samples (including some that overlap with the 373 European samples) that we used for caQTL analysis. We have revised the text to clarify these points.

      (2) Experimental results determining which TFs might bind to the representative signals of caQTL are required.

      We agree that caQTL colocalization is just the start of elucidating the regulatory mechanism of a GWAS locus. Determining which TFs are bound and which TFs' binding is altered would be necessary to describe the causal regulatory mechanism. For this, we utilized the Cistrome database to search for TFs whose binding overlaps the colocalized caQTL peaks. We present the results of this analysis in Supplementary Table 3 and Supplementary Figure 4, both of which we have added in our revised manuscript. Overall, protein factors associated with active transcription, such as POLR2A, and several immune cell TFs, including RUNX3, SPI1, and RELA, were frequently detected in those peaks. Detecting these factors in most peaks supports the likelihood that the colocalized caQTL peaks are active cis-regulatory elements. These results are consistent with our observation of enriched caQTL-mediated heritability in regions with active histone marks (Figure 1).

      (3) It is stated that caQTLs are less tissue-specific than eQTLs; would caQTL mapping performed with ATAC-seq results from different cell types yield similar results?

      We thank the reviewer for the question. Calderon et al. (PMID: 31570894) observed that "most effects on allelic imbalance (of ATAC-seq) were shared regardless of lineage or condition". Yet, there were regions where a different cell type or state would show inaccessibility (Figure 4d in Calderon et al.). Thus, we expect that ATAC-seq results from different cell types (e.g., T cells, B cells, monocytes, etc.) would lead to additional caQTLs showing colocalization at cell-type-specific open chromatin. However, if a region is accessible in both cell types, a caQTL may be detected in both. Moreover, Alasoo et al., Nature Genetics (2018) (PMID: 29379200) observed that “many disease-risk variants affect chromatin structure in a broad range of cellular states, but their effects on expression are highly context specific.” In both studies, the authors investigated immune cell types, and there could be different observations in non-immune cell types and other diseases and traits.

      Reviewer #1 (Recommendations For The Authors): 

      I think it would strengthen the paper to explore gene-level differences in the discovery of caQTLs and eQTLs. For example, complex disease-relevant genes, on average, have more/longer regulatory domains (as shown by Wang and Goldstein, AJHG 2020; Mostafavi et al., Nat. Genet. 2023). Therefore, it is plausible that for such genes, caQTLs are much more easily discoverable than eQTLs due to (i) a larger mutational target size for caQTLs, and (ii) dispersion of expression heritability across multiple domains, which hampers the discovery of eQTLs but not caQTLs, which are studied independently of other domains in the region. In other words, discovered caQTLs and eQTLs likely vary in terms of their distance to genes (as the authors report), as well as their target genes.

      We thank the reviewer for the suggestion to explore gene-level differences. We expect the fact that complex disease-relevant genes have, on average, more/longer regulatory domains to explain our observations. We agree with both of your points: many more regulatory elements are captured as accessible regions than there are expressed genes, and genes often have multiple independent eQTLs, which leads to dispersion of heritability. The gene-level trend that we described was the distance of the regulatory element from the genes. Additional analyses would be a relevant future direction.

      Also considering gene-level analysis, Mostafavi et al. show that the types of biases they report for eQTLs also apply to other molecular QTLs. It would be valuable to compare GWAS hits with versus without caQTL colocalization. Similarly, it would be insightful to compare GWAS hits with both colocalized caQTLs and eQTLs to GWAS hits with colocalized caQTLs but no eQTLs in any of the cell types. 

      We thank the reviewer for the comment. Investigating potential biases in the colocalized caQTLs would be useful, but we considered it beyond the scope of this work. In terms of biological factors, we demonstrated through mediated heritability analyses that more accessible chromatin (based on ATAC-seq read coverage) and regions with active histone marks were enriched for autoimmune disease associations (Figure 1). Furthermore, as greater distance of the regulatory variant from the transcription start site significantly reduced the cis-heritability, we would expect that distance would play a major role, similar to Mostafavi et al.’s conclusions.

      I don't think the argument for the role of natural selection contributing to the "missing regulation" is presented accurately. Specifically, large eQTLs acting on top trait-relevant genes are under stronger selection and thus, on average, segregate at lower frequencies. This makes them difficult to discover in eQTL assays. However, if not lost, they contribute as much, if not more, to trait heritability than weaker eQTLs at the same gene because their larger effects compensate for their lower frequency. At the most extreme, selection should have a "flattening" effect (e.g., see Simons et al., PLOS Biol 2018; O'Connor et al., AJHG 2019): weak and strong eQTLs at the same gene are expected to contribute equally to heritability. Therefore, the statement "Consequently, only weak eQTL variants, often in regions distal to the gene's promoter, may remain and affect traits" is not correct. If this turns out to be empirically true, other models, such as pleiotropic selection, need to explain it. 
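
      The "flattening" argument can be summarized with the standard expression for a variant's contribution to heritability. The sketch below assumes an additive model, a standardized trait, and a simple inverse relationship between effect size and expected heterozygosity under strong selection; it is an idealized illustration of the reasoning, not a result taken from the cited papers.

```latex
% Heritability contributed by variant j with allele frequency p_j and per-allele effect \beta_j
h_j^2 \approx 2\, p_j (1 - p_j)\, \beta_j^2

% Idealized strong selection: expected heterozygosity falls in proportion to 1/\beta_j^2
\mathbb{E}\!\left[ 2\, p_j (1 - p_j) \right] \propto \frac{1}{\beta_j^2}
\quad \Longrightarrow \quad
\mathbb{E}\!\left[ h_j^2 \right] \approx \text{constant, independent of } \beta_j
```

      Under these assumptions, a rare large-effect eQTL and a common weak eQTL at the same gene contribute comparably to heritability, even though only the common one is readily detected in an eQTL assay.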

      We thank the reviewer for the correction. We agree with the comment and have revised the sentences in the introduction accordingly.

      It is worth speculating why caQTLs may be more consistent across cell types than cis-eQTLs. Additionally, readers may infer from the paper that the focus should shift from eQTLs to caQTLs, which may not be the authors' intention. Perhaps these approaches are complementary: caQTLs can help with TSS-distal disease variants, while finding the target gene and regulatory context is more straightforward with eQTL colocalization. Addressing these points in the discussion will be helpful.

      We appreciate the reviewer's suggestion to clarify the advantages of incorporating cis-eQTLs and caQTLs. Our argument is exactly as you put it, and we added a paragraph on this in the Discussion.

      I believe the authors could do more to contextualize their findings within the existing literature on the subject, particularly Umans et al., Trends in Genetics 2021; Connally et al., eLife 2022; and Mostafavi et al., Nat. Genet. 2023. For instance, Umans et al. suggest that "if most standard eQTLs are generally benign, increasing sample size and adding more tissue types in an effort to identify even more standard eQTLs may not help us to explain many more disease risk mutations". Conversely, Mostafavi et al. argue for a multipronged approach, which appears more aligned with the authors' conclusions.

      We followed the reviewer’s suggestion to place our work in the context of existing literature on this topic. Moreover, we clarified what our recommendations for future data generation are.

      I thought Figures 1C-D were unclear. 

      We added a sentence to the figure legend explaining that stronger and more significant enrichment indicates that mediated heritability is concentrated in that subset.

      Reviewer #2 (Recommendations For The Authors): 

      Complete workflow figures for caQTL calling and eQTL calling are required. 

      To improve clarity of the caQTL and eQTL calling workflow, we added Supplementary Figure 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      In this manuscript, Chen et al. investigate the role of the membrane estrogen receptor GPR30 in spinal mechanisms of neuropathic pain. Using a wide variety of techniques, they first provide convincing evidence that GPR30 expression is restricted to neurons within the spinal cord, and that GPR30 neurons are well-positioned to receive descending input from the primary sensory cortex (S1). In addition, the authors put their findings in the context of the previous knowledge in the field, presenting evidence demonstrating that GPR30 is expressed in the majority of CCK-expressing spinal neurons. Overall, this manuscript furthers our understanding of neural circuitry that underlies neuropathic pain and will be of broad interest to neuroscientists, especially those interested in somatosensation. Nevertheless, the manuscript would be strengthened by additional analyses and clarification of data that is currently presented. 

      Strengths: 

      The authors present convincing evidence for the expression of GPR30 in the spinal cord that is specific to spinal neurons. Similarly, complementary approaches including pharmacological inhibition and knockdown of GPR30 are used to demonstrate the role of the receptor in driving nerve injury-induced pain in rodent models. 

      Weaknesses: 

      Although steps were taken to put their data into the broader context of what is already known about the spinal circuitry of pain, more considerations and analyses would help the authors better achieve their goal. For instance, to determine whether GPR30 is expressed in excitatory or inhibitory neurons, more selective markers for these subtypes should be used over CamK2. Moreover, quantitative analysis of the extent of overlap between GPR30+ and CCK+ spinal neurons is needed to understand the potential heterogeneity of the GPR30 spinal neuron population, and to interpret experiments characterizing descending S1 inputs onto GPR30 and CCK spinal neurons. Filling these gaps in knowledge would make their findings more solid. 

      Thank you very much for your constructive feedback.

      In response to your suggestion, we have used more specific markers to distinguish excitatory (VGLUT2) and inhibitory (VGAT) neurons via in situ hybridization. These analyses revealed that GPR30 is predominantly expressed in excitatory neurons of the superficial dorsal horn (SDH), as presented in the Results section (lines 117-120) and in Figure 2A-B.

      Additionally, we performed a quantitative analysis to determine the extent of co-localization between GPR30+ and CCK+ neurons. The data were included in the Results (lines 131–132) and Figure 2G.

      Reviewer #2 (Public review):

      Using a variety of experimental manipulations, the authors show that the membrane estrogen receptor G protein-coupled estrogen receptor (GPER/GPR30) expressed in CCK+ excitatory spinal interneurons plays a major role in the pain symptoms observed in the chronic constriction injury (CCI) model of neuropathic pain. Intrathecal application of the selective GPR30 agonist G-1 induced mechanical allodynia and thermal hyperalgesia in male and female mice. Downregulation of GPR30 in CCK+ interneurons prevented the development of mechanical and thermal hypersensitivity during CCI. They also show the upregulation of AMPA receptor expression by GPR30. 

      Generally, the conclusions are supported by the experimental results. I also would like to see significant improvements in the writing and the description of results. 

      Methodological details for some of the techniques are rather sparse. For example, when examining the co-localization of various markers, the authors do not indicate the number of animals/sections examined. Similarly, when examining the effect of shGper1, it is unclear how many cells/sections/animals were counted and analyzed. 

      In other sections, there is no description of the concentration of drugs used (for example, Figure 4H). In Figures 4C-E, there is no indication of the duration of the recordings, the ionic conditions, the effect of glutamate receptor blockers, etc.

      Some results appear anecdotal in the way they are described. For example, in Figure 5, it is unclear how many times this experiment was repeated. 

      We sincerely appreciate your valuable feedback and thoughtful recommendations.

      To address your concerns regarding methodological transparency, we have added the following details to the revised manuscript:

      The number of animals and sections analyzed in co-localization studies.

      The number of cells/sections/animals used in each quantification following shGper1 treatment.

      The concentrations of drugs administered (e.g., in Figure 4H).

      Detailed recording conditions, including duration, ionic composition, and pharmacological conditions (Figures 4C-E).

      In addition, we have thoroughly revised the writing throughout the manuscript to enhance clarity and precision in the description of our findings.

      Reviewer #3 (Public review): 

      Summary: 

      The authors convincingly demonstrate that a population of CCK+ spinal neurons in the deep dorsal horn express the G protein-coupled estrogen receptor GPR30 to modulate pain sensitivity in the chronic constriction injury (CCI) model of neuropathic pain in mice. Using complementary pharmacological and genetic knockdown experiments they convincingly show that GPR30 inhibition or knockdown reverses mechanical, tactile, and thermal hypersensitivity, conditioned place aversion, and c-fos staining in the spinal dorsal horn after CCI. They propose that GPR30 mediates an increase in postsynaptic AMPA receptors after CCI using slice electrophysiology which may underlie the increased behavioral sensitivity. They then use anterograde tracing approaches to show that CCK and GPR30 positive neurons in the deep dorsal horn may receive direct connections from the primary somatosensory cortex. Chemogenetic activation of these dorsal horn neurons proposed to be connected to S1 increased nociceptive sensitivity in a GPR30-dependent manner. Overall, the data are very convincing and the experiments are well conducted and adequately controlled. However, the proposed model of descending corticospinal facilitation of nociceptive sensitivity through GPR30 in a population of CCK+ neurons in the dorsal horn is not fully supported. 

      Strengths: 

      The experiments are very well executed and adequately controlled throughout the manuscript. The data are nicely presented and supportive of a role for GPR30 signaling in the spinal dorsal horn influencing nociceptive sensitivity following CCI. The authors also did an excellent job of using complementary approaches to rigorously test their hypothesis. 

      Weaknesses: 

      The primary weakness in this manuscript involves overextending the interpretations of the data to propose a direct link between corticospinal projections signaling through GPR30 on this CCK+ population of spinal dorsal horn neurons. For example, even in the cropped images presented, GPR30 is present in many other CCK-negative neurons. Only about a quarter of the cells labeled by the anterograde viral tracing experiment from S1 are CCK+. Since no direct evidence is provided for S1 signaling through GPR30, this conclusion should be revised. 

      Thank you for your encouraging comments and critical insights.

      We fully acknowledge the concern regarding the proposed direct involvement of corticospinal projections in modulating nociceptive behavior via GPR30 in CCK+ neurons. While our anterograde tracing experiments suggest anatomical overlap, we agree that definitive evidence of functional connectivity is lacking.

      Accordingly, we have revised the Abstract, Discussion, and Graphical Abstract to present our findings more cautiously. We now describe our observations as indicating that S1 projections potentially interact with GPR30+ spinal neurons, rather than asserting a definitive functional link.

      To support this revised interpretation, we performed additional quantitative analyses examining the co-localization among S1 projections, CCK+, and GPR30+ neurons. Furthermore, we clarified that the chemogenetic activation studies targeted a mixed neuronal population and did not exclusively manipulate CCK+ neurons.

      These changes aim to better align our conclusions with the presented data and provide a more nuanced framework for future investigations.

      Reviewer #1 (Recommendations for the authors): 

      Major corrections 

      (1) Figure 2: The authors conclude that GPR30 is mainly expressed in excitatory spinal neurons because they are labeled by a virus with a Camk2 promoter. While there is evidence that Camk2 is specific to excitatory neurons in the brain, based on RNAseq datasets (e.g. Linnarsson Lab, http://mousebrain.org/adolescent/genesearch.html ) this is less clear cut within the spinal cord. A more direct way to assess the relative expression of GPR30 in excitatory versus inhibitory neurons would be to perform immunohistochemistry or FISH with GPR30/Vglut2/Vgat. 

      Alternatively, if this observation is not crucial for the overall arc of the story, I recommend the authors eliminate these data, as they do not support the idea that GPR30 is mainly in excitatory neurons. 

      We thank the reviewer for highlighting this important limitation. To strengthen our conclusion regarding the neuronal identity of GPR30-expressing cells, we performed fluorescent in situ hybridization (FISH) using vGluT2 (marker for excitatory neurons) and VGAT (marker for inhibitory neurons). The results confirmed that GPR30 is predominantly expressed in vGluT2-positive excitatory neurons within the spinal cord. These new data are presented in the revised manuscript (lines 117-120) and shown in Figure 2A-B.

      (2) (2a) Figure 2: The authors also report that GPR30 is expressed in most CCK+ spinal neurons. A more rigorous way to present the data would be to perform quantification and report the % of CCK neurons that are GPR30. 

      (2b) More importantly, it is unclear what % of GPR30 neurons are CCK+. These types of quantifications would provide useful insights into the heterogeneity of CCK and GPR30 neuron populations, and help align the findings of the behavioral pharmacology experiments using GPR30 antagonists with those of the knockdown of Gper1 in CCK spinal neurons. For instance, does a population of GPR30+/CCK- neurons exist? If so, it would be worth discussing what role (if any) that population might play in nerve injury-induced mechanical allodynia. 

      Understanding the breakdown of GPR30 populations becomes even more relevant when the authors characterize which cell types are targeted by descending projections from S1. It is clear that the vast majority of CCK+ neurons that receive descending input from S1 neurons are GPR30+, but there are many other GPR30+ neurons presented in Figure 5M that do not receive input from S1 neurons. Is this simply because only a small fraction of CCK+/GPR30+ neurons are targeted by descending S1 projections, or could they represent a distinct population of GPR30 neurons? 

      (2a) We appreciate the suggestion. Quantification showed that approximately 90% of CCK⁺ neurons express GPR30, and about 50% of GPR30⁺ neurons co-express CCK. These data are now provided in the revised Results (lines 131-132) and in Figure 2F-G.

      (2b) Indeed, our data reveal that a substantial portion of GPR30⁺ neurons do not co-express CCK. While this study focuses on GPR30 function in CCK⁺ neurons, we recognize the potential relevance of GPR30⁺/CCK⁻ populations. We have addressed this point in the Discussion (lines 303-306):

      “However, it should be noted that half of GPR30⁺ neurons are not co-localized with CCK⁺ neurons, and further studies are needed to explore the function of these GPR30⁺/CCK⁻ neurons in neuropathic pain.”

      Regarding descending input, our data in Figure 5 show that S1 projections selectively innervate a subset (~30%) of CCK⁺ neurons, most of which co-express GPR30. This suggests that S1-targeted CCK⁺/GPR30⁺ neurons may represent a functionally distinct population. We have added clarification to the revised manuscript, while acknowledging that further studies are needed to elucidate the roles of non-targeted GPR30⁺ neurons.

      (3) Throughout the manuscript both male and female mice were used in experiments. Rather than referring to male and female mice as different genders, it would be more appropriate to describe them as different sexes. 

      As suggested, we have replaced all instances of “gender” with “sex” throughout the revised manuscript.

      (4) Figure 5: To increase the ease of interpreting the figure, in panels 5J and 5N, it would be helpful to indicate directly on the figure panel which other marker was assessed in double-labeling analyses.

      We have revised Figures 5J and 5N to include clear labels identifying the markers used in double-labeling analyses, to improve interpretability.

      Minor corrections: 

      (1) Line 36, I believe the authors mean to say "GPER/GPR30 in spinal neurons", rather than just "spinal". 

      Corrected as suggested. The sentence now reads (line 34):

      “Here we showed that the membrane estrogen receptor G-protein coupled estrogen receptor (GPER/GPR30) in spinal neurons was significantly upregulated in chronic constriction injury (CCI) mice…”

      (2) There are minor grammatical errors throughout the manuscript that interfere with comprehension. Proofreading/editing of the English language use may be beneficial. 

      We have thoroughly revised the manuscript for clarity and corrected grammatical and syntactic errors to improve readability.

      (3) Line 169-170, reads "Known that EPSCs are mediated by glutamatergic receptors like AMPA receptors and several studies have been reported the relationship between GPR30 and AMPA receptor25,29". Rewriting the sentence such that it better describes what the known relationship is between GPR30 and AMPA would be helpful in setting up the rationale of the experiment in Figure 4. 

      We have rewritten this section to better clarify the rationale behind the electrophysiological experiments (lines 161-164):

      “Given that EPSCs are primarily mediated through glutamatergic receptors such as AMPA receptors, and emerging evidence suggesting that GPR30 enhances excitatory transmission by promoting clustering of glutamatergic receptor subunits, we examined whether GPR30 modulates EPSCs via AMPA receptor-dependent mechanisms.”

      (4) Line 198-199 "Then we explored the possible connections among GPR30, S1-SDH projections and CCK+ neuron." In the context of spinal circuitry, "connections" may raise the expectation that synaptic connectivity will be evaluated. What I think best describes what the authors investigated in Figure 5 is the "relationship" between GPR30, S1-SDH projections, and CCK+ neurons. 

      We have revised the sentence accordingly (lines 184-186):

      “Building on previous findings suggesting a functional interaction between S1-SDH projections and spinal CCK⁺ neurons, our current study aimed to further elucidate the structural relationship among GPR30, S1-SDH projections, and CCK⁺ neurons.”

      (5) Figure 5: To increase the ease of interpreting the figure, in panels 5J and 5N, it would be helpful to indicate directly on the figure panel which other marker was assessed in double-labeling analyses.

      We have added direct labels to figure panels to clarify double-labeled analyses in the revised Figure 5J and 5N.

      Reviewer #2 (Recommendations for the authors): 

      (1) Can the authors provide more detail about the distribution of CCK+ cells in the spinal cord and, in particular, the localization of double-stained (CCK/cfos) neurons? 

      We thank the reviewer for this suggestion. To better characterize the distribution of CCK⁺ neurons within the spinal dorsal horn (SDH), we performed immunostaining in CCK-tdTomato mice using lamina-specific markers: CGRP (lamina I), IB4 (lamina II), and NF200 (lamina III–V). Our results demonstrate that CCK⁺ neurons are primarily localized in the deeper laminae of the SDH. These findings are now described in the revised Results (lines 126–129) and shown in Figure 2E.

      In addition, we conducted c-Fos immunostaining in CCK-Ai14 mice and found increased activation of CCK⁺ neurons following CCI. This supports the involvement of CCK⁺ neurons in neuropathic pain. These data are included in the Results (lines 129–131) and Supplementary Figure S4.

      (2) Figure 2A. There is no formal quantification of the percentage of TdTomato+ neurons that are also CCK+. The description of these results is insufficient. 

      We appreciate this point and have revised the description of Figure 2A accordingly. To strengthen our analysis, we conducted additional FISH experiments with vGluT2 and VGAT probes. Quantification revealed that GPR30 is predominantly expressed in excitatory neurons (approximately 60%). These data are shown in the revised Results (lines 117-119) and Figures 2A-B and S3. This supports our conclusion that GPR30 is largely localized to excitatory spinal interneurons.

      (3) Figure 4H. What is the evidence that these are AMPA-mediated currents? This is not explained in the text. 

      Thank you for raising this point. We now provide detailed experimental procedures to clarify that the recorded EPSCs are AMPA receptor–mediated. Specifically, spinal slices from CCK-Cre mice were used, and excitatory postsynaptic currents were recorded in the presence of APV (100 μM, NMDA receptor blocker), bicuculline (20 μM, GABA_A receptor blocker), and strychnine (0.5 μM, glycine receptor blocker), ensuring that the observed currents were AMPA-dependent. These methodological details are now clearly described in the revised Results (lines 165–173) and supported by prior literature (Zhang et al., J Biol Chem 2012; Hughes et al., J Neurosci 2010).

      (1) Yan Zhang, Xiao Xiao, Xiao-Meng Zhang, Zhi-Qi Zhao, Yu-Qiu Zhang (2012). Estrogen facilitates spinal cord synaptic transmission via membrane-bound estrogen receptors: implications for pain hypersensitivity. J Biol Chem. Sep 28;287(40):33268-81.

      (2) Ethan G Hughes, Xiaoyu Peng, Amy J Gleichman, Meizan Lai, Lei Zhou, Ryan Tsou, Thomas D Parsons, David R Lynch, Josep Dalmau, Rita J Balice-Gordon (2010). Cellular and synaptic mechanisms of anti-NMDA receptor encephalitis. J Neurosci. 2010 Apr 28;30(17):5866-75.

      (4) What is the signaling mechanism leading to a larger amplitude of currents after G-1 infusion? 

      We thank the reviewer for this important question. G-1 is a selective agonist for GPR30. Based on previous studies by Luo et al. (2016), we speculate that activation of GPR30 may increase the clustering of glutamatergic receptor subunits at postsynaptic sites, thereby enhancing AMPA receptor-mediated currents. While our current study did not directly address the intracellular signaling cascade, we have incorporated this mechanistic speculation in the Discussion.

      Jie Luo, X.H., Yali Li, Yang Li, Xueqin Xu, Yan Gao, Ruoshi Shi, Wanjun Yao, Juying Liu, Changbin Ke (2016). GPR30 disrupts the balance of GABAergic and glutamatergic transmission in the spinal cord driving to the development of bone cancer pain. Oncotarget 7, 73462-73472. 10.18632/oncotarget.11867.

      (5) Figure 4I. Please include error bars. 

      We have revised Figure 4I to include error bars, as requested.

      (6) Line 198. What is the evidence that AAV2/1 EF1α FLP is an antegrade trans monosynaptic marker? 

      We thank you for this request. AAV2/1 has been widely used for anterograde monosynaptic tracing based on its properties (Wang et al., Nat Neurosci 2024; Wu et al., Neurosci Bull 2021): (1) it infects neurons at the injection site and undergoes active anterograde transport; (2) newly assembled viral particles are released at synapses and infect postsynaptic partners; (3) in the absence of helper viruses, the spread halts at the first synapse, ensuring monosynaptic restriction. We have elaborated on this in the revised manuscript (line 198), citing Wang et al. (Nat Neurosci 2024) and Wu et al. (Neurosci Bull 2021).

      (1) Hao Wang, Qin Wang, Liuzhe Cui, Xiaoyang Feng, Ping Dong, Liheng Tan, Lin Lin, Hong Lian, Shuxia Cao, Huiqian Huang, Peng Cao, Xiao-Ming Li (2024). A molecularly defined amygdala-independent tetra-synaptic forebrain-to-hindbrain pathway for odor-driven innate fear and anxiety. Nat Neurosci. 2024 Mar;27(3):514-526.

      (2) Zi-Han Wu, Han-Yu Shao, Yuan-Yuan Fu, Xiao-Bo Wu, De-Li Cao, Sheng-Xiang Yan, Wei-Lin Sha, Yong-Jing Gao, Zhi-Jun Zhang (2021). Descending Modulation of Spinal Itch Transmission by Primary Somatosensory Cortex. Neurosci Bull. 2021 Sep;37(9):1345-1350.

      (7) Figure 5G. I do not understand the logic of this experiment. A Cre AAV is injected in the S1 cortex. Why should this lead to the expression of tdTomato on a downstream (postsynaptic?) neuron? The authors should quote the literature that supports this anterograde transsynaptic transport.

      We appreciate this question. As described in previous studies (e.g., Wu et al., Neurosci Bull 2021), AAV2/1-Cre injected into the S1 cortex leads to Cre expression in projection targets due to transsynaptic anterograde transport. Subsequent injection of a Cre-dependent AAV (AAV2/9-DIO-mCherry) into the spinal cord enables specific labeling of postsynaptic neurons that receive input from S1. We have clarified this mechanism in line 206 and provided the appropriate citation.

      Zi-Han Wu, Han-Yu Shao, Yuan-Yuan Fu, Xiao-Bo Wu, De-Li Cao, Sheng-Xiang Yan, Wei-Lin Sha, Yong-Jing Gao, Zhi-Jun Zhang (2021). Descending Modulation of Spinal Itch Transmission by Primary Somatosensory Cortex. Neurosci Bull. 2021 Sep;37(9):1345-1350.

      (8) The same question arises when interpreting the results obtained in Figure 6.

      We thank the reviewer for the question, and we have addressed it in point (7).

      (9) Line 257. How do the authors envision that estrogen would change its modulation of GPR30 under basal and neuropathic conditions? Is there any evidence for this speculation? 

      We thank the reviewer for raising this thoughtful question. In the current study, we focused on pharmacologically manipulating GPR30 activity via its selective agonist and antagonist. We did not directly investigate how endogenous estrogen regulates GPR30 under physiological and neuropathic states. We have recognized this limitation and highlighted the need for future research to investigate this regulatory mechanism.

      (10-20) In my opinion, the entire manuscript needs a careful revision of the English language. While one can follow the text, it contains numerous grammatical and syntactic errors that make the reading far from enjoyable. I am highlighting just a few of the many errors. 

      We appreciate the reviewer’s honest assessment. The manuscript has undergone thorough language editing by a native English speaker to correct grammatical errors, improve clarity, and enhance overall readability. We also restructured several sections, particularly the Discussion, to improve logical flow.

      (21) The discussion of results is a bit disorganized, with disconnected sentences and statements, and somewhat repetitive. For example, lines 303 to 306 lack adequate flow. It is also quite long and includes general statements that add little to the discussion of the new findings (lines 326-333). 

      We agree and have revised the Discussion extensively. Disconnected or repetitive sentences (e.g., lines 303-306, 326-333) have been removed or rewritten. For instance, we added a new transitional paragraph (lines 307-311) to improve flow:

      “Abnormal activation of neurons in the SDH is a key contributor to hyperalgesia, and enhanced excitatory synaptic transmission is a major mechanism driving increased neuronal excitability. Therefore, we evaluated excitatory postsynaptic currents (EPSCs) and observed increased amplitudes in CCK⁺ neurons following CCI, suggesting elevated excitability in these neurons.”

      We also removed redundant generalizations to maintain a focused discussion of our novel findings.

      Reviewer #3 (Recommendations for the authors): 

      (1) What is the distribution of GPR30 throughout the spinal cord and DRG? The authors demonstrate that this can overlap with a CCK+ population, but there are many GPR30+ and CCK negative neurons, even in the cropped images presented. It would be helpful to quantify the colocalization with CCK. 

      We thank the reviewer for this important point. As shown in the revised manuscript, GPR30 is expressed in both the spinal cord and dorsal root ganglia (DRG). However, our updated data (Figure 1B) demonstrate that Gper1 mRNA levels in the DRG are not significantly altered after CCI, suggesting a limited involvement of DRG GPR30 in neuropathic pain. These results are described in the revised Results (line 94).

      Regarding spinal co-expression, we performed a detailed quantification. Approximately 90% of CCK⁺ neurons express GPR30, while about 50% of GPR30⁺ neurons are CCK⁺. These co-localization results are now included in the revised Results and presented in Figure 2G.
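
      The two percentages above use different denominators, which is why they can differ so much. A minimal sketch of the arithmetic follows; the cell counts are hypothetical and chosen only to reproduce the reported proportions, and the function name is illustrative rather than taken from the study.

```python
def colocalization_percentages(n_a, n_b, n_both):
    """Return (% of A cells that are also B, % of B cells that are also A)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("cell counts must be positive")
    if n_both < 0 or n_both > min(n_a, n_b):
        raise ValueError("double-labeled count must lie between 0 and the smaller population")
    return 100.0 * n_both / n_a, 100.0 * n_both / n_b


# Hypothetical counts (not data from the study).
n_cck = 100      # CCK+ neurons counted
n_gpr30 = 180    # GPR30+ neurons counted
n_double = 90    # neurons positive for both markers

pct_cck_with_gpr30, pct_gpr30_with_cck = colocalization_percentages(
    n_cck, n_gpr30, n_double
)
print(f"{pct_cck_with_gpr30:.0f}% of CCK+ neurons express GPR30")
print(f"{pct_gpr30_with_cck:.0f}% of GPR30+ neurons are CCK+")
```

      With these assumed counts, the same 90 double-labeled cells are most of the CCK⁺ population but only half of the larger GPR30⁺ population.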

      (2) It is clear that CCI and GPR30 influence excitatory synaptic transmission in CCK+ neurons. However, these experiments do not fully support the authors' claims of a postsynaptic upregulation of AMPARs. Comparing amplitudes and frequencies of spontaneous EPSCs cannot necessarily distinguish a pre- vs postsynaptic change since some of these EPSCs can arise from spontaneous action potential firing. I suggest revising this conclusion. 

      We appreciate these insightful comments. We fully agree that our data from spontaneous EPSC recordings (sEPSCs) in CCK⁺ neurons are not sufficient to distinguish between pre- and postsynaptic mechanisms, as sEPSCs may include spontaneous presynaptic activity. Therefore, we have revised the text throughout the manuscript to avoid overstating conclusions related to postsynaptic AMPA receptor upregulation.

      (3) What is the rationale for the evoked EPSC experiments from electrical stimulation in "the deep laminae of SDH?" I do not think that this experiment can rule out a presynaptic contribution of GPR30 to the evoked responses, particularly if these are Gs-coupled at presynaptic terminals. Paired-pulse stimulations could help answer this question, otherwise, alternative interpretations, also related to the point above, should be provided. 

      We thank the reviewer for this thoughtful critique. Indeed, electrical stimulation of the deep SDH laminae does not exclude presynaptic involvement, especially considering that GPR30 is a G protein–coupled receptor (GPCR) and could act presynaptically. We agree that paired-pulse ratio (PPR) analysis would be more informative in distinguishing pre- from postsynaptic effects, but this was not performed due to technical limitations in our current experimental setup.

      Accordingly, we have revised our interpretations in both the Results and Discussion to acknowledge that our data do not rule out presynaptic contributions. We now state that GPR30 activation enhances EPSCs in CCK⁺ neurons, while further studies are needed to dissect the precise site of action.
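
      For readers unfamiliar with the paired-pulse ratio mentioned above, a minimal sketch of the conventional calculation is given here. The amplitudes are hypothetical, and the interpretation in the comments is the standard heuristic, not a result from this study.

```python
def paired_pulse_ratio(first_amplitude, second_amplitude):
    """Ratio of the second evoked EPSC amplitude to the first.

    Amplitudes may be given as negative inward currents; only magnitudes are used.
    """
    if first_amplitude == 0:
        raise ValueError("first EPSC amplitude must be non-zero")
    return abs(second_amplitude) / abs(first_amplitude)


# Hypothetical peak amplitudes in pA (not data from the study).
baseline = paired_pulse_ratio(-100.0, -80.0)
after_agonist = paired_pulse_ratio(-150.0, -120.0)

# Conventional heuristic: a larger EPSC with an unchanged ratio is more
# consistent with a postsynaptic change, whereas a shift in the ratio
# suggests altered presynaptic release probability.
print(baseline, after_agonist)
```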

      (4) I appreciate the challenging nature of the trans-synaptic viral labeling approaches, but the chemogenetic and Gper knockdown experiments do not selectively target this CCK+ population of deep dorsal horn neurons. The data are clear that each of these components (descending corticospinal projections, CCK neurons, and GPR30) can modulate nociceptive hypersensitivity, but I do not agree with the overall conclusion that each of these is directly linked as the authors propose. I recommend revising the overall conclusion and title to reflect the convincing data presented.

      We thank the reviewer for this critical observation. We agree that while our data show functional roles for descending cortical input, CCK⁺ neurons, and GPR30 in modulating pain hypersensitivity, the evidence does not establish a definitive direct circuit integrating all three components.

      In response, we have revised our conclusions to reflect this limitation. Specifically, we avoided claiming a direct functional link among S1 projections, CCK⁺ neurons, and GPR30. Instead, we now propose that GPR30 modulates neuropathic pain primarily through its action in CCK⁺ spinal neurons, with potential involvement of descending facilitation from the somatosensory cortex.

      Additionally, we have revised the manuscript title to better reflect our mechanistic focus: “GPR30 in spinal CCK-positive neurons modulates neuropathic pain.”

      Minor Corrections

      (1) The authors should refer to mice by sex, not gender. 

      Corrected throughout the manuscript.

      (2) Page 9, line 195: "significantly" is used to refer to co-localization of 28.1%. What is this significant to? 

      We have revised the sentence to accurately describe the observed percentage, without implying statistical significance:

      “Our co-staining results revealed that a high proportion of CCK⁺ S1-SDH postsynaptic neurons expressed GPR30” (line 198-199).

      (3) I recommend modifying some of the transition phrases like "by the way," "what's more," and "besides". 

      All informal expressions have been replaced with academic alternatives including “Furthermore,” “Additionally,” and “Moreover.”

      (4) Additional guides to mark specific laminae in the dorsal horn would be useful. 

      We added immunostaining with laminar markers (CGRP for lamina I and NF200 for lamina III–V), and these data are now shown in Figure 2E and described in the Results (lines 126-129).

      (5) Page 5, line 115: immunochemistry should be immunohistochemistry. 

      Corrected as suggested.

      (6) Page 6, line 136: "Confirming the structural connections" was not demonstrated here. Perhaps co-localization between GPR30 and CCK+.

      The text was revised to “To functionally interrogate GPR30 and CCK⁺ neurons in neuropathic pain...” (line 133).

      (7) Page 8, line 166: unsure what "took and important role" means. 

      This phrasing was corrected for clarity and replaced with an accurate scientific description.

      (8) Page 8, line 168: "IPSCs of spinal CCK+ neurons" implies that they are sending inhibitory inputs. 

      We revised the term to “EPSCs” to correctly reflect excitatory synaptic currents in CCK⁺ neurons.

      (9) Page 8, line 169: "Known that EPSCs" is missing an introductory phrase. 

      The sentence was rewritten to include an appropriate introductory clause (lines 161–164):

      “Given that EPSCs are primarily mediated through glutamatergic receptors such as AMPA receptors...”

      (10) Page 10, line 227 and 228: "adequately" and "sufficiently" should be adequate and sufficient. 

      We corrected these terms to the proper adjective forms: “adequate” and “sufficient” (lines 224-225).

    1. Back to the university: what are you supposed to be learning here? At minimum, you’ll probably pick up bits of knowledge here and there, but an effective education isn’t just about memorizing facts. It’s about not only learning that but also learning how, especially given Cal Poly’s motto of “Learn by Doing.” But if you rely on using AI for your coursework, you might not even be learning that some particular thing is true. With AI and search engines, you can still access that knowledge you’re supposed to be learning, but being able to access x isn’t the same as internalizing x; the latter is much more useful, as we’ll discuss more below in part 3, “Future risks.”

      I think this is an important distinction. Just being able to access information with AI isn’t the same as actually learning and internalizing it. Memorizing facts may not be the point of education, but being able to apply and use knowledge is. If we skip the process of working through ideas ourselves, then we risk missing the deeper how of learning.

    1. Suzanne Briet: Physical evidence as document

      In part, I appreciate the pragmatism of Briet's approach. It would certainly make a cataloger's life easier to view documents in this way and, on its surface, it makes a tremendous amount of "sense".

      However, I can't help but feel this view is a little too limited. Certainly, it seems to me, the antelope itself would be a source of information. In one way it is an example of what an "antelope" is, but it is also an individual and, beyond that, an individual at a certain snapshot in time.

      In a very broad view, we can think that nothing is truly permanent as all things are constantly changing. I think it depends so much on how we observe and questions of time scale.

      Human beings are not even exactly what we were in the past. We grow (both physically and in other ways), we change (we age, we change our minds, we change our clothes, we get tattoos, we erase tattoos) and eventually we, as an individual, will cease to exist by any observable means (depending on your belief system) other than by the "things" we leave behind.

      We also continue to exist, in a sense, in the minds of those who knew us, but their memories cannot be a whole picture of who we were and certainly no one may know truly how we are inside our own heads. Others will certainly bring their own biases or preferences to their memories of us which may or may not be a complete picture of who we were.

    1. Author response:

      Reviewer #1 (Public review):

      In this important study, the authors develop a suite of machine vision tools to identify and align fluorescent neuronal recording images in space and time according to neuron identity and position. The authors provide compelling evidence for the speed and utility of these tools. While such tools have been developed in the past (including by the authors), the key advancement here is the speed and broad utility of these new tools. While prior approaches based on steepest descent worked, they required hundreds of hours of computational time, while the new approaches outlined here are >600-fold faster. The machine vision tools here should be immediately useful to readers specifically interested in whole-brain C. elegans data, but also for more general readers who may be interested in using BrainAlignNet for tracking fluorescent neuronal recordings from other systems.

      I really enjoyed reading this paper. The authors had several ground truth examples to quantify the accuracy of their algorithms and identified several small caveats users should consider when using these tools. These tools were primarily developed for C. elegans, an animal with stereotyped development, but whose neurons can be variably located due to internal motion of the body. The authors provide several examples of how BrainAlignNet reliably tracked these neurons over space and time. Neuron identity is also important to track, and the authors showed how AutoCellLabeler can reliably identify neurons based on their fluorescence in the NeuroPAL background. A challenge with NeuroPAL, though, is the high expression of several fluorophores, which compromises behavioral fidelity. The authors provide some possible avenues where this problem can be addressed by expressing fewer fluorophores. While using all four channels provided the best performance, only using the tagRFP and CyOFP channels was sufficient for performance that was close to full performance using all 4 NeuroPAL channels. This result indicates that the development of future lines with less fluorophore expression could be sufficient for reliable neuronal identification, which would decrease the genetic load on the animal, but also open other fluorescent channels that could be used for tracking other fluorescent tools/markers. Even though these tools were developed for C. elegans specifically, they showed BrainAlignNet can be applied to other organisms as well (in their case, the cnidarian C. hemisphaerica), which broadens the utility of their tools.

      Strengths:

      (1) The authors have a wealth of ground-truth training data to compare their algorithms against, and provide a variety of metrics to assess how well their new tools perform against hand annotation and/or prior algorithms.

      (2) For BrainAlignNet, the authors show how this tool can be applied to other organisms besides C. elegans.

      (3) The tools are publicly available on GitHub, which includes useful README files and installation guidance.

      We thank the reviewer for noting these strengths of our study.

      Weaknesses:

      (1) Most of the utility of these algorithms is for C. elegans specifically. Testing their algorithms (specifically BrainAlignNet) on more challenging problems, such as whole-brain zebrafish, would have been interesting. This is a very, very minor weakness, though.

      We appreciate the reviewer’s point that expanding to additional animal models would be valuable. In the study, we have so far tested our approaches on C. elegans and jellyfish. Given that this is considered a ‘very, very minor weakness’ and that it does not directly affect the results or analyses in the paper, we think this might be better addressed in future work.

      (2) The tools are benchmarked against their own prior pipeline, but not against other algorithms written for the same purpose.

      We agree that it would be valuable to benchmark other labs’ software pipelines on our datasets. We note that most papers in this area, which describe those pipelines, provide the same performance metrics that we do (accuracy of neuron identification, tracking accuracy, etc), so a crude, first-order comparison can be obtained by comparing the numbers in the papers. But, we agree that a rigorous head-to-head comparison would require applying these different pipelines to a common dataset. We considered performing these analyses, but we were concerned that using other labs’ software ‘off the shelf’ on our data might not represent those pipelines in their best light when compared to our pipeline that was developed with our data in mind. Data from different microscopy platforms can be surprisingly different and we wouldn’t want to perform an analysis that had this bias. Therefore, we feel that this comparison would be best pursued by all of these labs collaboratively (so that they can each provide input on how to run their software optimally). Indeed, this is an important area for future study. In this spirit, we have been sharing our eat-4::GFP datasets (that permit quantification of tracking accuracy) with other labs looking for additional ways to benchmark their tracking software.

      We also note that there are not really any pipelines to directly compare against CellDiscoveryNet, as we are not aware of any other fully unsupervised approach for neuron identification in C. elegans.

      (3) Considerable pre-processing was done before implementation. Expanding upon this would improve accessibility of these tools to a wider audience.

      Indeed, some pre-processing was performed on images before registration and neuron identification -- understanding these nuances can be important. The pre-processing steps are described in the Results section and detailed in the Methods. They are also all available in our open-source software. For BrainAlignNet, the key steps were: (1) selecting image registration problems, (2) cropping, and (3) Euler alignment. Steps (1) and (3) were critically important and are extensively discussed in the Results and Discussion sections of our study (lines 142-144, 218-234, 318-323, 704-712). Step (2) is standard in image processing. For AutoCellLabeler and CellDiscoveryNet, the pre-processing was primarily to align the 4 NeuroPAL color channels to each other (i.e. make sure the blue/red/orange/etc channels for an animal are perfectly aligned). This is also just a standard image processing step to ensure channel alignment. Thus, the more “custom” pre-processing steps were extensively discussed in the study and the more “common” steps are still described in the Methods. The implementation of all steps is available in our open-source software.

      Reviewer #2 (Public review):

      Summary:

      The paper introduces a pipeline for analyzing brain imaging of freely moving animals by registering deforming tissues and maintaining consistent cell identities over time. The pipeline consists of three neural networks that are built upon existing models: BrainAlignNet for non-rigid registration, AutoCellLabeler for supervised annotation of over 100 neuronal types, and CellDiscoveryNet for unsupervised discovery of cell identities. The ambition of the work is to enable high-throughput and largely automated pipelines for neuron tracking and labeling in deforming nervous systems.

      Strengths:

      (1) The paper tackles a timely and difficult problem, offering an end-to-end system rather than isolated modules.

      (2) The authors report high performance within their dataset, including single-pixel registration accuracy, nearly complete neuron linking over time, and annotation accuracy that exceeds individual human labelers.

      (3) Demonstrations across two organisms suggest the methods could be transferable, and the integration of supervised and unsupervised modules is of practical utility.

      We thank the reviewer for noting these strengths of our study.

      Weaknesses:

      (1) Lack of solid evaluation. Despite strong results on their own data, the work is not benchmarked against existing methods on community datasets, making it hard to evaluate relative performance or generality.

      We agree that it would be valuable to benchmark many labs’ software pipelines on some common datasets, ideally from several different research labs. We note that most papers in this area, which describe the other pipelines that have been developed, provide the same performance metrics that we do (accuracy of neuron identification, tracking accuracy, etc), so a crude, first-order comparison can be obtained by comparing the numbers in the papers. But, we agree that a rigorous head-to-head comparison would require applying these different pipelines to a common dataset. We considered performing these analyses, but we were concerned that using other labs’ software ‘off the shelf’ and comparing the results to our pipeline (where we have extensive expertise) might bias the performance metrics in favor of our software. Therefore, we feel that this comparison would be best pursued by all of these labs collaboratively (so that they can each provide input on how to run their software optimally). Indeed, this is an important area for future study. In this spirit, we have been sharing our eat-4::GFP datasets (that permit quantification of tracking accuracy) with other labs looking for additional ways to benchmark their tracking software.

      We also note that there are not really any pipelines to directly compare against CellDiscoveryNet, as we are not aware of any other fully unsupervised approach for neuron identification in C. elegans.

      (2) Lack of novelty. None of the three models incorporates state-of-the-art advances from its respective field. BrainAlignNet does not learn from the latest optical flow literature, relying instead on relatively conventional architectures. AutoCellLabeler does not utilize the advanced medNeXt3D architectures for supervised semantic segmentation. CellDiscoveryNet is presented as unsupervised discovery but relies on standard clustering approaches, with limited evaluation on only a small test set.

      We appreciate that the machine learning field moves fast. Our goal was not to invent entirely novel machine learning tools, but rather to apply and optimize tools for a set of challenging, unsolved biological problems. We began with the somewhat simpler architectures described in our study and were largely satisfied with their performance. It is conceivable that newer approaches would perhaps lead to even greater accuracy, flexibility, and/or speed. But, oftentimes, simple or classical solutions can adequately resolve specific challenges in biological image processing.

      Regarding CellDiscoveryNet, our claim of unsupervised training is precise: CellDiscoveryNet is trained end-to-end only on raw images, with no human annotations, pseudo-labels, external classifiers, or metadata used for training, model selection, or early stopping. The loss is defined entirely from the input data (no label signal). By standard usage in machine learning, this constitutes unsupervised (often termed “self-supervised”) representation learning. Downstream clustering is likewise unsupervised, consuming only image pairs registered by CellDiscoveryNet and neuron segmentations produced by our previously-trained SegmentationNet (which provides no label information).
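
      To illustrate what a loss "defined entirely from the input data" can look like, here is a minimal sketch of an image-similarity loss that compares a fixed image with a warped moving image and uses no labels. Normalized cross-correlation is an assumption chosen for illustration; the response does not state which similarity term CellDiscoveryNet uses, and the function name and flattened-list image representation are hypothetical.

```python
import math


def ncc_loss(fixed, warped):
    """1 - normalized cross-correlation of two equal-length intensity lists.

    Returns about 0 for perfectly correlated images and about 2 for
    perfectly anti-correlated ones. No annotation enters the computation.
    """
    if len(fixed) != len(warped) or not fixed:
        raise ValueError("images must be non-empty and the same size")
    mean_f = sum(fixed) / len(fixed)
    mean_w = sum(warped) / len(warped)
    f = [x - mean_f for x in fixed]
    w = [x - mean_w for x in warped]
    denom = math.sqrt(sum(x * x for x in f) * sum(x * x for x in w))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant image")
    return 1.0 - sum(a * b for a, b in zip(f, w)) / denom


# Toy flattened "images" (hypothetical intensities).
fixed = [0.0, 1.0, 2.0, 3.0]
well_aligned = [10.0, 12.0, 14.0, 16.0]   # same pattern, different brightness
misaligned = [3.0, 2.0, 1.0, 0.0]         # reversed pattern

print(ncc_loss(fixed, well_aligned))
print(ncc_loss(fixed, misaligned))
```

      In a registration network of this general kind, such a term would typically be minimized with respect to the predicted deformation, usually alongside a smoothness penalty; that pairing is likewise a common convention rather than a detail confirmed by the source.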

      (3) Lack of robustness. BrainAlignNet requires dataset-specific training and pre-alignment strategies, limiting its plug-and-play use. AutoCellLabeler depends heavily on raw intensity patterns of neurons, making it brittle to pose changes. By contrast, current state-of-the-art methods incorporate spatial deformation atlases or relative spatial relationships, which provide robustness across poses and imaging conditions. More broadly, the ANTSUN 2.0 system depends on numerous manually tuned weights and thresholds, which reduces reproducibility and generalizability beyond curated conditions.

      Regarding BrainAlignNet: we agree that we trained on each species’ own data (worm, jellyfish), and based on our current state of knowledge we would suggest that other labs working on new organisms do the same. It would be fantastic if there were an alignment approach that generalized to all possible cases of non-rigid registration in all animals; this is an important area for future study. We also agree that pre-alignment was critical in worms and jellyfish, which we discuss extensively in our study (lines 142-144, 318-321, 704-712).

      Regarding AutoCellLabeler: the animals were not recorded in any standardized pose and were not aligned to each other beforehand – they were basically in a haphazard mix of poses and we used image augmentation to allow the network to generalize to other poses, as described in our study. It is still possible that AutoCellLabeler is somehow brittle to pose changes (e.g. perhaps extremely curved worms) – while we did not detect this in our analyses, we did not systematically evaluate performance across all possible poses. However, we do note that this network was able to label images taken from freely-moving worms, which by definition exhibit many poses (Figure 5D, lines 500-525); aggregating the network’s performance across freely-moving data points allowed it to nearly match its performance on high-SNR immobilized data. This suggests a degree of robustness of the AutoCellLabeler network to pose changes.

      Regarding ANTSUN 2.0: we agree that there are some hyperparameters (described in our study) that affect ANTSUN performance. We agree that it would be worthwhile to fully automate setting these in future iterations of the software.

      Evaluation:

      To make the evaluation more solid, it would be great for the authors to (1) apply the new method on existing datasets and (2) apply baseline methods on their own datasets. Otherwise, without comparison, it is unclear if the proposed method is better or not. The following papers have public challenging tracking data: https://elifesciences.org/articles/66410, https://elifesciences.org/articles/59187, https://www.nature.com/articles/s41592-023-02096-3.

      Please see our response to your point (1) under Weaknesses above.

      Methodology:

      (1) The model innovations appear incrementally novel relative to existing work. The authors should articulate what is fundamentally different (architectural choices, training objectives, inductive biases) and why those differences matter empirically. Ablations isolating each design choice would help.

      There are other efforts in the literature to solve the neuron tracking and neuron identification problems in C. elegans (please see paragraphs 4 and 5 of our Introduction, which are devoted to describing these). However, they are quite different in the approaches that they use, compared to our study. For example, for neuron tracking they use t->t+1 methods, or model neurons as point clouds, etc (a variety of approaches have been tried). For neuron identification, they work on extracted features from images, or use statistical approaches rather than deep neural networks, etc (a variety of approaches have been tried). Our assessment is that each of these diverse approaches has strengths and drawbacks; we agree that a meta-analysis of the design choices used across studies could be valuable.

      We also note that there are not really any pipelines to directly compare against CellDiscoveryNet, as we are not aware of any other fully unsupervised approach for neuron identification in C. elegans.

      (2) The pipeline currently depends on numerous manually set hyperparameters and dataset-specific preprocessing. Please provide principled guidelines (e.g., ranges, default settings, heuristics) and a robustness analysis (sweeps, sensitivity curves) to show how performance varies with these choices across datasets; wherever possible, learn weights from data or replace fixed thresholds with data-driven criteria.

      We agree that there are some ANTSUN 2.0 hyperparameters (described in our Methods section) that could affect the quality of neuron tracking. It would be worthwhile to fully automate setting these in future iterations of the software, ensuring that the hyperparameter settings are robust to variation in data/experiments.

      Appraisal:

      The authors partially achieve their aims. Within the scope of their dataset, the pipeline demonstrates impressive performance and clear practical value. However, the absence of comparisons with state-of-the-art algorithms such as ZephIR, fDNC, or WormID, combined with small-scale evaluation (e.g., ten test volumes), makes the strength of evidence incomplete. The results support the conclusion that the approach is useful for their lab's workflow, but they do not establish broader robustness or superiority over existing methods.

      We wish to remind the reviewer that we developed BrainAlignNet for use in worms and jellyfish. These two animals have different distributions of neurons and radically different anatomy and movement patterns. Data from the two organisms was collected in different labs (Flavell lab, Weissbourd lab) on different types of microscopes (spinning disk, epifluorescence). We believe that this is a good initial demonstration that the approach has robustness across different settings.

      Regarding comparisons to other labs’ C. elegans data processing pipelines, we agree that it will be extremely valuable to compare performance on common datasets, ideally collected in multiple different research labs. But we believe this should be performed collaboratively, with input from each lab, so that each software package can be shown in its best light, as described above.

      Impact:

      Even though the authors have released code, the pipeline requires heavy pre- and post-processing with numerous manually tuned hyperparameters, which limits its practical applicability to new datasets. Indeed, even within the paper, BrainAlignNet had to be adapted with additional preprocessing to handle the jellyfish data. The broader impact of the work will depend on systematic benchmarking against community datasets and comparison with established methods. As such, readers should view the results as a promising proof of concept rather than a definitive standard for imaging in deformable nervous systems.

      Regarding worms vs jellyfish pre-processing: we actually had the exact opposite reaction to that of the reviewer. We were surprised at how similar the pre-processing was for these two very different organisms. In both cases, it was essential to (1) select appropriate registration problems to be solved; and (2) perform initialization with Euler alignment. Provided that these two challenges were solved, BrainAlignNet mostly took care of the rest. This suggests a clear path for researchers who wish to use this approach in another animal. Nevertheless, we also agree with the reviewer’s caution that a totally different use case could require some re-thinking or re-strategizing. For example, the strategy of how to select good registration problems could depend on the form of the animal’s movement.

      Reviewer #3 (Public review):

      Context:

      Tracking cell trajectories in deformable organs, such as the head neurons of freely moving C. elegans, is a challenging task due to rapid, non-rigid cellular motion. Similarly, identifying neuron types in the worm brain is difficult because of high inter-individual variability in cell positions.

      Summary:

      In this study, the authors developed a deep learning-based approach for cell tracking and identification in deformable neuronal images. Several different CNN models were trained to: (1) register image pairs without severe deformation, and then track cells across continuous image sequences using multiple registration results combined with clustering strategies; (2) predict neuron IDs from multicolor-labeled images; and (3) perform clustering across multiple multicolor images to automatically generate neuron IDs.

      Strengths:

      Directly using raw images for registration and identification simplifies the analysis pipeline, but it is also a challenging task since CNN architectures often struggle to capture spatial relationships between distant cells. Surprisingly, the authors report very high accuracy across all tasks. For example, the tracking of head neurons in freely moving worms reportedly reached 99.6% accuracy, neuron identification achieved 98%, and automatic classification achieved 93% compared to human annotations.

      We thank the reviewer for noting these strengths of our study.

      Weaknesses:

      (1) The deep networks proposed in this study for registration and neuron identification require dataset-specific training, due to variations in imaging conditions across different laboratories. This, in turn, demands a large amount of manually or semi-manually annotated training data, including cell centroid correspondences and cell identity labels, which reduces the overall practicality and scalability of the method.

      We performed dataset-specific training for image registration and neuron identification, and we would encourage new users to do the same based on our current state of knowledge. This highlights how standardization of whole-brain imaging data across labs is an important issue for our field to address and that, without it, variations in imaging conditions could impact software utility. We refer the reviewer to an excellent study by Sprague et al. (2025) on this topic, which is cited in our study.

      However, at the same time, we wish to note that it was actually reasonably straightforward to take the BrainAlignNet approach that we initially developed in C. elegans and apply it to jellyfish. Some of the key lessons that we learned in C. elegans generalized: in both cases, it was critical to select the right registration problems to solve and to preprocess with Euler registration for good initialization. Provided that those problems were solved, BrainAlignNet could be applied to obtain high-quality registration and trace extraction. Thus, our study provides clear suggestions on how to use these tools across multiple contexts.

      (2) The cell tracking accuracy was not rigorously validated, but rather estimated using a biased and coarse approach. Specifically, the accuracy was assessed based on the stability of GFP signals in the eat-4-labeled channel. A tracking error was assumed to occur when the GFP signal switched between eat-4-negative and eat-4-positive at a given time point. However, this estimation is imprecise and only captures a small subset of all potential errors. Although the authors introduced a correction factor to approximate the true error rate, the validity of this correction relies on the assumption that eat-4 neurons are uniformly distributed across the brain - a condition that is unlikely to hold.

      We respectfully disagree with this critique. We considered the alternative suggested by the reviewer (in their private comments to the authors) of comparing against a manually annotated dataset. But this annotation would require manually linking ~150 neurons across ~1600 timepoints, which would require humans to manually link neurons across timepoints >200,000 times for a single dataset. These datasets consist of densely packed neurons rapidly deforming over time in all 3 dimensions. Moreover, a single error in linking would propagate across timepoints, so the error tolerance of such annotation would be extremely low. Any such manually labeled dataset would be fraught with errors and should not be trusted. Instead, our approach relies on a simple, accurate assumption: GFP expression in a neuron should be roughly constant over a 16 min recording (after bleach correction) and the levels will be different in different neurons when it is sparsely expressed. Because all image alignment is done in the red channel, the pipeline never “peeks” at the GFP until it is finished with neuron alignment and tracking. The eat-4 promoter was chosen for GFP expression because (a) the nuclei labeled by it are scattered across the neuropil in a roughly salt-and-pepper fashion – a mixture of eat-4-positive and eat-4-negative neurons is found throughout the head; and (b) it is in roughly 40% of the neurons, giving very good overall coverage. Our view is that this approach of labeling subsets of neurons with GFP should become the standard in the field for assessing tracking accuracy – it has a simple, accurate premise; is not susceptible to human labeling error; is straightforward to implement; and, since it does not require manual labeling, is easy to scale to multiple datasets. We do note that it could be further strengthened by using multiple strains each with different ‘salt-and-pepper’ GFP expression patterns.
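
      As an illustration only, the logic of this GFP-based check can be sketched in a few lines of Python. This is not the code used in the study: the function name, the intensity threshold, and in particular the correction factor are assumptions made for the example. The correction assumes that a mis-linked time point picks up the signal of a random other neuron, so it is visible only when the two neurons differ in GFP status, which happens with probability 2p(1-p) for a GFP-positive fraction p.

```python
import numpy as np


def estimate_tracking_error(gfp, threshold, frac_positive):
    """Estimate a tracking error rate from per-neuron GFP traces.

    gfp           : array (n_neurons, n_timepoints), bleach-corrected GFP
    threshold     : hypothetical intensity cutoff separating GFP+ from GFP-
    frac_positive : fraction of neurons expressing GFP (e.g. roughly 0.4)
    """
    gfp = np.asarray(gfp, dtype=float)
    # Each track's GFP status is taken from its median over the recording.
    track_is_positive = np.median(gfp, axis=1, keepdims=True) > threshold
    # A time point whose status disagrees with the track's status is
    # counted as a candidate mis-link.
    mismatch = (gfp > threshold) != track_is_positive
    detected_rate = float(mismatch.mean())
    # Assumption: a mis-link swaps in a random other neuron, so it is only
    # detectable when the two neurons differ in GFP status.
    detectable_fraction = 2.0 * frac_positive * (1.0 - frac_positive)
    return detected_rate, detected_rate / detectable_fraction
```

      The second returned value is only as good as the random-mixing assumption behind the correction; if GFP-positive neurons were spatially clustered, nearby swaps would be detected less often and the true error rate would be underestimated.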

      (3) Figure S1F demonstrates that the registration network, BrainAlignNet, alone is insufficient to accurately align arbitrary pairs of C. elegans head images. The high tracking accuracy reported is largely due to the use of a carefully designed registration sequence, matching only images with similar postures, and an effective clustering algorithm. Although the authors address this point in the Discussion section, the abstract may give the misleading impression that the network itself is solely responsible for the observed accuracy.

      Our tracking accuracy requires (a) a careful selection of registration problems, (b) highly accurate registration of the selected registration problems, and (c) effective clustering. We extensively discussed the importance of choosing the registration problems in the Results section (lines 218-234 and 318-321), Discussion section (lines 704-708), and Methods section (lines 955-970 and 1246-1250) of our paper. We also discussed the clustering aspect in the Results section (lines 247-259), Discussion section (lines 708-712), and Methods section (lines 1162-1206). In addition, our abstract states that BrainAlignNet needs to be “incorporated into an image analysis pipeline,” to inform readers that other aspects of image analysis need to occur (beyond BrainAlignNet) to perform tracking.

      (4) The reported accuracy for neuron identification and automatic classification may be misleading, as it was assessed only on a subset of neurons labeled as "high-confidence" by human annotators. Although the authors did not disclose the exact proportion, various descriptions (such as Figure 4f) imply that this subset comprises approximately 60% of all neurons. While excluding uncertain labels is justifiable, the authors highlight the high accuracy achieved on this subset without clearly stating that the reported performance pertains only to neurons that are relatively easy to identify. Furthermore, they do not report what fraction of the total neuron population can be accurately identified using their methods, an omission of critical importance for prospective users.

      The reviewer raises two points here: (1) whether AutoCellLabeler accuracy is impacted by ease of human labeling; and (2) what fraction of total neurons are identified. We address them one at a time.

      Regarding (1), we believe that the reviewer overlooked an important analysis in our study. Indeed, to assess its performance, one can only compare AutoCellLabeler’s output against accurate human labels – there is simply no way around it. However, we noted that AutoCellLabeler was identifying some neurons with high confidence even when humans had low confidence or had not even tried to label the neurons (Fig. 4F). To test whether these were in fact accurate labels, we asked additional human labelers to spend extra time trying to label a random subset of these neurons (they were of course blinded to the AutoCellLabeler label). We then assessed the accuracy of AutoCellLabeler against these new human labels and found that they were highly accurate (Fig. 4H). This suggests that AutoCellLabeler has strong performance even when some human labelers find it challenging to label a neuron. However, we agree that we have not yet been able to quantify AutoCellLabeler performance on the small set of neuron classes that humans are unable to identify across datasets.

      Regarding (2), we agree that knowing how many neurons are labeled by AutoCellLabeler is critical. For example, labeling only 3 neurons per animal with 100% accuracy isn’t very helpful. We wish to emphasize that we did not omit this information: we reported the number of neurons labeled for every network that we characterized in the study, alongside the accuracy of those labels (please see Figures 4I, 5A, and 6G; Figure 4I also shows the number of human labels per dataset, which the reviewer requested). We also showed curves depicting the tradeoff between accuracy and number of neurons labeled, which fully captures how we balanced accuracy and number of neurons labeled (Figures 5D and S4A). It sounds like the reviewer also wanted to know the total number of recorded neurons. The typical number of recorded neurons per dataset can also be found in the paper in Fig. 2E.
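
      For readers unfamiliar with this kind of tradeoff curve, a minimal sketch of how one can be computed is given below. It is not taken from the study: the function name and inputs are hypothetical, and it simply assumes that each predicted label comes with a confidence score and a flag saying whether it matched the human label.

```python
import numpy as np


def accuracy_coverage_curve(confidence, correct, thresholds):
    """Return (threshold, n_labeled, accuracy) for each confidence cutoff.

    confidence : per-neuron confidence of the predicted label
    correct    : per-neuron flag, True if the prediction matched the human label
    thresholds : confidence cutoffs to sweep
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rows = []
    for t in thresholds:
        keep = confidence >= t  # only labels at or above the cutoff are kept
        n_labeled = int(keep.sum())
        accuracy = float(correct[keep].mean()) if n_labeled else float("nan")
        rows.append((float(t), n_labeled, accuracy))
    return rows
```

      Raising the cutoff typically increases accuracy while reducing the number of neurons labeled, which is why both quantities need to be reported together.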

    1. Reviewer #1 (Public review):

      Summary:

      This study focuses on characterizing the EEG correlates of item-specific proportion congruency effects. In particular, two types of learned associations are characterized: associations between stimulus features and control states (SC), and associations between stimulus features and responses (SR). Decoding methods are used to identify SC and SR correlates and to determine whether they have similar topographies and dynamics.

      The results suggest SC and SR associations are simultaneously coactivated and have shared topographies, with the inference being that these associations may share a common generator.

      Strengths:

      Fearless, creative use of EEG decoding to test tricky hypotheses regarding latent associations.

      Nice idea to orthogonalize the ISPC condition (MC/MI) from stimulus features.

      Weaknesses:

      (1) I'm relatively concerned that these results may be spurious. I hope to be proven wrong, but I would suggest taking another look at a few things.

      While a nice idea in principle, the ISPC manipulation seems to be quite confounded with the trial number. E.g., color-red is MI only during phase 2, and is MC primarily only during Phase 3 (since phase 1 is so sparsely represented). In my experience, EEG noise is highly structured across a session and easily exploited by decoders. Plus, behavior seems quite different between Phase 2 and Phase 3. So, it seems likely that the classes you are asking the decoder to separate are highly confounded with temporally structured noise.

      I suggest thinking of how to handle this concern in a rigorous way. A compelling way to address this would be to perform "cross-phase" decoding, however I am not sure if that is possible given the design.

      The time courses also seem concerning. What are we to make of the SR and SC timecourses, which have aggregate decoding dynamics that look to be <1Hz?

      Some sanity checks would be one place to start. Time courses were baselined, but this is often not necessary with decoding; it can cause bias (10.1016/j.jneumeth.2021.109080), and can mask deeper issues. What do things look like when not baselined? Can variables be decoded when they should not be decoded? What does cross-temporal decoding look like - everything stable across all times, etc.?

      (2) The nature of the shared features between SR and SC subspaces is unclear.

      The simulation is framed in terms of the amount of overlap, revealing the number of shared dimensions between subspaces. In reality, it seems like it's closer to 'proportion of volume shared', i.e., a small number of dominant dimensions could drive a large degree of alignment between subspaces.

      What features drive the similarity? What features drive the distinctions between SR and SC? Aside from the temporal confounds I mentioned above, is it possible that some low-dimensional feature, like EEG congruency effect (e.g., low-D ERPs associated with conflict), or RT dynamics, drives discriminability among these classes? It seems plausible to me - all one would need is non-homogeneity in the size of the congruency effect across different items (subject-level idiosyncrasies could contribute: 10.1016/j.neuroimage.2013.03.039).

      (3) The time-resolved within-trial correlation of RSA betas is a cool idea, but I am concerned it is biased. Estimating correlations among different coefficients from the same GLM design matrix is, in general, biased, i.e., when the regressors are non-orthogonal. This bias comes from the expected covariance of the betas and is discussed in detail here (10.1371/journal.pcbi.1006299). In short, correlations could be inflated due to a combination of the design matrix and the structure of the noise. The most established solution, to cross-validate across different GLM estimations, is unfortunately not available here. I would suggest that the authors think of ways to handle this issue.
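
      To make the concern in point (3) concrete, here is a minimal simulation sketch. It is not from the manuscript, and the function name and parameter values are illustrative assumptions. It relies on the standard ordinary-least-squares result that, with independent and identically distributed noise of variance σ², the estimated coefficients have covariance σ²(XᵀX)⁻¹, so two correlated regressors yield correlated estimates even when both true coefficients are zero.

```python
import numpy as np


def simulate_beta_correlation(rho=0.7, n_obs=200, n_sims=2000, seed=0):
    """Correlation between two OLS betas when the true betas are both zero.

    rho    : target correlation between the two regressors (assumed value)
    n_obs  : observations per GLM fit
    n_sims : number of independent noise realizations (e.g. trials)
    """
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n_obs)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_obs)
    X = np.column_stack([x1, x2])  # non-orthogonal design matrix

    # Pure-noise data: any correlation between the estimates is spurious.
    Y = rng.standard_normal((n_obs, n_sims))
    betas = np.linalg.pinv(X) @ Y  # shape (2, n_sims)
    empirical_r = float(np.corrcoef(betas[0], betas[1])[0, 1])

    # Correlation implied by the theoretical covariance (X'X)^-1.
    cov = np.linalg.inv(X.T @ X)
    predicted_r = float(cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1]))
    return empirical_r, predicted_r
```

      With positively correlated regressors, the estimates come out negatively correlated and closely track the value implied by the design matrix, which shows that a beta-beta correlation can arise from the design and noise structure alone.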

      (4) Are results robust to running response-locked analyses? Especially the EEG-behavior correlation. Could this be driven by different RTs across trials & trial-types? I.e., at 400 ms post-stim onset, some trials would be near or at RT/action execution, while others may not be nearly as close, and so EEG features would differ & "predict" RT.

      (5) I suggest providing more explanation about the logic of the subspace decoding method - what trial types exactly constitute the different classes, why we would expect this method to capture something useful regarding ISPC, & what this something might be. I felt that the first paragraph of the results breezes by a lot of important logic.

      In general, this paper does not seem to be written for readers who are unfamiliar with this particular topic area. If authors think this is undesirable, I would suggest altering the text.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      1. General Statements

      We thank the reviewers for providing thoughtful and constructive feedback, which will help us improve the clarity and rigor of the paper. On balance, the reviews were positive. Reviewer 1 mentioned that “This is a strong manuscript with few problems and all important findings well justified, indeed this is a nicely polished…..high-quality manuscript,” and that “this paper makes a major breakthrough, showing that cell autonomous defects in hTSCs are very likely at the heart of the pathology observed in GIN-prone murine mutants.” Reviewer 3 stated that “The study is well designed, and the manuscript is very well written. The conclusions are supported by the evidence presented.” Reviewer 2 was less enthusiastic, with the main concern being that “The paper is mostly descriptive and often quite confusing leaving one not much closer to understanding the mechanistic basis for the interesting sex-biased semi-lethal phenotype.” Reviewer 2 also felt that figure titles/section headers overstated the results, and recommended improving some technical aspects and tempering conclusions. We think the proposed edits address most of the issues raised by the reviewers, either through rewriting or by adding data as described below.

      In response to reviewer #1 comments:

      Major comments:

      • I am confused as to the basis of the sex-skewing phenomenon. Is the problem that lack of maternally loaded WT Mcm4 worsens the phenotype, or is the issue that Mcm4C3/C3 dams are less able to retain pregnancies, perhaps being a more inflammatory environment? Also, while there is quite consistent evidence for reduced viability of Mcm4C3/C3McmGt/+ progeny, especially for female progeny, how confident can we be that the genotype of the dam vs. sire is important? Notably on a Ddx58 background, the progeny of the Mcm4C3/C3 sire included seven live male Mcm4C3/C3McmGt/+ but no female.

      Regarding the first point (sex skewing only when the female is C3/C3), we also suspected either: 1) the maternal uterine environment, or 2) reduced oocyte quality. Although not reported in this manuscript, we tested #1 by performing embryo transfer experiments. Transferring 2-cell stage embryos from the sex-skewing mating to WT females did not rescue the sex bias. We then examined oocytes from C3/C3 females. We found evidence for compromised mitochondria and transcriptome disruption. However, we are not sure why this happens (poor follicle support? Oocyte-intrinsic phenomenon?). We are reserving these results and additional experiments for another paper, especially since this one mainly deals with GIN and placenta development. If the reviewers feel strongly that the embryo transfer data are crucial, we can include them.

      Regarding how confident we are that the genotype of the dam vs. sire is important, this stems from our previous paper by McNairn et al 2019 (the percentage of female C3/C3 M2/+ from sex-skewing mating is 20% compared to 60% from the reciprocal mating), which was quite dramatic. Consistent with this, MCM levels were significantly reduced in the placentae only when the dam was C3/C3 and the sire C3/+ M2/+, but not in the reciprocal cross. The reviewer makes a good observation about the Ddx58 cross; we can only hypothesize that the mutation somehow sensitizes females in this scenario and will make mention of it in the revision. We also realize that we neglected to write in Methods that the Ddx58 allele was coisogenic in the C3H background.

      • I'm not sure what Supplementary Figure 6 is showing (faster differentiation of C3 but less TGC?). Regardless, it's hard to draw too much conclusion from one not-very-pretty Western blot. This figure requires both additional replicates and a better explanation of how it fits with the other conclusions of the paper.

      We hypothesized that the JZ defect observed in the semi-lethal genotype placentas could arise either from impaired maintenance of the progenitor pool or from reduced capacity of mutant trophoblast progenitors to differentiate into the JZ lineage. The blot in Supplementary Figure 6 was intended as a qualitative demonstration that mutant trophoblast stem cells can differentiate into JZ lineages. We recognize that the figure is not definitive and will revise the text to clarify its purpose. Replicates of the Western blot will be performed as suggested.

      • Supplementary Figure 7F-G is puzzling. Half of the mESCs have gamma-H2AX at all times, including most in S or G2 phase? In Figure S7E, do the quadrants correspond to being negative or positive for gamma-H2AX? At very least, IF images showing clear gamma-H2AX foci would be much more convincing.

      The gates for γH2AX FACS analysis were established using negative controls lacking primary antibody. As reported previously, embryonic stem cells display high basal levels of γH2AX staining (Chuykin et al., Cell Cycle 2008; Turinetto et al., Stem Cells 2012; Ahuja et al., Nat Comm 2016), which likely explains the broad signal observed across cell cycle phases. Regardless, we will provide immunofluorescence staining of γH2AX and foci counts in our revision.

      • The methods section is well detailed, but it would be ideal to clarify how many replicates each Western Blot or flow cytometry experiment is representative of.

      Thanks for the suggestion. We will update this for Fig. 4 and Fig. 5.

      Minor comments:

      • Is it possible that cGAS-STING and RIG pathways act redundantly to cause inflammation and lethality, or that other innate immune components are involved? I don't expect the authors to make compound mutants to test this but at least this possibility should be discussed textually.

      We appreciate the reviewer’s point, and had the same suspicion. Supporting this, new RNA-seq analysis of Tmem173 KO placentas, which we will add, revealed elevated inflammatory gene expression compared to C3/C3 M2/+ controls, consistent with potential redundancy or feedback regulation. We will update the supplementary figures to reflect this.

      In response to reviewer #2 comments:

      Major comments:

      A major concern throughout the paper is that conclusions are often overstating their data. The title of figure 2 is "placentae with replication stress have smaller junctional and labyrinth zones". However, there is no measure of replication stress in this figure, just a histological evaluation of the placentae from the different mutants. The title of figure 3 is "Impact of GIN on LZ is less than JZ," but there is no measure of GIN, but instead measurement of number of cells in cell cycle and some bulk RNA-seq analysis. Title of figure 4 is "TSCs with increased genomic instability exhibit abnormal phenotypes." Again there is no measure of GIN, but instead staining of derived TSCs for proliferation, cell death, and a TSC marker. Title of figure 5 is "DNA damage responses and G2/M checkpoint activation drive premature TSC differentiation." However, there does not appear to be a difference in gH2AX between the two mutant genotypes. Checkpoint proteins might be up, but need quantification and reproduction. > 4C is the only marker of differentiation. Importantly, all the analyses here are associations, not connections, so cannot use the word "drive". Similar issues can be raised with a number of the supplementary figures.

      The Chaos3 (chromosome aberrations occurring spontaneously 3) model is a well-established system of intrinsic chronic replication stress and GIN. It is characterized by a ~20-fold elevation of blood micronuclei (Shima et al., Nature 2007), a hallmark of GIN (Soxena et al., Mol Cell 2022); a destabilized MCM2-7 helicase prone to replication fork collapse (Bai et al., PLoS Genet 2016); and increased mitotic chromosome abnormalities and decreased dormant origins (Kawabata et al., Mol Cell 2011; Chuang et al., Nucleic Acids Res 2012) that are known to cause GIN and replication stress (Ibarra et al., PNAS 2008). Also, in our previous work (McNairn et al., Nature 2019), we showed that placentae from C3/C3 dams exhibit significantly elevated γH2AX as well as reduced MCM2 and MCM4 protein levels. In our current study, we also observe elevated γH2AX in mutant TSCs (C3/C3 and C3/C3 M2/+), consistent with genomic instability. Nevertheless, we acknowledge that in TSCs we did not formally demonstrate replication stress (RS), so where appropriate, we will revise figure titles, for example to say “cells/placentae with a GIN or RS genotype.”

      We acknowledge the reviewer’s concern regarding western blots. We will provide quantification and statistics in our revision.

      1) A deeper analysis of the cell lines is likely to be the most fruitful path to reveal interesting mechanisms. It is very surprising that there is no phenotype in ESCs. Authors should check for increased apoptosis. Maybe the phenotypic cells are lost. Or do ESCs use different MCMs/mechanisms of DNA replication or are they better able to handle replication stress and GIN? How many passages were the TSCs and ESCs cultured for? Does GIN (i.e. aneuploidy, CNVs) develop in TSCs and ESCs with passaging? How do the MCM mutations impact the molecular identity of the ESC and TSC cells, including their heterogeneity in the population?

      We assessed apoptosis using cleaved caspase 3 flow cytometry in mutant ESCs and observed no difference compared to controls (we will add this data as Supplementary Fig. 7).

      We believe there are intrinsic differences between TSCs and ESCs in their ability to respond to and counteract replication stress and DNA damage. ESCs are known to license more replication origins than somatic cells at a higher rate, which protects them from short G1-induced replication stress (Ahuja et al., Nat Comm 2016; Ge et al., Stem Cell Rep 2015; Matson et al., eLife 2017). Human placental cells physiologically exhibit high mutation rates and chromosomal instability in vivo (Coorens et al., Nature 2021). Supporting this, Wang et al. (Nat Comm 2025) reported that several cell cycle and DDR regulators are differentially expressed in human TSCs vs human pluripotent stem cells. Whether such transcriptional differences directly contribute to functional outcomes remains to be determined.

      All experiments in this study were conducted using early-passage ESCs and TSCs (i.e. Finally, we showed that close to 90% mutant ESCs are KLF4+ (a naive pluripotency marker) whereas EOMES+ cells were significantly reduced in TSCs carrying the GIN genotype (Fig. 4E–F and Supplementary Fig. 7), highlighting lineage-specific differences.

      Minor Comments:

      1) There is a lack of quantification and repeats for all Westerns. At minimum there should be three repeats for each experiment, quantification including normalization to a reference protein, and stats confirming any proposed differences between conditions.

      We will update our revision with quantification and statistics for western blots.

      2) I would recommend moving the results in supp table 1 to figure 1. While negative, they are the newer results. The results shown in current figure 1 are essentially a reproduction of their previous work.

      The placental observations presented in Fig. 1 are new. In particular, the placental and embryonic weight measurements graphed in Fig. 1B and C have not been published by our group. Fig. 1A reproduces our previous observation on embryo viability in GIN mutants (McNairn et al., Nature 2019), while the schematic was provided for better flow and readability given the complex mating schemes. We are agnostic about Supplementary Table 1. It could be changed to a new Table 1 in the main section depending on the journal.

      In response to reviewer #3 comments:

      Major Comments

      While the inclusion of bulk RNAseq data of whole placental tissue is appreciated, the interpretation of the results is somewhat problematic, as it is acknowledged that the cell type composition of the placentas is drastically different between groups. Making conclusions based upon GSEA analysis of two different groups with drastically different cell type composition is somewhat misleading, as based on the results, it is a direct reflection of the cell types present. It would be more helpful to perform cell type deconvolution of the RNAseq data to estimate the proportion of each cell type within the bulk samples and compare that to what is seen histologically and not dive too deeply into the pathways since the results could just be a reflection of the cell types e.g. angiogenesis pathways from more endothelial cells. Additionally, the RNAseq data can be leveraged to look at expression of inflammatory genes by sex, which may show interesting patterns based on the other results.

      We agree that the representation of cell types in the placenta is problematic especially for underrepresented genes. We propose to use the BayesPrism tool (Chu et al., Nat Cancer 2022) to deconvolute bulk RNA-seq for better representation of transcriptional changes in the placenta.

      Section: GIN impairs trophoblast stem cell establishment and maintenance. To support the assertion in the first paragraph, beyond measuring apoptosis, it would be helpful at this stage to look at RNA expression levels indicative of the activation of DNA damage checkpoint genes.

      We have performed RNA-seq on mutant ESCs and TSCs and are in the process of data analysis. We will update these results in the revision.

      Please include additional methodological details in the methods section on the statistical analysis done for differential expression analysis. Specifically, what type of normalization was used, if lowly expressed genes were filtered out and at what cutoff, what statistical model was used (did you include covariates?), what comparisons were made? Did you stratify by sex? What cutoff was used for statistical significance? Did you perform multiple testing correction?

      We will update RNA-Seq data analysis methods in our full revision.
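
      One of the points raised above, multiple testing correction, can be illustrated with a minimal sketch. The Python below implements the standard Benjamini-Hochberg adjustment. It is illustrative only and does not represent the analysis pipeline used in the study; the function name and the example p-values are assumptions introduced here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Hypothetical helper for illustration; not taken from the manuscript.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up per-gene p-values, for demonstration only.
    example = [0.01, 0.04, 0.03, 0.20]
    print(benjamini_hochberg(example))
```

      Genes whose adjusted value falls below the chosen cutoff (for example 0.05) would be called significant at that false discovery rate. The cutoff itself is a choice the methods section would need to state.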

      2. Description of the revisions that have already been incorporated in the transferred manuscript

      Reviewer #1 comments:

      • Supplementary Table 1 would be greatly enhanced by showing comparable tables for Mcm4C3/C3 x Mcm4C3/+McmGt/+ in mice without the Tmem173 or Ddx58 mutations. It is fine to recycle data from McNairn 2019 here, as long as the source is indicated, but a comparison is needed.

      Thanks for pointing this out. We have incorporated this suggestion into Supplementary Table 1.

      • In Figure S3E-F, is the box above each graph supposed to show the genotype of the dam?

      Yes. Thanks for pointing this out. We have added a description in the figure legend to make it clear.

      • "Indeed, the placenta and embryo weights of E13.5 Mcm4C3/C3 Mcm2Gt/+ Mcm3Gt/+ animals were significantly improved vs. Mcm4C3/C3 Mcm2Gt/+ animals, rendering them similar to Mcm4C3/C3 littermates (Fig. 6A-C). The JZ (but not LZ) area in Mcm4C3/C3 Mcm2Gt/+ Mcm3Gt/+ placentae also increased to the level of Mcm4C3/C3 littermates (Fig. 6D-H)." There are two problems here. First, the figure calls are wrong. Second, the description of the data is not quite right: it looks like the C3/C3 and C3/C3 M2/+ M3/+ LZs are a similar size to each other and are statistically indistinguishable.

      Thanks for catching this. We have updated these in the main text.

      Reviewer #2 comments:

      Minor comment

      • Need to review citations to figures. For example, no citations are made to figure 4a and 4c.

      Thanks for catching this. We have updated the text.

      Reviewer #3 comments:

      Define the first use of >4C DNA content to help readers understand this potentially unfamiliar term.

      We have edited this part to indicate cells with more than 4C DNA content for better clarity.

      iDEP tool - please include citation to manuscript instead of link

      We have updated this citation.

      Check citations. Some citations to BioRxiv that are now published e.g. 13.

      We have updated this citation.

      3. Description of analyses that authors prefer not to carry out

      Reviewer 2

      2) Along similar lines, most of the in vivo phenotypic analyses are performed at E13.5, long after defects are likely beginning to express themselves, especially given that they see phenotypes in the TSCs, which represent the polar TE of an E4.5 embryo. To understand the primary defects of the in vivo phenotype, they should be looking much earlier. Supplemental figure 5 is a start but represents a rather superficial analysis.

      The peri-implantation period, namely E4.5, represents a “black box” of embryonic development given that this is a critical stage for implantation. Aside from being an extremely difficult stage to analyze technically, we don’t think it is essential to the conclusions (or doable in a timely manner), especially given the use of TSCs. If we complete EdU studies on E6.5 embryos, we will include them.

      3) Fig. 6 would benefit from evidence that MCM3 mutant is rescuing MCM4 levels in the chromatin fraction of cells and the DNA damage phenotype.

      The genetic evidence presented is strong, and although we didn’t do the suggested experiment, we feel that our previous studies (McNairn et al., Nature 2019 and Chuang et al., PLoS Genet 2010) on the effects of MCM3 as a nuclear export factor (as it is in yeast (Liku et al., Mol Biol Cell 2005)) are a reasonable basis for not repeating such experiments. Furthermore, we are no longer maintaining the Mcm3 line and it would take over a year to reconstitute and rebreed triple mutants.

    1. Reviewer #3 (Public review):

      Summary:

      Lmx1a is an orthologue of apterous in flies, which is important for dorsal-ventral border formation in the wing disc. Previously, this research group has described the importance of the chicken Lmx1b in establishing the boundary between sensory and non-sensory domains in the chicken inner ear. Here, the authors described a series of cellular changes during border formation in the chicken inner ear, including alignment of cells at the apical border and concomitant constriction basally. The authors extended these observations to the mouse inner ear and showed that these morphological changes occurred at the border of Lmx1a positive and negative regions, and these changes failed to develop in Lmx1a mutants. Furthermore, the authors demonstrated that the ROCK-dependent actomyosin contractility is important for this border formation and blocking ROCK function affected epithelial basal constriction and border formation in both in vitro and in vivo systems.

      Strengths:

      The morphological changes described during border formation in the developing inner ear are interesting. Linking these changes to the function of Lmx1a and ROCK-dependent actomyosin contractile function is provocative.

      Weaknesses:

      There are several outstanding issues that need to be clarified before one can pin the morphological changes observed being causal to border formation and that Lmx1a and ROCK are involved.

      Comments on the latest version:

      The revised manuscript has provided clarity of their results on some levels, but unfortunately, the basal restriction during border formation remains unclear and the study did not advance the understanding of the role of Lmx1a in boundary formation. Overall comments are indicated below:

      (1) The authors state in the rebuttal, "we do not think that ROCK activity is required for the formation or maintenance of the basal constriction at the interface of Lmx1a-expressing and non-expressing cells." If the above is the sentiment of the authors, then the manuscript is not written to support this sentiment clearly, starting with this misleading sentence in the Abstract, "The boundary domain is absent in Lmx1a-deficient mice, which exhibit defects in sensory organ segregation, and is disrupted by the inhibition of ROCK-dependent actomyosin contractility."

      (2) As acknowledged by the authors, the data as they currently stand could be explained by Lmx1a functioning in specifying the non-sensory fate and may not function directly in boundary formation. With this caveat in mind, the role of Lmx1a in boundary formation remains unclear.

      (3) I feel like the word "orchestrate" in the title is an overstatement.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The manuscript by Raices et al., provides novel insights into the role and interactions between SPO-11 accessory proteins in C. elegans. The authors propose a model of meiotic DSBs regulation, critical to our understanding of DSB formation and ultimately crossover regulation and accurate chromosome segregation. The work also emphasizes the commonalities and species-specific aspects of DSB regulation.

      Strengths:

      This study capitalizes on the strengths of the C. elegans system to uncover genetic interactions between a large number of SPO-11 accessory proteins. In combination with physical interactions, the authors synthesize their findings into a model, which will serve as the basis for future work, to determine mechanisms of DSB regulation.

      Weaknesses:

      The methodology, although standard, lacks quantification. This includes the mass spectrometry data, along with the cytology. The work would also benefit from clarifying the role of the DSB machinery on the X chromosome versus the autosomes.

      • We have uploaded the MS data and added a summary table with the number of peptides and coverage.

      • We have added statistics to the comparisons of DAPI body counts.

      • We have provided additional images of the change in HIM-5 localization.

      • We have quantified the overlap (or lack thereof) between XND-1 and HIM-17 and the DNA axis.

      Reviewer #2 (Public Review):

      Summary:

      Meiotic recombination initiates with the formation of DNA double-strand breaks (DSBs), catalyzed by the conserved topoisomerase-like enzyme Spo11. Spo11 requires accessory factors that are poorly conserved across eukaryotes. Previous genetic studies have identified several proteins required for DSB formation in C. elegans to varying degrees; however, how these proteins interact with each other to recruit the DSB-forming machinery to chromosome axes remains unclear.

      In this study, Raices et al. characterized the biochemical and genetic interactions among proteins that are known to promote DSB formation during C. elegans meiosis. The authors examined pairwise interactions using yeast two-hybrid (Y2H) and co-immunoprecipitation and revealed an interaction between a chromatin-associated protein HIM-17 and a transcription factor XND-1. They further confirmed the previously known interaction between DSB-1 and SPO-11 and showed that DSB-1 also interacts with a nematode-specific HIM-5, which is essential for DSB formation on the X chromosome. They also assessed genetic interactions among these proteins, categorizing them into four epistasis groups by comparing phenotypes in double vs. single mutants. Combining these results, the authors proposed a model of how these proteins interact with chromatin loops and are recruited to chromosome axes, offering insights into the process in C. elegans compared to other organisms.

      Weaknesses:

      This work relies heavily on Y2H, which is notorious for having high rates of false positives and false negatives. Although the interactions between HIM-17 and XND-1 and between DSB-1 and HIM-5 were validated by co-IP, the significance of these interactions was not tested, and cataloging Y2H interactions does not yield much more insight.

      We appreciate that the reviewer recognized the value of our IP data, but we beg to differ that we rely too heavily on the Y2H. We also provide genetic analysis on bivalent formation to support the physical interaction data. We do acknowledge that there are caveats with Y2H, however, including that a subset of the interactions can only be examined with proteins in one orientation due to auto-activation. While we acknowledge that it would be nice to have IP data for all of the proteins using CRISPR-tagged, functional alleles, these strains are not all feasible (e.g. no functional rec-1 tag has been made) and are beyond the scope of the current work.

      Moreover, most experiments lack rigor, which raises serious concerns about whether the data convincingly supports the conclusions of this paper. For instance, the XND-1 antibody appears to detect a band in the control IP; however, there was no mention of the specificity of this antibody.

      We previously showed the specificity of this antibody in its original publication showing lack of staining in the xnd-1 mutant by IF (Wagner et al., 2010). To further address this, however, we have now included a new supplementary figure (Figure S1) demonstrating the specificity of the XND-1 antibody by Western blot. The antibody detects a distinct band in extracts from wild-type (N2) worms, but this band is absent in two independent xnd-1 mutant strains. This confirms that the antibody specifically recognizes XND-1, supporting the validity of the IP results shown in the main figures.

      Additionally, epistasis analysis of various genetic mutants is based on the quantification of DAPI bodies in diakinesis oocytes, but the comparisons were made without statistical analyses.

      We have added statistical analysis to all datasets where quantification was possible, strengthening the rigor and interpretation of our findings.

      For cytological data, a single representative nucleus was shown without quantification and rigorous analysis. The rationale for some experiments is also questionable (e.g. the rescue by dsb-2 mutants by him-5 transgenes in Figure 2), making the interpretation of the data unclear. Overall, while this paper claims to present "the first comprehensive model of DSB regulation in a metazoan", cataloging Y2H and genetic interactions did not yield any new insights into DSB formation without rigorous testing of their significance in vivo. The model proposed in Figure 4 is also highly speculative.

      Regarding the cytology, we provide new images and quantification of HIM-17 and XND-1 overlap with the DNA axes. We also added full germ line images showing HIM-5 localization in wild type and dsb-1 mutants, to provide a more complete and representative view of the observed phenotype. To further support our findings, we’ve also included images demonstrating that this phenotype is consistently observed both in live worms with the him-5::GFP transgene and in fixed worms with an endogenously tagged version of HIM-5.

      Reviewer #3 (Public Review):

      During meiosis in sexually reproducing organisms, double-strand breaks are induced by a topoisomerase-related enzyme, Spo11, which is essential for homologous recombination, which in turn is required for accurate chromosome segregation. Additional factors control the number and genome-wide distribution of breaks, but the mechanisms that determine both the frequency and preferred location of meiotic DSBs remain only partially understood in any organism.

      The manuscript presents a variety of different analyses that include variable subsets of putative DSB factors. It would be much easier to follow if the analyses had been more systematically applied. It is perplexing that several factors known to be essential for DSB formation (e.g., cohesins, HORMA proteins) are excluded from this analysis, while it includes several others that probably do not directly contribute to DSB formation (XND-1, HIM-17, CEP-1, and PARG-1).

      We respectfully disagree with the reviewer’s statement regarding the selection of factors included in our analysis. In this work, our focus was specifically on SPO-11 accessory factors: proteins that directly interact with or regulate SPO-11 activity during double-strand break formation. Cohesins and chromosome axis proteins (such as the HORMA domain proteins) are essential for establishing the correct chromosome architecture that supports DSB formation, but there is no evidence that they are direct accessory factors of SPO-11. Therefore, they were intentionally excluded from this study to maintain a clear and focused scope on proteins that more directly modulate SPO-11 function.

      Conversely, XND-1, HIM-17, CEP-1, and PARG-1 have all been implicated in regulating aspects of SPO-11-mediated DSB formation or its immediate environment. Although their contributions may involve broader chromatin or DNA damage response regulation, prior literature supports their inclusion as relevant modulators of SPO-11 activity, justifying their analysis within the context of this work.

      The strongest claims seem to be that "HIM-5 is the determinant of X-chromosome-specific crossovers" and "HIM-5 coordinates the actions of the different accessory factors subgroups." Prior work had already shown that mutations in him-5 preferentially reduce meiotic DSBs on the X chromosome. While it is possible that HIM-5 plays a direct role in DSB induction on the X chromosome, the evidence presented here does not strongly support this conclusion. It is also difficult to reconcile this idea with evidence from prior studies that him-5 mutations predominantly prevent DSB formation on the sex chromosomes, while the protein localizes to autosomes.

      HIM-5 is not the only protein that is autosomally enriched but preferentially affects the X chromosome: MES-4 and MRG-1 are both autosomally enriched but influence silencing of the X chromosome. While HIM-5 appears autosomally enriched, it does not appear to be autosomal-exclusive. While we would ideally perform ChIP to determine its localization on chromatin, this method is likely insufficient to identify DSB sites, which differ in each nucleus and for which there are no known hotspots in the worm.

      him-5 mutants confer an ~50% reduction in the total number of breaks and a very profound change in break dynamics (seen by RAD-51 foci (Meneely et al., 2012)). Since the autosomes receive sufficient breaks in this context to attain a crossover in >98% of nuclei, this indicates that the autosomes are much less profoundly impacted by loss of DSB functions than is the X chromosome. Indeed, prior data from co-author Monica Colaiacovo showed that fewer breaks occur on the X (Gao, 2015), likely resulting from differences in the chromatin composition of the X and autosomes that arise from X chromosome silencing.

      The conclusion that HIM-5 must be required for breaks on the X comes from the examination of DSB levels and their localization in different mutants that impair but do not completely abrogate breaks. In any situation where HIM-5 protein expression is affected (xnd-1, him-17, and him-5 null alleles), breaks on the X are reduced/eliminated. By contrast, in dsb-2 mutants, where HIM-5 expression is unaffected, both X and autosomal breaks are impacted equally. As discussed above, in the absence of HIM-5 function, there are ~15 breaks/nucleus. The Ppie-1::him-5 transgene is expressed at lower levels than Phim-5::him-5, but in the best case, the ectopic expression of this protein should give a maximum of ~15 breaks (the total number of breaks is thought to be ~30/nucleus). By these estimates, Ppie-1::him-5; him-17 and him-5 null mutants have the same number of breaks. Yet, in the former case, breaks occur on the X, whereas in the latter they do not. The best explanation for this discrepancy is that HIM-5 is sufficient to recruit the DSB machinery to the X chromosome.

      The one experiment that seems to elicit the conclusion that HIM-5 expression is sufficient for breaks on the X chromosome is flawed (see below). The conclusion that HIM-5 "coordinates the activities of the different accessory sub-groups" is not supported by data presented here or elsewhere.

      We have reorganized the discussion to more directly address the reviewers’ concerns. We raise the possibility that HIM-5 has an important role in bringing together SPO-11 and its interacting components (DSB-1/2/3) with the other DSB-inducing factors, including those factors that regulate DSB timing (XND-1), coordination with the cell cycle (REC-1), association with the chromosome axis (PARG-1, MRE-11), and coupling to downstream resection and repair (MRE-11, CEP-1).

      This raises a natural question: if HIM-5 has such a central role, why are the phenotypes of him-5 loss so mild? We propose that while the loss of DSBs on the X appears mild, more profound effects are seen in the total number, timing, and placement of the DSBs across the genome, all of which are diminished or altered in the absence of HIM-5. The phenotypes of him-5 loss are reminiscent of those observed in Prdm9-/- mice, where breaks are relocated to transcriptional start sites and show a significant delay in formation. As with PRDM9, which promotes proper DSB formation in most mammals, the comparatively subtle phenotypes of HIM-5 loss do not diminish its critical role.

      Like most other studies that have examined DSB formation in C. elegans, this work relies on indirect assays, here limited to the cytological appearance of RAD-51 foci and bivalent chromosomes, as evidence of break formation or lack thereof. Unfortunately, neither of these assays has the power to reveal the genome-wide distribution or number of breaks. These assays have additional caveats, due to the fact that RAD-51 association with recombination intermediates and successful crossover formation both require multiple steps downstream of DSB induction, some of which are likely impaired in some of the mutants analyzed here. This severely limits the conclusions that can be drawn. Given that the goal of the work is to understand the effects of individual factors on DSB induction, direct physical assays for DSBs should be applied; many such assays have been developed and used successfully in other organisms.

      We appreciate the reviewer’s thoughtful comments. We agree that RAD-51 foci are an indirect readout of DSB formation and that their dynamics can be influenced by defects in downstream repair processes. However, in C. elegans, the available methods for directly detecting DSBs are limited. Unlike other organisms, C. elegans lacks γH2AX, eliminating the possibility of using γH2AX as a DSB marker. TUNEL assays, while conceptually appealing, have proven unreliable and poorly reproducible in the germline context. Similarly, RPA foci do not consistently correlate with the number of DSBs and are influenced by additional processing steps.

      Given these limitations, RAD-51 foci remain the most widely accepted surrogate for monitoring DSB formation in C. elegans. While we fully acknowledge the caveats associated with this approach — particularly the potential effects of downstream repair defects — RAD-51 analysis continues to provide valuable insight into DSB dynamics and regulation, especially when interpreted in combination with other phenotypic assessments.

      Throughout the manuscript, the writing conflates the roles played by different factors that affect DSB formation in very different ways. XND-1 and HIM-17 have previously been shown to be transcription factors that promote the expression of many germline genes, including genes encoding proteins that directly promote DSBs. Mutations in either xnd-1 or him-17 result in dysregulation of germline gene expression and pleiotropic defects in meiosis and fertility, including changes in chromatin structure, dysregulation of meiotic progression, and (for xnd-1) progressive loss of germline immortality. It is thus misleading to refer to HIM-17 and XND-1 as DSB "accessory factors" or to lump their activities with those of other proteins that are likely to play more direct roles in DSB induction.

      It is clear that we will not reach agreement here about the direct vs indirect roles of chromatin remodelers/transcription factors in break formation. There is precedent in yeast (SPP1) and in mouse (Prdm9), both of which could also be described as transcription factors, for roles in break formation through creating an open chromatin environment for the break machinery. We envision that these proteins function in the same fashion. The changes in histone acetylation in the xnd-1 mutants support such a claim.

      We do not know what the reviewer is referring to in the statement that “XND-1 and HIM-17 have previously been shown to be transcription factors that promote the expression of many germline genes.” While the Carelli et al. paper indeed shows a role for HIM-17 in expression of many germline genes, there is only one reference to XND-1 in this manuscript (Figure S3A), which shows that half of XND-1 binding sites overlap with the co-opted germline promoters. There is no transcriptional data at all on xnd-1 mutants, save our studies (referenced herein) showing that XND-1 regulates him-5 expression.

      For example, statements such as the following sentence in the Introduction should be omitted or explained more clearly: "xnd-1 is also unique among the accessory factors in influencing the timing of DSBs; in the absence of xnd-1, there is precocious and rapid accumulation of DSBs as monitored by the accumulation of the HR strand-exchange protein RAD-51."

      We are not sure what is confusing here. The distribution of RAD-51 foci is significantly altered in xnd-1 mutants and peak levels of breaks are achieved as nuclei leave the transition zone (Wagner et al., 2010; McClendon et al., 2016). There is no other mutation that causes this type of change in RAD-51 distribution.

      The evidence that HIM-17 promotes the expression of him-5 presented here corroborates data from other publications, notably the recent work of Carelli et al. (2022), but this conclusion should not be presented as novel here.

      We have clarified this in the text. We note that this paper showed alterations in him-5 levels by RNA-Seq but they did not validate these results with quantitative RT-PCR. Thus, our studies do provide an important validation of their prior results.

      The other factors also fall into several different functional classes, some of which are relatively well understood, based largely on studies in other organisms. The roles of RAD50 and MRE-11 in DSB induction have been investigated in yeast and other organisms as well as in several prior studies in C. elegans. DSB-1, DSB-2, and DSB-3 are homologs of relatively well-studied meiotic proteins in other organisms (Rec114 and Mei4) that directly promote the activity of Spo11, although the mechanism by which they do so is still unclear.

      Whilst we agree that we understand some of the functions of the homologs, there are clearly examples in other processes of conserved proteins adopting unique regulatory functions. We should not presume evolutionary conservation until proven. Indeed, the comparison between the Mer2 proteins becomes particularly relevant here. For example, the RMM complex in plants does not contain PRD3, although this protein is thought to function in DSB formation and repair (Lambing et al., 2022; Vrielynck et al., 2021; Thangavel et al., 2023). In Sordaria, as well, the Mer2 homolog has distinct functions (Tesse et al., 2017).

      Mutations in PARG-1 (a Poly-ADP ribose glycohydrolase) likely affect the regulation of polyADP-ribose addition and removal at sites of DSBs, which in turn are thought to regulate chromatin structure and recruitment of repair factors; however, there is no convincing evidence that PARG-1 directly affects break formation.

      Our prior collaborative studies on PARG-1 showed that it has a non-catalytic function that promotes DSBs independently of PAR accumulation (Janisiw et al., 2020; Trivedi et al., 2022).

      CEP-1 is a homolog of p53 and is involved in the DNA damage response in the germline, but again is unlikely to directly contribute to DSB induction.

      We respectfully disagree with the reviewer’s statement. While CEP-1 is indeed a homolog of p53 and plays a major role in the DNA damage response, prior work from Brent Derry’s lab and from our group (Mateo et al., 2016) demonstrated that specific cep-1 separation-of-function alleles affect DSB induction and/or repair pathway choice independently of canonical DNA damage checkpoint activation. In particular, defects in DSB formation observed in certain cep-1 mutants can be rescued by exogenous irradiation, supporting a direct or closely linked role in promoting DSB formation rather than merely responding to damage. Thus, based on these functional data, we considered CEP-1 a relevant factor to include in our analysis. We have now clarified this rationale in the revised manuscript.

      HIM-5 and REC-1 do not have apparent homologs in other organisms and play poorly understood roles in promoting DSB induction. A mechanistic understanding of their functions would be of value to the field, but the current work does not shed light on this. A previous paper (Chung et al. G&D 2015) concluded that HIM-5 and REC-1 are paralogs arising from a recent gene duplication, based on genetic evidence for a partially overlapping role in DSB induction, as well as an argument based on the genomic location of these genes in different species; however, these proteins lack any detectable sequence homology and their predicted structures are also dissimilar (both are largely unstructured but REC-1 contains a predicted helical bundle lacking in HIM-5). Moreover, the data presented here do not reveal overlapping sets of genetic or physical interactions for the two genes/proteins. Thus, this earlier conclusion was likely incorrect, and this idea should not be restated uncritically here or used as a basis to interpret phenotypes.

      Actually, there is quite good bioinformatic analysis showing that the rec-1 and him-5 loci evolved from a gene duplication and that each shares features of the ancestral protein (Chung et al., 2015). We are sorry if the reviewer casts aspersions on the prior literature and analyses. The homology of these genes to the ancestral protein is nearly the same degree as that of dsb-1, dsb-2, or dsb-3 to their ancestral homologs (<17%).

      DSB-1 was previously reported to be strictly required for all DSB and CO formation in C. elegans. Here the authors test whether the expression of HIM-5 from the pie-1 promoter can rescue DSB formation in dsb-1 mutants, and claim to see some rescue, based on an increase in the number of nuclei with one apparent bivalent (Figure 2C). This result seems to be the basis for the claim that HIM-5 coordinates the activities of other DSB proteins. However, this assay is not informative, and the conclusion is almost certainly incorrect. Notably, a substantial number of nuclei in the dsb-1 mutant (without Ppie-1::him-5) are reported as displaying a single bivalent (11 DAPI staining bodies) despite prior evidence that DSBs are absent in dsb-1 mutants; this suggests that the way the assay was performed resulted in false positives (bivalents that are not actually bivalents), likely due to inclusion of nuclei in which univalents could not be unambiguously resolved in the microscope. A slightly higher level of nuclei with a single unresolved pair of chromosomes in the dsb-1; Ppie-1::him-5 strain is thus not convincing evidence for rescue of DSBs/CO formation, and no evidence is presented that these putative COs are X-specific. The authors should provide additional experimental evidence - e.g., detection of RAD-51 and/or COSA-1 foci or genetic evidence of recombination - or remove this claim. The evidence that expression of Ppie-1::him-5 may partially rescue DSB abundance in dsb-2 mutants is hard to interpret since it is currently unknown why C. elegans expresses 2 paralogs of Rec114 (DSB-1 and DSB-2), and the age-dependent reduction of DSBs in dsb-2 mutants is not understood.

      We have removed this claim, in part because we have been unable to create the triple mutant strains needed to analyze COSA-1 foci.

      To the point about 11 vs 12 DAPI bodies: the literature is actually replete with examples of 11 DAPI bodies vs 12 in mutants with no breaks:

      Hinman et al., 2021: null allele of dsb-3 has an average of 11.6 +/- 0.6 breaks;

      Stamper et al., 2013, show just over 60% of dsb-1 nuclei with 12 DAPI bodies and 5-10% with 10 DAPI bodies (Figure 1);

      In addition, we also previously showed (Machovina et al., 2016) that a subset of meiotic nuclei have a single RAD-51 focus and can achieve a crossover. RAD-51 foci in spo-11 were also reported in Colaiacovo et al., 2003.

      Several of the factors analyzed here, including XND-1, HIM-17, HIM-5, DSB-1, DSB-2, and DSB-3, have been shown to localize broadly to chromatin in meiotic cells. Coimmunoprecipitation of pairs of these factors, even following benzonase digestion, is not strong evidence to support a direct physical interaction between proteins.

      Similarly, the super-resolution analysis of XND-1 and HIM-17 (Figure 1EF) does not reveal whether these proteins physically interact with each other, and does not add to our understanding of these proteins' functions, since they are already known to bind to many of the same promoters. Promoters are also likely to be located in chromatin loops away from the chromosome axis, so in this respect, the localization data are also confirmatory rather than novel.

      While the binding to promoters would be expected to be on DNA loops, that has not been definitively shown in the worm germ line. The supplemental data of the Carelli paper suggest that there are ~250 binding sites for each protein at these coopted promoters. This could not account for the crossover map seen in C. elegans.

      The reviewer correctly states that we do not reveal that these proteins interact, but we have shown that the two proteins co-IP and have a Y2H interaction. This interaction is supported by a recent publication (Blazickova et al., 2025) that corroborates this conclusion and identifies XND-1 in HIM-17 co-IPs, also in the presence of benzonase. We do now show, however, by immuno-localization that the two proteins appear to be adjacent but nonoverlapping. As now described in the text, AlphaFold 3 modeling and structural analysis suggest that the two proteins do interact directly and that the tagged 5’ end of HIM-17 used in our studies is likely to be at least 200 nm from the putative XND-1 binding interface, a distance that is consistent with our confocal images showing frequent juxtaposition of the two proteins.

      The phenotypic analysis of double mutant combinations does not seem informative. A major problem is that these different strains were only assayed for bivalent formation, which (as mentioned above) requires several steps downstream of DSB induction. Additionally, the basis for many of the single mutant phenotypes is not well understood, making it particularly challenging to interpret the effects of double mutants. Further, some of the interactions described as "synergistic" appear to be additive, not synergistic. While additive effects can be used as evidence that two genes work in different pathways, this can also be very misleading, especially when the function of individual proteins is unknown. I find that the classification of genes into "epistasis groups" based on this analysis does not shed light on their functions and indeed seems in some cases to contradict what is known about their functions.

      As described above, each of the proteins analyzed is thought to have a direct role in regulating meiotic DSB formation, and single mutant phenotypes are consistent with this interpretation. In almost all, if not all, of these cases, IR-induced breaks suppress univalent phenotypes (or uncover a downstream repair defect, e.g., in mre-11), supporting this conclusion. We have changed the terminology from “epistasis groups”, since this is not strict epistasis, to “functional groups”.

      The yeast two-hybrid (Y2H) data are only presented as a single colony. While it is understandable to use a 'representative' colony, it is ideal to include a dilution series for the various interactions, which is how Y2H data are typically shown.

      The Y2H data are presented as spots on a plate, each from three to four individual transformants per interaction tested, and are not individual colonies. The experiment was repeated in triplicate from different transformations. We have now made this clearer in the Materials and Methods section. This approach has been successfully used to examine protein interactions in our prior manuscripts on yeast and human proteins [Gaines et al (2015) Nat. Comms 6:7834; Kondrashova et al (2017) Cancer Discovery 7:984; Garcin et al (2019) PLoS Genetics 15:e1008355; Bonilla et al (2021) eLife 1: e68080; Prakash et al (2022) PNAS 119: e2202727119, etc.]

      Additional (relatively minor) concerns about these data:

      (1) Several interactions reported here seem to be detected in only one direction - e.g., MRE-11-AD/HIM-5-BD, REC-1-AD/XND-1-BD, and XND-1-AD/HIM-17-BD - while no interactions are seen with the reciprocal pairs of fusion proteins. I'm not sure if some of this is due to pasting "positive" colony images into the wrong position in the grid, but this should be addressed.

      The asymmetry in the interactions observed is due to the well-known phenomenon in yeast two-hybrid (Y2H) assays where certain plasmids exhibit self-activation when fused in one orientation, making interpretation of reciprocal interactions challenging. In our experiment, some of the plasmids indeed showed self-activation in one direction, which likely accounts for the lack of interaction seen with the reciprocal pairs of fusion proteins. We have clarified this point in the Methods.

      (2) DSB-3 was only assayed in pairwise combinations with a subset of other proteins; this should be explained; it is also unclear why the interaction grids are not symmetrical about the diagonal.

      We have now completed the analysis by adding the interactions of DSB-3 with the remaining proteins that were missing from the initial set.

      (3) I don't understand why the graphic summaries of Y2H data are split among 3 different figures (1, 2, and 3).

      We chose to split the graphic summaries of the Y2H data across Figures 1, 2, and 3 because we felt this organization better aligns with the flow of the results presented in each figure. Each set of interactions is shown in the context of the specific experiments and findings discussed in those sections, which we believe helps provide a clearer and more logical presentation of the data.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Figure 1: B) The IP is difficult to interpret - there is a band of the corresponding size to XND-1 in the control lane calling into question the specificity of the IP/Western.

      We added a supplemental figure with the specificity of the antibody showing that there is a background non-specific band.

      C) More information about the mass spectrometry should be included. No indication of the number of times a peptide was identified, or the overall coverage of the identified proteins.

      Done

      This is important as in the results section (line 114) the authors indicate that there was "strong" interaction yet there is no way to assess this.

      D) Why wasn't hatching measured in the him-5p::him-5; him-17(ok424) strain?

      Great question. We clearly overlooked this point, but we do have the strain and will need to measure this.

      E) Quantification of the cytology should be included.

      We have now quantified the overlap between XND-1 and HIM-17.

      Figure 2: C) Statistics should be included.

      Done

      E) Quantification should be included for the cytology. I recommend changing the eaIs15 to HIM-5.

      We included better images showing whole gonads instead of one or two nuclei. We were not sure what the reviewers want us to quantify here since the relocalization of the protein to the cytoplasm is very clear.

      I have a general issue with the use of the term epistasis - this is used to order gene function based on different mutant phenotypes, usually with null alleles. While I think the authors have valid points with how they group the different SPO-11 accessory proteins, I do not think they should use the word epistasis, but rather genetic interactions.

      We appreciate the reviewer's thoughts on this matter and have removed the term epistasis; we now use functional groups or genetic interactions throughout the text.

      Figure 4 and the nature of the X chromosome: First, I think it would help the non-C. elegans reader to include a little more information on the X chromosome with respect to its differences compared to the autosomes. I also think that, if possible, it would be beneficial to include a model of the X in Figure 4.

      We have added more about X/autosome differences in the intro and during the discussion of HIM-5 function, and have added a figure showing the differences in the behavior of the X and the autosomes during DSB/crossover formation.

      Minor points:

      Abstract: Given the findings of Silva and Smolikove on SPO-11 breaks, I recommend removing "early" from line 28 in the Abstract.

      Done

      Introduction (line 93): I think "biochemical studies" is a stretch here - I recommend "interaction studies".

      Done

      Results: (lines 160-161): mutations are not required for breaks. Line 172, there is a problem with the sentence.

      Corrected

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      (1) Figure 1B- The signal for XND-1 seems to appear both in the control and him-17::HA IP. Have the authors tested the specificity of the XND-1 antibody?

      We included a supplementary figure demonstrating the specificity of the XND-1 antibody by Western blot. This was also previously published (Wagner et al., 2010).

      (2) Figure 1D - can the authors provide an explanation why the him-5p::him-5 transgene that drives a higher expression than pie-1p::him-5 fails to suppress the Him phenotype seen in him-17? What are the HIM-5 levels like in these two strains compared to N2 and him-17 null mutants? Can this information provide explanation for the differential effect of the him-5 transgenes?

      We previously reported that him-5p::him-5 drives higher expression than pie-1p::him-5 (McClendon et al, 2016).

      The reason that him-5p::him-5 does not rescue, despite its higher expression in wild type, is that HIM-17 directly regulates expression of him-5. Since HIM-17 does not regulate the pie-1 promoter, the pie-1p::him-5 construct can at least partially suppress the him-17 mutation.

      We have (hopefully) explained this better in the text.  

      (3) Line 102- the subheading "HIM-5 is the essential factor for meiotic breaks in the X chromosome" may not be appropriate for this section. This is what has previously been known. However, the results in Figure 1 demonstrate that a him-5 transgene can partially rescue the him-17 and xnd-1 phenotype, but not that it is essential for meiotic DSB formation on X chromosomes.

      We think some of the concern here is semantic and have changed the phraseology to say that HIM-5 is SUFFICIENT for DSBs on the X, which had not previously been shown.

      Vis-à-vis the X chromosome, in all genetic backgrounds examined, the absence of HIM-5 consistently results in a complete lack of DSBs on the X. For instance, in dsb-2 mutants, where HIM-5 is still expressed, DSBs are reduced genome-wide, but the X chromosome occasionally retains breaks. In contrast, even a weak allele of him-17 results specifically in the loss of X chromosome breaks, underscoring a unique requirement for HIM-5 in promoting DSBs on the X. While Figure 1 shows that a him-5 transgene can partially rescue him-17 and xnd-1 phenotypes, the consistent observation that X breaks are absent without HIM-5 supports its classification as sufficient for DSB formation on the X chromosome.

      (4) Figure 1E - please consider enlarging the images and showing multiple examples.

      Done.

      I also suggest that the authors perform a more rigorous analysis to support the conclusion that XND-1 and HIM-17 localize away from the axis by quantifying multiple images and doing line-scan analysis.

      Provided. New images are provided in both the main and supplemental figures, and quantification is included. There is no detectable overlap of the two proteins with one another or with the DNA axes (see quantification of overlap in Fig. 1).

      (5) Line 162 - This is the first mention of DSB-1, DSB-2, and DSB-3 in the paper. DSB-1 and DSB-2 are Rec114 homologs in C. elegans (Tesse et al., 2017), while DSB-3 is a homolog of Mei4 (Hinman et al., 2021). These proteins should be properly introduced in the introduction with appropriate citations.

      Done. We appreciate the reviewer pointing out that this was the first reference to these genes.

      (6) Line 169 - the rationale for this experiment is unclear. Why did the Y2H interaction between HIM-5 and DSB-1 prompt the authors to test the rescue of dsb-1 or dsb-2 phenotypes by the ectopic expression of him-5? Do the authors have evidence that HIM-5 level is reduced in dsb-1 or dsb-2 mutants?

      We have reorganized this section to better explain the motivation for looking at these interactions. We did see a difference in the localization of HIM-5 in the dsb-1 mutant animals, and we did have a sense that HIM-5 was critical for breaks on the X. We reasoned that it could have independent functions in promoting breaks that were not yet appreciated, so we wanted to do this experiment.

      (7) Line 172 - "very slightly reduced". This claim requires statistical analysis.

      We added statistical analysis, but we also removed this claim.

      (8) Figures 2C and 2D - Can the authors provide an explanation why the pie-1p::him-5 transgene fails to suppress the phenotypes in dsb-1, while the him-5p::him-5 transgene can? Again, the rationale for these experiments is unclear. Because of this, the interpretation is also unclear.

      The difference in rescue between the pie-1p::him-5 and him-5p::him-5 transgenes likely reflects differences in expression levels. As previously shown (McClendon et al., 2016), the him-5p::him-5 construct results in significantly higher expression of HIM-5 protein compared to pie-1p::him-5. This elevated expression likely explains its ability to partially rescue the dsb-1 phenotype. In contrast, the lower expression driven by the pie-1 promoter is insufficient to compensate for the absence of dsb-1 function. We have clarified the rationale and interpretation of these experiments in the revised manuscript to better reflect this point.

      (9) Lines 184-185 - the data for endogenously tagged HIM-5::3xHA are not shown anywhere in the paper. This must be shown.

      We have added this in the supplemental figures.

      (10) Figure 2D and 2E - what does the localization of pie-1p::him-5::GFP (eaIs15) and him-5p::him-5::GFP (eaIs4) look like in wild-type and dsb-1 mutants? Are the cytoplasmic aggregates caused by increased levels of HIM-5 expression? Can the differential behavior of him-5 transgenes provide explanation for differential rescues?

      We now show both live and fixed images of Phim-5::him-5::gfp transgenes, as well as the localization of the endogenously HA-tagged HIM-5 locus (Figure 2 and S3). In all cases, the protein is initially nuclear and then absent from meiotic nuclei with similar timing. The Ppie-1::him-5 transgene was very difficult to image due to low expression (even in wild type), so it is not shown here. We presume it is the slightly elevated level of expression of the Phim-5::him-5::gfp that can explain the differential rescue.

      (11) Lines 221-222, where are the results shown? Please refer to Figure S3.

      Done

      (12) Figure S3 - these need statistical analyses.

      Done

      (13) Lines 230-231 - what about the rec-1; parg-1; cep-1 triple mutant?

      This is an excellent suggestion and one we have not yet pursued. Given the lack of strong phenotypes in all combinations of double mutants, we prioritized other experiments. However, we agree that examining the rec-1; parg-1; cep-1 triple mutant would provide a valuable test of whether these factors act in the same pathway, and we appreciate the reviewer highlighting this potential future direction.

      (14) Line 298 - I suggest the authors take a look at the Alphafold prediction of DSB-1/DSB-2/DSB-3 and the comparison to human and budding yeast Rec114/Mei4 complex in Guo et al., 2022 eLife, which could provide insights into the Y2H results.

      We thank the reviewer for these comments and have indeed used these interactions and predicted homologies to zero in on a region of interaction between these proteins that resembles what is seen in humans and yeast, in which a dimer of REC114-like proteins wraps around and stabilizes a central Mei4 helix. This is now shown in Figure 3H, I. Satisfyingly, this modeling predicts that a trimer comprised of 2 DSB-1 proteins with DSB-3 is more stable than a DSB-1-DSB-2-DSB-3 trimer. This might explain why DSB-2 is not required in young adults and only becomes essential as DSB-1 levels drop in older animals (Rosu et al., 2013).

      (15) Can the authors introduce mutations within the DSB-1 interfaces that disrupt the interaction to either SPO-11 or DSB-2?

      We have begun to address this question by introducing targeted mutations within DSB-1. As shown in Figure 3E and 3F, mutations in the C-terminal region of DSB-1, which includes a core of four α-helices, disrupt its interaction with DSB-2 and DSB-3, but not with SPO-11. These findings suggest that the C-terminus mediates interactions specifically with DSB-2 and DSB-3.

      (16) Line 323 - The him-5 phenotypes are too weak to support the idea that it serves as the linchpin for the whole DSB complex. Do the authors have an explanation for why him-5 mutants exhibit X-chromosome-specific DSB defects?

      In response to the reviewer, above, and in the text, we have included a more detailed explanation of why we think HIM-5 has a key role in coordinating meiotic break formation. Although it was identified for its role on the X, the phenotypes associated with DSB formation in the mutant are really quite pleiotropic and severe.

      (17) Line 436 - C. elegans lacks DSB hotspots.

      Removed

      Minor comments:

      (1) Figure 1A - please show the raw data for the yeast two-hybrid.

      We show representative yeast colonies in Figure S3.

      (2) It looks like the labeling for Figure 1B and 1C are switched.

      Fixed.

      (3) Figure 1B - what does the red box indicate? Please explain it in the legend.

      It indicates the XND-1 band. We added that information in the legend.

      (4) Figure 1C - in the legend, it was noted that the results are from GFP pulldowns of HIM-17::GFP. However, the method for Figure 1B and the method section noted that HIM-17 was tagged with 3xHA, and the pull-down was performed using anti-HA affinity matrix. Please reconcile this discrepancy.

      That’s because they were done in two different sets of experiments. For the IPs we used a HIM-17::HA strain and for the MS, a HIM-17::GFP strain.

      (5) Also in Figure 1C - please call Table S2 in the main text when discussing the mass spec results. Also, it is not clear what HIM-17 and GFP indicate in the table. What makes CKU80 different from the other proteins listed under GFP? Please explain more clearly in the legend.

      We have moved the table to the supplemental data, where we have included all of the peptide counts and gene coverage. We have included in the revised Methods the rationale for inclusion in this table, which explains why CKU-80 differs.

      (6) Line 527 - it is unclear what experiment was done for HIM-17. Please revise it to indicate that this is for "HIM-17 immunoprecipitation". Also please indicate the strain used for HIM-17 pull-down (AV280?).

      (7) Line 113- please be specific about how the HIM-17 IP was performed. Which epitope and strains are used for pull-downs?

      This indeed was AV280. This has been added to the text and methods.

      (8) Figure 1D- What does ND mean? In the text, it was stated that there was only a minor suppression of hatching rates. The hatching rate for him-5p::him-5; him-17 must have been measured, and the data must be presented.

      ND does mean not determined. We have removed the statement about “minor suppression”. We only tested the overall population dynamics in the Phim-5::him-5;him-17(ok424) strain and the DAPI body counts. The failure to suppress the latter suggests there would be no effect on hatching rates, although we did not test this directly. Since we had done this for the Ppie-1::him-5;him-17 strain, we provided this information to further support the claims of genetic rescue by ectopic expression.

      (9) Line 151 - please specify that STED was used.

      We have removed the STED images, and just show the confocal images with Lightning Processing.

      (10) Figure 1E- the authors suggested that HIM-17 and XND-1 mainly localize to autosomes but not the X chromosome. However, there is not enough evidence that the chromosome excluded from HIM-17 staining is indeed an X chromosome.

      (11) Figure 1E (Line 154) - what are the active chromatin markers examined? Where are the data?

      We have previously shown that the chromosome lacking XND-1 staining is the X (Wagner et al., 2010). The X is heterochromatic and chromatin marks associated with active transcription, including H3K4me3 and HTZ-1 (a variant H2A), preferentially localize to autosomes, effectively anti-marking the X chromosome. As shown in the new Figure 1E, a single chromosome has very little XND-1 and HIM-17 associated proteins. This is the X chromosome.

      (12) Line 172 - It should be a comma instead of the period after "In dsb-1 mutants".

      Fixed

      (13) Figure S3H-K - I suggest the authors indicate the alleles of mre-11 (null vs. iow1) on the graph, similarly to him-5(e1490) to avoid confusion.

      Done

      (14) Lines 294 and 600 - Guo et al. 2022 is now published in eLife. The authors must cite the published paper, not the preprint.

      Fixed

      (15) Line 407 - the reference Carelli et al., 2022 is missing.

      Added

      (16) Line 766 - please remove "is" before nuclear.

      Done

      Reviewer #3 (Recommendations For The Authors):

      Major issues:

      In my view, the most interesting mechanistic finding in the paper is the evidence that HIM-5 may not bind to chromatin in the absence of DSB-1. If validated, this would suggest that HIM-5 is likely to be directly involved in a process that promotes break formation, in contrast to factors such as HIM-17 and XND-1. It does not, however, support the idea that HIM-5 is at the top of a hierarchy of DSB factors, as it is interpreted here. More importantly, the data supporting this claim are unconvincing; only a single image of an unfixed gonad from an animal expressing HIM-5::GFP is shown. Immunofluorescence should be performed and the results must be quantified.

      We have provided additional images of the HIM-5 relocalization to show that we observed this in both fixed and live worms with two different tagged strains. The exclusion from the nucleus is seen in all scenarios. Whether the protein now accumulates exclusively in the cytoplasm or is destabilized is challenging to address with the fixed images due to the arbitrariness of defining “background” staining.

      More generally, this type of analysis, looking at the interdependence of different factors for their association with chromosomes, is much more informative than the genetic interaction data presented in the paper, which does not seem to provide any mechanistic insights into the functions of the factors analyzed. The paper could potentially be greatly improved through a more extensive, systematic analysis of the interdependence of DSB-promoting factors for their localization to chromosomes.

      We have at least added this for XND-1 and HIM-17 and show they are not interdependent for chromosome association. We also provide for the first time data on the localization of HIM-5 in the dsb-1 mutant. Many of the other interactions have already been shown in the literature and/or were not warranted based on the lack of genetic interaction we present here.

      Minor issues:

      The title is vague and inconclusive. A more concrete title summarizing the major findings would help readers to assess whether the work is of interest.

      We have discussed the title extensively with all authors and all would like to keep the current title.

      The authors claim that the expression of HIM-5 from a different promoter (Ppie-1::him-5) but not its endogenous promoter (Phim-5::him-5) can partially rescue the DSB defect in him-17 mutants. To support this claim, they should really quantify the germline expression of HIM-5 in wild-type, him-17, him-17; Ppie-1::him-5, and Phim-5::him-5; him-17.

      We had previously reported the expression of both transgenes in the N2 background (McClendon et al., 2016).

      Panel O appears to be missing from Figure S3.

      Fixed

      The evidence for chromosome fusions in cep-1; mre-11 mutants shown in S4D is not convincing and the claim should be removed unless stronger evidence can be obtained.

      A clearer image has been added

      The basis of the following statement is unclear: "Furthermore, rec-1;him-5 double mutants give an age-dependent severe loss of DSBs (like dsb-2 mutants) suggesting that the ancestral function of the protein may have a more profound effect on break formation." The manuscript does not seem to include data regarding age-dependent loss of DSBs and no other publication is cited to support this claim. The interpretation is also perplexing; I think that it may be predicated on the idea that REC-1 and HIM-5 are paralogs, but as stated above, this claim is not well supported and is likely specious.

      We have added the reference. This was shown in Chung et al., 2015, the paper that presented the cloning of the rec-1 locus.

  3. Sep 2025
    1. Author response:

      Joint Public Review

      This manuscript puts forward the provocative idea that a posttranslational feedback loop regulates daily and ultradian rhythms in neuronal excitability. The authors used in vivo long-term tip recordings of the long trichoid sensilla of male hawkmoths to analyze spontaneous spiking activity indicative of the ORNs' endogenous membrane potential oscillations. This firing pattern was disrupted by pharmacological blockade of the Orco receptor. They then use these recordings together with computational modeling to predict that Orco receptor neuron (ORN) activity is required for circadian, not ultradian, firing patterns. Orco did not show a circadian expression pattern in a qPCR experiment, and its conductance was proposed to be regulated by cyclic nucleotide levels. This evidence led the authors to conclude that a post-translational feedback loop (PTFL) clockwork, associated with the ORN plasma membrane, allows for temporal control of pheromone detection via the generation of multi-scale endogenous membrane potential oscillations. The findings will interest researchers in neurophysiology, circadian rhythms, and sensory biology. However, the manuscript has limited experimental evidence to support its central hypothesis and is undermined by several questionable assumptions that underlie their data analysis and model builds, as well as insufficient biological data, including critical controls to validate and/or fully justify the model the authors are proposing.

      We thank the reviewers for their thorough and thoughtful comments and believe that the manuscript will be much stronger once we incorporate the requested changes.

      Please note that we used ORN as acronym for “olfactory receptor neuron” throughout the manuscript. ORNs contain odorant receptors (ORs), and in insects these ORs have to associate with the olfactory receptor co-receptor (Orco) in the cilium of the neuron to form functional OR-Orco complexes for odorant detection. Besides this chaperone function, Orco can form homomers with the potential to act as ionic pacemaker channels; a role which we explore in this study.

      Strengths:

      The study is notable for its combination of long-term in vivo tip recordings with computational modeling, which is technically challenging and adds weight to the authors' claims. The link between Orco, cyclic nucleotides, and circadian regulation is potentially important for sensory neuroscience, and the modeling framework itself - a stochastic Hodgkin-Huxley formulation that explicitly incorporates channel noise - is a solid and forward-looking contribution. Together, these elements make the study conceptually bold and of clear interest to circadian and olfactory biologists.

      Major weaknesses:

      At the same time, several limitations temper the conclusions. The pharmacological evidence relies on a single antagonist and concentration, without key controls. The circadian analysis is based on relatively small numbers of neurons, with rhythms detected only in subsets, and the alignment procedure used in constant darkness raises concerns of bias. The molecular evidence is sparse, with only three qPCR timepoints, and the model, while creative, rests on assumptions that are not yet fully supported by in vivo data.

      Please see our responses to the detailed comments.

      Detailed comments are provided below:

      (1) The role for Orco proposed in the authors' model largely stems from the effects seen following the administration of (a single dose) of the Orco antagonist, OLC15. However, this hypothesis is undercut by the lack of adequate pharmacological controls, including a basic multipoint OLC15 dose-response series in addition to the administration of blockers for the other channels that are embedded in their model, but which were ruled out as being involved in the modulation of biological rhythms. In addition, these studies would (ideally) also benefit from the inclusion of the same concentration (series) of an inactive OLC15 analog to better control for off-target effects.

      The Orco agonist VUAA1 (Jones et al., 2011) binds directly to Orco and increases the channel open time probability. In M. sexta hawkmoths, we have already published that VUAA1 increases the low spontaneous activity of ORNs in a dose-dependent fashion (Nolte et al., 2016). Chen and Luetje (2012) systematically varied the chemical structure of VUAA1 to identify new Orco ligands and discovered 22 Orco Ligand Candidates (OLC) that either activated or inhibited Orco. In their heterologous expression system, Orco was most sensitive to inhibition by OLC15. Based on these results, we published a dose-response curve of OLC15 inhibition (1-100 µM) using in vivo tip recordings of pheromone-sensitive long trichoid sensilla of M. sexta (Nolte et al., 2016). In that study, we could also demonstrate that OLC15 antagonizes the VUAA1 activation of Orco.

      Furthermore, we tested other published Orco antagonists in in vivo assays in intact hawkmoths, focusing on amiloride-derived antagonists, because we previously identified an amiloride-sensitive cation channel in hawkmoth ORNs. We found that, in contrast to OLC15, the amilorides HMA and MIA were not Orco-specific but instead affected different targets depending on time of day (Nolte et al., 2016). Based on those experiments and the dose-response curves, we determined that the Orco agonist VUAA1 (Jones et al., 2011) and the Orco antagonist OLC15 (Chen and Luetje, 2012) worked best in hawkmoth ORNs to target Orco pharmacologically. Following these comparative tests with other published Orco antagonists, we have since used a dose of 50 µM OLC15 in all further experiments.

      We will clarify the Methods section accordingly.

      (2) The expression pattern of Orco was assessed using qPCR at only three timepoints. Rhythmic transcripts can easily be missed with such sparse sampling (Hughes et al., 2017). A minimum of six evenly spaced timepoints across a 24-hour cycle would be required to confidently rule out circadian transcriptional regulation. In addition, the use of the timeless mRNA control from another study is not acceptable. Furthermore, qPCR analysis measures transcript abundance, not transcription, as the authors repeatedly state. Transcriptional studies would require nuclear run-off or, more recently, can be done with snRNAseq analysis. Taken together, these concerns undermine the authors' desire to rule out TTFL-based control that directly led them to implicate a PTTF-based model.

      We agree with the referees that more time points and a direct comparison between timeless and Orco mRNA levels should be included in this manuscript. We will include these additional qPCR experiments and edit the manuscript to make clear that we measure transcript abundance, but we will not perform snRNAseq analysis due to time and financial constraints. We are currently working on the transcriptional control of Orco, both during ontogeny and throughout the day, but this work in progress is beyond the scope of this manuscript.

      (3) The modelling presented is based on Orco as a ZT-dependent conductance tied to the cAMP oscillations that were reported by this group in the cockroach and from the presence and functionality in Manduca of homomeric Orco complexes that are devoid of tuning ORs. While these complexes have been generated in cell culture and other heterologous expression systems, as well as presumably exist in vivo in the Drosophila empty neuron and other tuning OR mutants, there is no evidence that these complexes exist in wild-type Manduca ORNs. While this doesn't necessarily undermine every aspect of their models, the authors should note the presence of Orco/OR complexes rather than Orco homomeric complexes.

      Our ELISAs found circadian oscillations in cAMP levels not only in antennae of the Madeira cockroach (Schendzielorz et al., 2014, 2012), but also in hawkmoth antennae (Schendzielorz et al., 2015). We will add the 2015 citation to the Modeling chapter in the Methods section to clarify this.

      We agree with the referees that we cannot distinguish between Orco homo- and heteromers in the different compartments of our hawkmoth ORNs. Thus, as the referee suggests, we will add text regarding the presence and localization of OR-Orco heteromers. However, we have indications that Orco homomers could indeed be present in the hawkmoth ORNs. In a heterologous expression system, MsexOrco expression alone was sufficient to increase intracellular Ca<sup>2+</sup> levels in response to VUAA1 application (Nolte et al., 2013). In differentiating primary cell cultures of hawkmoth antennae, Orco expression started during a developmental time window where ORNs did not yet express pheromone receptors, and Orco affected spontaneous activity (Nolte et al., 2016). Thus, Orco homomers are present in developing hawkmoth ORNs during a time window where ORNs already express spontaneous activity but cannot heteromerize with pheromone receptors. However, we do not know whether and in what ratio homo- and heteromers of Orco and ORs are present in the respective sensillum compartments of adult hawkmoths (Nolte et al., 2013; Stengl, 1994; Stengl and Hildebrand, 1990).

      We will clarify our manuscript accordingly.

      (4) Some aspects of the authors' models, most notably the decision to phase align/optimize their DD and OLC15 recordings, are likely to bias their interpretations.

      There is consensus that insects display daily and circadian rhythms in pheromone-dependent mating, odor-gated feeding, and egg-laying behavior that phase-lock to environmental rhythms, corresponding with daily/circadian rhythms of sensory neuron physiology (e.g., Merlin et al., 2007; Rymer et al., 2007; Schendzielorz et al., 2015, 2012). However, circadian rhythms can be easily masked by stress, such as the disturbances during a very challenging long-term recording experiment over several days. In addition, we observed in our animal raising facility that in LD 17:7 light-dark cycles the originally nocturnal hawkmoths M. sexta distribute their activity patterns over the course of the day, so that we find nocturnal as well as diurnal individuals. Thus, light-dark cycles were not enough to ensure phase-synchronized behavioral rhythms, and it is very likely that the nocturnal hawkmoths rely heavily on pheromone/odor-dependent synchronization, as also found in other moth species (Ghosh et al., 2024). Here, we used isolated males that were never exposed to the female pheromones, so that their circadian activity patterns readily disperse. Therefore, it became necessary in free-running conditions to first determine the respective behavioral rhythm for each animal, and then to phase-align their activity patterns to allow for statistical analysis. Otherwise, circadian differences would average out in a free-running population. As requested by the referees in point (7), we will use additional tests for rhythmicity in each of our recordings and revise the manuscript accordingly.

      Assuming that hawkmoths need pheromone presence as additional Zeitgeber, we are currently working on a new set of experiments where we attempt to improve synchronization by exposure to LD cycles and pheromone before DD and OLC15 recordings. We will add these experiments to the manuscript.
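      The averaging argument in the response to point (4) can be illustrated with a toy calculation. The sketch below is not the authors' analysis pipeline: the peak hours, the unit-amplitude cosine traces, and the helper names are all hypothetical, and the circular rotation used for alignment only works because the synthetic traces are perfectly periodic. It shows that a population whose peaks are spread evenly over the day averages to a flat line, whereas the same traces aligned to their own peaks retain the full amplitude. Note that the same alignment step applied to arrhythmic noise would also line up random maxima, which is the bias the referees raise.

```python
import math

def cosine_trace(peak_h, period_h=24, n_hours=72):
    """Hourly samples of a unit-amplitude cosine that peaks at peak_h."""
    return [math.cos(2 * math.pi * (t - peak_h) / period_h) for t in range(n_hours)]

def align_to_peak(trace, period_h=24):
    """Rotate a trace so its maximum within the first cycle lands at index 0.

    Circular rotation is only valid here because the toy traces are
    exactly periodic; real recordings would need a different shift.
    """
    first_cycle = trace[:period_h]
    shift = first_cycle.index(max(first_cycle))
    return trace[shift:] + trace[:shift]

def mean_trace(traces):
    """Sample-by-sample average across animals."""
    return [sum(samples) / len(samples) for samples in zip(*traces)]

def peak_to_trough(trace):
    return max(trace) - min(trace)

# Hypothetical peak times (hours) of six free-running animals.
peak_hours = [0, 4, 8, 12, 16, 20]
raw = [cosine_trace(p) for p in peak_hours]
aligned = [align_to_peak(trace) for trace in raw]

raw_amp = peak_to_trough(mean_trace(raw))          # rhythms cancel out
aligned_amp = peak_to_trough(mean_trace(aligned))  # rhythm is preserved
print(f"unaligned mean: {raw_amp:.3f}, aligned mean: {aligned_amp:.3f}")
```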

      (5) The tip recordings from long trichoid sensilla are critical aspects of this study. These recordings were carried out on upper sensillar tips located on the distal-most second annulus. Since there are approximately 80 annuli on the Manduca antennae, it is unclear whether the recordings are representative of the antennal response.

      We think the reviewers might have misinterpreted our description of the recording site. In the Methods, we state that we clip off the 20 most distal annuli (leaving a stump of about 60 annuli) and insert the reference electrode into the flagellum up to the second annulus from the cut end, i.e., the recording site is located at 2/3 – 3/4 of the antenna length as seen from the head of the animal. We will make this clearer in the Methods section.

      In addition, our lab showed with antibody stainings against Orco that apparently all ORNs that innervate long and short trichoid sensilla along the whole flagellum express the same staining pattern (Nolte et al., 2016). Furthermore, our patch clamp recordings of primary cell cultures of whole male antennae found largely overlapping ion channel populations across ORNs. This would indicate that all ORNs, whether they express pheromone- or general odorant receptors, could potentially share the same Orco-dependent spontaneous activity rhythms. In our lab, different experimenters who recorded in different years from long trichoid sensilla on different annuli did not detect obvious differences in either the spontaneous activity or the pheromone responses (c.f., Dolzer et al., 2003; Gawalek and Stengl, 2018; Schneider et al., 2025). Thus, it is very likely that we are reporting a general encoding mechanism that is not locally restricted along the antennal flagellum.

      (5.1) The authors do not provide any data in support of their cAMP/cGMP-based Orco gating…

      There are publications supporting cyclic nucleotide gating of Orco in Drosophila, but only after previous phosphorylation via protein kinase C (PKC; review: (Wicher and Miazzi, 2021)). Since Orco is very conserved among insect species, it is likely that these PKC and cGMP/cAMP-dependent regulations are present in other insect species. We are currently running thorough tip-recording experiments on the regulation of Orco gating, which are beyond the scope of this manuscript. However, we will add a set of experiments to this manuscript that demonstrates cAMP gating of Orco.

      (5.2)… and the PTTF model proposed is somewhat disappointing.

      For a detailed introduction of our PTFL membrane clock hypothesis please see our opinion paper (Stengl and Schneider, 2024).

      (5.3) The model seems to be influenced by their long-held proposal that insect olfactory signaling has a critical metabotropic component involving cyclic nucleotides, PKC, etc, a view that may be influenced by the use of Orco homomeric complexes generated in HEK cells.

      Indeed, we propose a metabotropic pheromone-transduction cascade, which in moths and cockroaches is based on G-protein-mediated activation of phospholipase C but not on adenylyl cyclase activation. Our hypothesis is not influenced by HEK cell heterologous expression studies of Orco but is supported by our own work comparing in vivo tip recordings of intact hawkmoths with patch clamp experiments on hawkmoth primary cell cultures of olfactory receptor neurons, which are able to respond to their species-specific pheromones in vitro (Schneider et al., 2025; Stengl, 2010; Stengl and Funk, 2013; Wicher and Miazzi, 2021). In addition, a multitude of publications by other laboratories with in vivo and in vitro studies using physiological, genetic, and immunocytochemical assays all support a metabotropic signal transduction cascade in insect olfaction (reviews: Stengl, 2010; Stengl and Funk, 2013; Wicher and Miazzi, 2021). In contrast, the hypothesis suggesting a solely ionotropic pheromone- and general odor-dependent transduction cascade for all insect species is based on very sparse experimental evidence, drawn primarily from heterologous expression systems such as HEK cells that lack the insect’s WT molecular surroundings and thus cannot predict OR-Orco function in vivo. Furthermore, the ionotropic hypothesis is heavily based upon the argument that an inverse 7TM receptor cannot couple to G-proteins, which lacks careful backup via biochemical and structural studies. In addition, the ionotropic hypothesis lacks support via carefully performed physiological in vivo studies in different insect species that paid attention to analysis of the distinct kinetic components of ORNs’ odor/pheromone responses and that employ physiological concentrations and durations of odor/pheromone stimuli (please see our most recent publication by Schneider et al. (2025)).

      (5.4) Nevertheless, structural studies on Orco do not support a cyclic nucleotide binding site, although PKC-based phosphorylation has been implicated in the fine-tuning/adaptation of olfactory signaling.

      While structural studies did not find evidence for conserved known cyclic nucleotide binding sites on Orco, this does not exclude the presence of so far unknown binding sites, or of sites that are exposed only after a specific sequence of phosphorylations at the many phosphorylation sites on Orco. Indeed, physiological studies in Drosophila presented evidence for cyclic nucleotide dependence of Orco after previous PKC-dependent phosphorylation (Getahun et al., 2013). Our ongoing in vivo experiments in hawkmoths further corroborate a PKC- and cAMP-dependent modulation of Orco. These studies will be published in a follow-up publication.

      (6) Because only 5/11 LD and 7/10 DD animals showed daily rhythms, with averages lacking clear daily modulation, the methods are not sufficiently reliable to reveal novel underlying mechanisms of circadian rhythm generation. The reported results are therefore not yet reliable or quantifiable. To quantify their results, the authors should apply tests for circadian rhythmicity using methods such as RAIN, JTK_CYCLE, MetaCycle, or Echo. The use of FFT and Wavelet is applauded, but these methods do not have tests of significance for rhythms and can be biased when analyzing data in which there could only be 1-3 circadian cycles. Because the conclusions appear to be based on 11-12 neurons that were recorded for 2-4 days, the reader is concerned that the methods are not yet perfected to provide strong evidence for circadian regulation of spontaneous firing of ORNs. The average data (e.g., Figure 3Bii and 3Cii) highlight the apparent lack of daily rhythms. In summary, the results would be more compelling if more than 50% of the recordings had significant circadian amplitudes and with similar periods and phases.

      The long-term tip-recordings of intact hawkmoths are very challenging and take a very long time to accomplish; thus, we are very happy that we succeeded in obtaining so many of them (N=34). Since 5/11 LD recordings and 7/10 DD recordings revealed daily/circadian rhythmicity, and since many other physiological recordings at different ZTs by different members of our laboratory all revealed ZT-dependent pheromone transduction, we can be certain that the physiology of hawkmoth antennae is under strict circadian control. Please see also our response to (4) above regarding the phase dispersal of activity rhythms observed in our experiments, as well as in the behavior of hawkmoth males in the mating cage.

      Nevertheless, we will follow the advice of the referees to apply additional tests for significance of rhythms in spontaneous activity, and we are thankful for the tests suggested that we were not aware of.
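      As a rough illustration of what a significance test for rhythmicity involves, the sketch below fits a 24 h cosine to a time series and estimates a p-value by permutation. It is a generic cosinor-style stand-in and not an implementation of RAIN, JTK_CYCLE, MetaCycle, or Echo; the function names, the synthetic firing rates, and the noise level are hypothetical. It assumes evenly spaced samples spanning whole cycles, and it treats samples as exchangeable, which real spike-rate series with autocorrelation generally are not.

```python
import math
import random

def cosinor_amplitude(values, times_h, period_h=24.0):
    """Amplitude of the least-squares cosine fit at a fixed period.

    Assumes evenly spaced samples covering whole cycles, so the cosine
    and sine regressors are orthogonal and the fit has a closed form.
    """
    n = len(values)
    w = 2 * math.pi / period_h
    a = 2 / n * sum(v * math.cos(w * t) for v, t in zip(values, times_h))
    b = 2 / n * sum(v * math.sin(w * t) for v, t in zip(values, times_h))
    return math.hypot(a, b)

def permutation_p(values, times_h, n_perm=2000, seed=0):
    """Fraction of shuffled series whose fitted amplitude reaches the observed one."""
    rng = random.Random(seed)
    observed = cosinor_amplitude(values, times_h)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cosinor_amplitude(shuffled, times_h) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic hourly data over two days (hypothetical firing rates in Hz).
times = list(range(48))
noise = random.Random(1)
w = 2 * math.pi / 24.0
clean = [5 + 3 * math.cos(w * (t - 6)) for t in times]
rhythmic = [10 + 3 * math.cos(w * (t - 6)) + noise.gauss(0, 1) for t in times]
arrhythmic = [10 + noise.gauss(0, 1) for t in times]

clean_amp = cosinor_amplitude(clean, times)
p_rhythmic = permutation_p(rhythmic, times)
p_arrhythmic = permutation_p(arrhythmic, times)
print(f"amplitude (noise-free): {clean_amp:.2f}")
print(f"p (rhythmic): {p_rhythmic:.4f}, p (arrhythmic): {p_arrhythmic:.4f}")
```

      Applying such a test per recording, rather than to the population average, would give a count of significantly rhythmic neurons that does not depend on phase alignment.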

      (7) The statement that circadian patterns of ORN firing are lost with the Orco antagonist (OLC15) is not strongly supported. The manuscript should be revised to quantify how Orco changed circadian amplitude in the 12 recorded neurons. Measures of circadian amplitude can avoid confusing/vague statements like Line 394 “low and high frequency bands appeared to merge during the activity phase around ZT 0 in the animals that showed clear circadian rhythms (N = 5 of 11 in LD)”. The conclusion that Orco blocks circadian firing appears to be contradicted by Figure 6, which indicates that ~6 of these neurons had circadian periods detected by wavelet. The manuscript would be strengthened with details about the specificity and reproducibility of the Orco antagonist. The authors quantify the gradual decrease in firing with the slope of a linear fit to estimate how the “effectiveness [of OLC15] increased over time.” They conclude that the drug “obliterated circadian rhythms and attenuated the spontaneous activity in several, but not all experiments (N = 8 of 12).” The report would be greatly strengthened with corroborating data from additional Orco antagonists and additional doses of OLC15 (the authors use only 50 µM OLC15).

      We will revise our data analysis, according to the valuable suggestions of the referees.

      However, based upon our previous studies with other Orco antagonists and different doses of OLC15 (Nolte et al., 2016) we found that 50 µM OLC15 is the best Orco antagonist dose in M. sexta to target Orco-dependent modulation of spontaneous action potential activity of hawkmoth olfactory receptor neurons. Please see also our response to (1).

      (8) The manuscript includes several statements that are more speculation than conclusion. For example, there is no evidence for tuning or plasticity in this report. Statements like the following should be removed or addressed with experiments that show changes in odor response specificity or sensitivity: "ORN signalosomes are highly plastic endogenous PTFL clocks comprising receptors for circadian and ultradian Zeitgebers that allow to tune into internal physiological and external environmental rhythms as basis for active sensing." (Discussion Line 622). The paper concludes that (line 380) "mean frequency of spontaneous spiking and the frequency of bursting expressed daily modulation, and are both most likely controlled via a circadian clock that targets the leak channel Orco." This is too bold given the available results.

      We will revise the discussion accordingly and clarify which statements are supported via published evidence and which are predictions based upon our novel hypothesis published in our opinion paper (Stengl and Schneider, 2024).

      (9.1) Because Orco conductance is modulated by cyclic nucleotides, it remains highly plausible that circadian regulation occurs upstream at the level of signaling pathways (e.g., calcium, calcium-binding proteins, GPCRs, cyclases, phosphodiesterases).

      We agree with the referees that it is very likely that there are multiple layers of interconnected feedback cycles that control Orco localization and activity. Our novel hypothesis suggests interlocked TTFL and PTFL control of physiological circadian rhythms, not strictly hierarchical TTFL control, which would require a daily turnover of membrane proteins and transcriptional control via the established TTFL clock in insect ORNs. We are currently searching for TTFL control at all levels of odor/pheromone transduction using ZT-dependent transcriptomics in combination with qPCR and single-nucleus transcriptomics, including all the molecules suggested by the referees. These studies are ongoing, are very time- and money-consuming, and are beyond the scope of this manuscript.

      (9.2) The possibility that circadian oscillations of cyclic nucleotides are generated by the canonical TTFL mechanism has not been excluded. In fact, extensive work in Drosophila has demonstrated that the TTFL-based molecular clock proteins are required for circadian rhythms in olfaction.

      Our experiments that test circadian TTFL control at different levels of the cAMP transduction cascade in hawkmoth antennae are under way and are part of another publication. We will revise our discussion accordingly.

      The published experiments on TTFL-dependent control of Drosophila olfaction that we are aware of (Krishnan et al., 1999; Tanoue et al., 2004) do not exclude interlinked PTFL and TTFL clocks. Krishnan et al. (1999) demonstrate that the TTFL clock in antennal olfactory receptor neurons correlates with circadian rhythms in odor responses measured in electroantennogram (EAG) recordings, not in single sensillum recordings as in our experiments. EAG recordings comprise not only voltage responses of the olfactory sensory neurons but also voltage changes generated in non-neuronal antennal cells such as trichogen and tormogen cells that build the transepithelial potential gradient via vATPases that generates the high K<sup>+</sup> concentration in the sensillum lymph (Jain et al., 2024; Klein, 1992; Thurm and Küppers, 1980). In addition, EAG recordings most likely contain responses of afferent neurons originating from somata in the brain that maintain central control of the antennae. Thus, EAG recordings are difficult to interpret.

      (11) A defining feature of circadian oscillators is the feedback mechanism that generates a time delay (e.g., PERIOD/TIMELESS repressing their own transcription). While the authors describe how cyclic nucleotides can regulate Orco conductance, they do not provide a convincing explanation of how Orco activity could, in turn, feed back into the proposed PTFL to sustain oscillations. For these reasons, the authors should consider:

      a) Providing a broader discussion of non-TTFL models of circadian rhythms (e.g., redox cycles, post-translational modifications).

      We will revise the discussion accordingly.

      b) Reassessing Orco expression using a higher-resolution temporal sampling (≥6 timepoints per 24 h).

      We will add those experiments to the revised version of the manuscript (see our response to (2)).

      c) Clarifying or revising the PTFL model to explicitly address how feedback would be achieved. Alternatively, the data may be more consistent with Orco conductance rhythms being regulated by post-translational mechanisms downstream of the canonical TTFL oscillator, as suggested by the Drosophila olfactory system literature.

      We will revise the manuscript accordingly.

      Minor weaknesses:

      (1) The authors should compare the firing patterns of ORN neurons to the bursts, clusters, and packets of retinal efferent spikes reported in Liu JS and Passaglia CL (2011; JBR). By comparing measures in moths to measures in Limulus, the authors might be able to address the question: Is the daily firing pattern of ORN neurons likely a conserved feature of circadian control of sensory sensitivity?

      We will revise the discussion accordingly.

      (2) The methods need further details. For example, it is unclear if or how single neuron activity was discriminated and whether the results were compromised by the relatively large environmental fluctuations in temperature (21-27 °C), humidity (35-60%), or other cues known to modulate spontaneous firing.

      We will clarify the Methods section.

      References

      Chen S, Luetje CW. 2012. Identification of New Agonists and Antagonists of the Insect Odorant Receptor Co-Receptor Subunit. PLOS ONE 7:e36784. doi:10.1371/journal.pone.0036784

      Dolzer J, Fischer K, Stengl M. 2003. Adaptation in pheromone-sensitive trichoid sensilla of the hawkmoth Manduca sexta. J Exp Biol 206:1575–1588. doi:10.1242/jeb.00302

      Gawalek P, Stengl M. 2018. The Diacylglycerol Analogs OAG and DOG Differentially Affect Primary Events of Pheromone Transduction in the Hawkmoth Manduca sexta in a Zeitgebertime-Dependent Manner Apparently Targeting TRP Channels. Front Cell Neurosci 12:218. doi:10.3389/fncel.2018.00218

      Getahun MN, Olsson SB, Lavista-Llanos S, Hansson BS, Wicher D. 2013. Insect Odorant Response Sensitivity Is Tuned by Metabotropically Autoregulated Olfactory Receptors. PLOS ONE 8:e58889. doi:10.1371/journal.pone.0058889

      Ghosh S, Suray C, Bozzolan F, Palazzo A, Monsempès C, Lecouvreur F, Chatterjee A. 2024. Pheromone-mediated command from the female to male clock induces and synchronizes circadian rhythms of the moth Spodoptera littoralis. Curr Biol 34:1414-1425.e5. doi:10.1016/j.cub.2024.02.042

      Jain K, Prelic S, Hansson BS, Wicher D. 2024. Expression of Drosophila melanogaster V-ATPases in Olfactory Sensillum Support Cells. Insects 15:1016. doi:10.3390/insects15121016

      Jones PL, Pask GM, Rinker DC, Zwiebel LJ. 2011. Functional agonism of insect odorant receptor ion channels. Proc Natl Acad Sci 108:8821–8825. doi:10.1073/pnas.1102425108

      Klein U. 1992. The insect V-ATPase, a plasma membrane proton pump energizing secondary active transport: immunological evidence for the occurrence of a V-ATPase in insect ion-transporting epithelia. J Exp Biol 172:345–354. doi:10.1242/jeb.172.1.345

      Krishnan B, Dryer SE, Hardin PE. 1999. Circadian rhythms in olfactory responses of Drosophila melanogaster. Nature 400:375–378. doi:10.1038/22566

      Merlin C, Lucas P, Rochat D, François M-C, Maïbèche-Coisne M, Jacquin-Joly E. 2007. An Antennal Circadian Clock and Circadian Rhythms in Peripheral Pheromone Reception in the Moth Spodoptera littoralis. J Biol Rhythms 22:502–514. doi:10.1177/0748730407307737

      Nolte A, Funk NW, Mukunda L, Gawalek P, Werckenthin A, Hansson BS, Wicher D, Stengl M. 2013. In situ Tip-Recordings Found No Evidence for an Orco-Based Ionotropic Mechanism of Pheromone-Transduction in Manduca sexta. PLOS ONE 8:e62648. doi:10.1371/journal.pone.0062648

      Nolte A, Gawalek P, Koerte S, Wei H, Schumann R, Werckenthin A, Krieger J, Stengl M. 2016. No Evidence for Ionotropic Pheromone Transduction in the Hawkmoth Manduca sexta. PLOS ONE 11:e0166060. doi:10.1371/journal.pone.0166060

      Rymer J, Bauernfeind AL, Brown S, Page TL. 2007. Circadian rhythms in the mating behavior of the cockroach, Leucophaea maderae. J Biol Rhythms 22:43–57. doi:10.1177/0748730406295462

      Schendzielorz J, Schendzielorz T, Arendt A, Stengl M. 2014. Bimodal Oscillations of Cyclic Nucleotide Concentrations in the Circadian System of the Madeira Cockroach Rhyparobia maderae. J Biol Rhythms 29:318–331. doi:10.1177/0748730414546133

      Schendzielorz T, Peters W, Boekhoff I, Stengl M. 2012. Time of Day Changes in Cyclic Nucleotides Are Modified via Octopamine and Pheromone in Antennae of the Madeira Cockroach. J Biol Rhythms 27:388–397. doi:10.1177/0748730412456265

      Schendzielorz T, Schirmer K, Stolte P, Stengl M. 2015. Octopamine Regulates Antennal Sensory Neurons via Daytime-Dependent Changes in cAMP and IP3 Levels in the Hawkmoth Manduca sexta. PLOS ONE 10:e0121230. doi:10.1371/journal.pone.0121230

      Schneider AC, Schröder K, Chang Y, Nolte A, Gawalek P, Stengl M. 2025. Hawkmoth Pheromone Transduction Involves G-Protein–Dependent Phospholipase Cβ Signaling. eNeuro 12:ENEURO.0376-24.2024. doi:10.1523/ENEURO.0376-24.2024

      Stengl M. 2010. Pheromone Transduction in Moths. Front Cell Neurosci 4:133. doi:10.3389/fncel.2010.00133

      Stengl M. 1994. Inositol-trisphosphate-dependent calcium currents precede cation currents in insect olfactory receptor neurons in vitro. J Comp Physiol A 174:187–194. doi:10.1007/BF00193785

      Stengl M, Funk NW. 2013. The role of the coreceptor Orco in insect olfactory transduction. J Comp Physiol A 199:897–909. doi:10.1007/s00359-013-0837-3

      Stengl M, Hildebrand JG. 1990. Insect olfactory neurons in vitro: morphological and immunocytochemical characterization of male-specific antennal receptor cells from developing antennae of male Manduca sexta. J Neurosci 10:837–847. doi:10.1523/JNEUROSCI.10-03-00837.1990

      Stengl M, Schneider AC. 2024. Contribution of membrane-associated oscillators to biological timing at different timescales. Front Physiol 14:1243455. doi:10.3389/fphys.2023.1243455

      Tanoue S, Krishnan P, Krishnan B, Dryer SE, Hardin PE. 2004. Circadian Clocks in Antennal Neurons Are Necessary and Sufficient for Olfaction Rhythms in Drosophila. Curr Biol 14:638–649. doi:10.1016/j.cub.2004.04.009

      Thurm U, Küppers J. 1980. Epithelial physiology of insect sensilla. In: Locke M, Smith DS, editors. Insect Biology in the Future. Academic Press. pp. 735–763. doi:10.1016/B978-0-12-454340-9.50039-2

      Wicher D, Miazzi F. 2021. Functional properties of insect olfactory receptors: ionotropic receptors and odorant receptors. Cell Tissue Res 383:7–19. doi:10.1007/s00441-020-03363-x

    1. Botryllus schlosseri (Tunicata) is a colonial chordate that has long been studied for its multiple developmental pathways and regenerative abilities and its genetically determined allorecognition system based on a polymorphic locus that controls chimerism and cell parasitism. We present the first chromosome-level genome assembly from an isogenic colony of B. schlosseri clade A1 using a mix of long and short reads scaffolded using Hi-C. This haploid assembly spans 533 Mb, of which 96% are found in 16 chromosome-scale scaffolds. With a BUSCO completeness of 91.2%, this complete and contiguous B. schlosseri genome assembly provides a valuable genomic resource for the scientific community and lays the foundation for future investigations into the molecular mechanisms underlying coloniality, regeneration, histocompatibility, and the immune system in tunicates.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf097), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Cristian Canestro

      TO THE AUTHORS

      In this MS entitled 'First chromosome-level genome assembly of the colonial chordate model Botryllus schlosseri (Tunicata)', Olivier De Thier and colleagues report the first chromosome-scale assembly of this colonial ascidian species, paying special attention to differences with previously published assemblies and, importantly, between haplotypes. The MS is very well written, very easy and pleasant to read. It provides data of great quality that are very relevant not only for the ascidian/tunicate community, but also to the field of genome structural evolution. I firmly recommend it for publication, although I think that the authors could discuss it in deeper detail. In particular, I miss for instance a more elaborate discussion of what the results mean for our understanding of the similarities and differences between clades that have been published in recent years (I have not been able to find some relevant articles in this regard cited in the bibliography). I also feel that a deeper analysis of the differences between haplotypes could be very interesting, unless they are artifactual effects of the assemblies. As mentioned below, unless this is part of a longer story for a different MS beyond the scope of this one, I encourage the authors to validate some of the differences they find between haplotypes, to try to correlate the structural variations with differences in gene counts between haplotypes, and to explore whether these differences could be correlated with aspects of biological relevance. I miss, for instance, Venn diagrams with gene contents between previous assemblies and the haplotypes/haploid genome reported here. In any case, I firmly recommend this MS for publication, since most of my suggestions are not intended to question the results of the MS but to improve it, and I also understand that some may go beyond the scope of this MS.

      Minor points: Introduction Page 1: "the basic body plan of adult tunicates is highly conserved across the entire subphylum [3]". This sentence, which could be OK for ascidians, probably provides a highly simplified vision of Tunicate adult morphologies, especially when comparing the divergent morphologies of Thaliaceans and Appendicularians. Please elaborate on the sentence.

      To understand the comparisons between the data of this MS and previously reported genomes, it seems crucial to understand well the meaning of the "clades and subclades". Please include in the introduction (or where needed) how those clades are defined, what their origins and biological/geographical differences are, … and all the critical information that will especially help non-tunicate readers to understand the results.

      Results: The authors refer to the presence of large-scale genomic palindromes in Bs1 and Bs3. But it is unclear what are these structures. I suggest to please provide some more detailed explanation about the palindromic nature of these regions.

      The data of haplotype-resolved assemblies is very interesting. I wonder if it is possible to somehow measure the amount of heterozygosity between haplotype 1 and 2, and those versus the previous versions of the genome, to better understand intra- and inter-variation between subclades? The differences in the size of some regions between Colombera and this study, and even between haplotypes 1 and 2, are very interesting. I would find it more informative to merge the three graphs of Figure S9 into one single graph, so we can also easily compare the differences in size of the haplotypes with the haploid. If some of those differences are actually due to deletions, that would deserve further analysis. If this analysis is not part of another ongoing project that will be published somewhere else, I suggest identifying with a dot-plot some of those differences, especially between haplotypes, and validating with long reads crossing those regions whether some of the deletions are real or artifactual. Please include the dot-plot graph together with the two haplotypes in figure S10. In those cases that could be real, it would be very interesting to know which genes are gone, and whether those are placed somewhere else in the genome as a result of translocations, or whether those genes are actually gone and could explain some of the differences reported in the gene count between haplotypes.

      The authors mentioned the presence of multiple structural variations, although some of these could be artifacts of misassemblies. Interestingly, the plot of the synteny blocks between the two haplotypes in figure S11 shows some of those structural variations, including cases of: - deletions: for instance, there are "blank" regions in Bs1A and Bs3A with no lines, which may reflect areas that are not present in the haplotype B. - duplications and translocations within chromosomes or between chromosomes of different haplotypes. Just looking at this plot, I wonder how the distribution of chromosomes between haplotypes is done. For instance, I see that Bs7B shares a duplicated synteny block with chromosomes Bs10B and Bs14B, but not with Bs10A and Bs10B, which means that the duplications are intra-haplotype, present in B but not in A. But I wonder if it is possible that Bs10B and Bs14B could in fact be switched to haplotype A, and therefore there would be no duplication nor deletion in one of the haplotypes, just a simple translocation. I may be wrong in the interpretation, but I'm curious to understand the graph. In any case, again, as mentioned above, it would be worthwhile to validate some of those variations with long reads, which could illuminate the biological relevance of the differences between the haplotypes and rule out potential artifactual errors of the assemblies.

      I notice that in Figures 7 and S13 some lines are thicker than others. Is this because many "thin" lines overlap and so look like one "thick" line? Otherwise, the visual effect of different thicknesses could be misleading. Please clarify.

      In the analysis of the Hox cluster the authors say "[…] our new assembly revealed that B. schlosseri's Hox genes are not scattered. Instead, eight of them were clustered on the second largest scaffold (Bs2), whereas two other ones are found on the 15th largest scaffold (Bs15)." Generally, describing Hox genes as a cluster refers to the fact that they lie in close proximity, with few or no other genes in between them. Therefore, I would not describe the eight Hox genes as clustered simply because they are on the same chromosome (maybe even on different arms).

    1. Abstract. Background: Reference genomes for the entire sea turtle clade have the potential to reveal the genetic basis of traits driving the ecological and phenotypic diversity in these ancient and iconic marine species. Furthermore, these genomic resources can support conservation efforts and deepen our understanding of their unique evolution. Results: We present haplotype-resolved, chromosome-level reference genomes and high-quality gene annotations for five sea turtle species. This completes the catalog of reference genomes of the entire sea turtle clade when combined with our previously published reference genomes. Our analysis reveals remarkable genome synteny and collinearity across all species, despite the clade’s origin dating back more than 60 million years. Regions of high interspecific genetic distance and intraspecific genetic diversity are consistently clustered in genomic hotspots, which are enriched with genes coding for immune response proteins, olfactory receptors, zinc fingers, and G-protein-coupled receptors. These hotspot regions may offer insights into the genetic mechanisms driving phenotypic divergence among species, and represent areas of significant adaptive potential. Ancient demographic analysis revealed a synchronous population expansion among sea turtle species during the Pleistocene, with varying magnitudes of demographic change, likely shaped by their diverse ecological adaptations, and biogeographic contexts. Conclusions: Our work provides genomic resources for exploring genetic diversity, evolutionary adaptations, and demographic histories of sea turtles. We outline genomic regions with increased diversity, linked to immune response, sensory evolution, and adaptation to varying environments that have historically been subject to strong diversifying selection, and likely will underpin sea turtle’s responses to future environmental change.
These reference genomes can assist conservation by providing insights into the demographic and evolutionary processes that sustain and threaten these iconic species.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf105), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Brendan Reid

      The authors of this work provide a fantastic addition to the genomic resources currently available for marine turtles with five new, apparently high-quality reference genomes. These new resources enable a number of interesting cross-species analyses in this group, including phylogenetic reconstruction, inference of demographic history, and identification of hotspots of diversity and divergence. I thought this paper was quite clearly written and easy to read overall, and I have one major and a few more minor comments/suggestions.

      Major comment: there is an extensive literature on hybridization among marine turtle lineages (see Vilaca et al. 2021, https://doi.org/10.1111/mec.16113, for a recent genomic example), with lots of evidence for ancient gene flow after initial lineage divergence as well as recent hybridization. The authors do not really mention this phenomenon at all, and since I think it has a lot of bearing on all of the results it would make sense to re-think your findings in light of the fact that some level of gene flow has occurred. Would extensive synteny/lack of genomic rearrangements potentially enable hybridization? Is overall low divergence among lineages potentially a function of gene flow? Are regions of high divergence the result of selection (as you suggest), or could these regions potentially be resistant to gene flow? I believe that IQtree assumes a strictly bifurcating tree, and gene flow can influence PSMC inferences (see Mazet et al. 2016, https://doi.org/10.1038/hdy.2015.104) - how would gene flow among lineages affect your inference of divergence dates and demographic histories?

      Minor comments [note: line numbers would have been helpful for providing comments on specific items! I will refer to the lower-left page numbers and paragraphs instead]:

      page 3, paragraph 2: Some of the applications you refer to here don't seem terribly germane to the relevance of "genomic resources" in management and conservation per se, and several are just methods using some kind of genetic data ... e.g., "abundance"/close-kin mark recapture doesn't require full genomes (and the reference you cite used microsat data), and the "community"/eDNA applications don't generally rely on genomes but instead on databases of a few (usually mitochondrial) genes. Either include methods that truly benefit from the development of high-quality reference genomes or broaden this to something like "growth in molecular ecology techniques".

      page 4, paragraph 2: last sentence is a bit of a run-on, could break this up a bit.

      page 10, paragraph 3: for me, the ROH methods need some additional explanation and interpretation. The more detailed methods indicate that the ROH were identified on the basis of lower-than-average heterozygosity rather than true homozygosity - I can understand why this might have been done (since the baseline level of heterozygosity varies across species) but it still seems a bit arbitrary and could risk mistaking stretches with simply low variation for IBD tracts. I wonder if a ROH-detection method like ROHan that explicitly incorporates baseline genomic heterozygosity into its model would be more appropriate for comparing results across species and could give different results. I also question a bit the interpretation of these low-diversity tracts as evidence of inbreeding per se. The authors do not comment much on the length distributions of these ROH - given that many of them are quite short I would expect that if there was mating between close kin it probably happened far back in the past and the IBD tracts have been broken up by recombination.

      page 11, paragraph 2: for PSMC analyses it is important to note that the method assumes that differences in coalescence time/Ne across the genome result from demography alone. If portions of the genome are under balancing/diversifying selection (such as the areas of high diversity that you detect in this study), the local Ne inferred for these regions would be expected to be larger than for the rest of the genome, which could lead to the spurious detection of population expansion or contraction (more likely a contraction for balancing selection). See Boitard et al. 2022 (https://doi.org/10.1093/genetics/iyac008) for a more detailed treatment. I would try excluding the regions putatively under diversifying selection and re-running PSMC to see if your inferences change.
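      The concern raised for page 10 (that a threshold defined relative to average heterozygosity can flag stretches of merely low variation) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `low_het_runs`, the `fraction` cut-off of 0.5, and the per-window values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def low_het_runs(window_het: List[float], fraction: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) window-index ranges (end exclusive) whose
    heterozygosity is below `fraction` times the mean over all windows.

    The relative threshold is an assumption made for this sketch.
    """
    mean_het = sum(window_het) / len(window_het)
    threshold = fraction * mean_het
    runs = []
    start = None
    for i, het in enumerate(window_het):
        if het < threshold:
            if start is None:
                start = i  # a new low-heterozygosity run begins here
        elif start is not None:
            runs.append((start, i))  # the run ended at the previous window
            start = None
    if start is not None:
        runs.append((start, len(window_het)))
    return runs


# Hypothetical per-window heterozygosity (fraction of heterozygous sites).
het = [0.0010, 0.0012, 0.0002, 0.0001, 0.0003, 0.0011, 0.0009, 0.0002]
runs = low_het_runs(het)
print(runs)
```

      In this toy input every flagged window still contains heterozygous sites, so the flagged runs are low-diversity tracts rather than strictly homozygous ones, which is the distinction the comment asks the authors to address.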

    1. Abstract. The vast majority of cancers exhibit Somatic Copy Number Alterations (SCNAs)—gains and losses of variable regions of DNA. SCNAs can shape the phenotype of cancer cells, e.g. by increasing their proliferation rates, removing tumor suppressor genes, or immortalizing cells. While many SCNAs are unique to a patient, certain recurring patterns emerge as a result of shared selectional constraints or common mutational processes. To discover such patterns in a robust way, the size of the dataset is essential, which necessitates combining SCNA profiles from different cohorts, a non-trivial task. To achieve this, we developed CNSistent, a Python package for imputation, filtering, consistent segmentation, feature extraction, and visualization of cancer copy number profiles from heterogeneous datasets. We demonstrate the utility of CNSistent by applying it to the publicly available TCGA, PCAWG, and TRACERx cohorts. We compare different segmentation and aggregation strategies on cancer type and subtype classification tasks using deep convolutional neural networks. We demonstrate an increase in accuracy over training on individual cohorts and efficient transfer learning between cohorts. Using integrated gradients we investigate lung cancer classification results, highlighting SOX2 amplifications as the dominant copy number alteration in lung squamous cell carcinoma.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Ellen Visscher

      The paper introduces a python package for imputation, filtering, segmentation, feature extraction and visualisation of CNA profiles. It explains some of the elements of the package, and then demonstrates how data from multiple cohorts can be processed and combined using the package preprocessing pipeline. The authors then use processed data from 3 different cohorts to perform cancer type prediction using a CNN. From this, they get an interesting result to find a biomarker that differentiates two different lung cancers. Throughout, they show visualisations using their package. The package itself seems well documented and designed to be used. There is some clarification required in the methods section specifically around the CNN training and the models therein. There is also one major question of whether all the preprocessing steps are actually required for the downstream CNN analysis. Overall, however, this is a well written manuscript, providing a useful software tool for further analysis of CNA data.

      Major comments: - CNN section: how are the segments decided? Is it based on all the training data, or just the data in a batch? - Throughout the results pertaining to Figure 3A-C, you call it test accuracy. To be clear, is this based on your CV hold-outs? This should be reworded everywhere to reflect this. As cross-validation indicates, this is not a test set but a validation set, which is also the way you use it. - Regarding the above, you have a comment saying: "the best test accuracy without cross-validation was 92.34%". Could you please clarify what you mean by this? Only in the CNN section do you describe your training approach, which does not mention a test or separate validation set. - It reads slightly unclearly: you have a section called "model transfer", but are you training 3 different models, one per dataset? You only have one figure for training results, which suggests one dataset, but then you have this section called model transfer. - Regarding all the above, please dedicate a small subsection in the methods to making this clearer. Are there dedicated test sets? If your main results are for aggregated data, then what are you testing on to ensure generalisability? What is the point of training the 3 different models on 3 different datasets? Perhaps it would make more sense to hold one dataset out as your test set. In some ways, that is what the model transfer is showing, but it would be less confusing to clarify that aim instead of suddenly introducing 3 models. - If the CNN architecture is essentially the same as in Attique et al., the performance is basically the same, and they use only CNs at gene locations, how does this demonstrate that the preprocessing from CNSistent is necessary or advantageous for this task? Maybe a result that combines CN calls naively over gene locations, compared against this across the aggregate datasets, would be a good way of comparing, i.e., showing that preprocessing does offer an advantage when combining different datasets together, which is also what you argue in your abstract. For this analysis you would have to make sure you also compare across the same samples to differentiate between filtering and other preprocessing steps. - In Figure 3I, you say "notice the similarity of chromosome 3 pattern for the correctly classified LUSC samples (red) and the misclassified ones (orange)". This is confusing because the orange and red are not similar. In fact, for this whole section, it seems that Figure 3I does not align with what you are saying.

      Minor comments/errors: - Please clarify why CNSistent needs a reference genome if it is dealing with segments. How is this information used? Is it just for the known gaps? - Your caption of Supplementary Figure 1 has a typo about a breakpoint at 16 instead of 14. - You do not explain how you use the knee point to filter (i.e., is it samples above or below the knee point?). - Your CNN graphic is difficult to interpret and non-standard. - The CNN section should clarify at the beginning what the input is and what the output is (i.e., a prediction that a sample belongs to a particular cancer type) before explaining the architectural details. - Even though you control for class imbalance, some cancer types are so poorly represented that it is unlikely a CNN could learn them. You do kind of mention this in the discussion, but maybe some sort of minimum threshold for inclusion would make sense. - For Fig 2D you refer to it as GND, but the axes/title say hemizygosity. Are these things equivalent? E.g., a sample could have 3-3, low hemizygosity but not diploid? Or, if it is aggregated across the whole genome, is it assumed equivalent? - There is a grammatical error: "Runtimes decreased in a near-linearly with the number of compute cores". - You make a comment that "We therefore suspect some TCGA lung cancers might be cases of co-occurring adeno and squamous carcinomas." This is a possibility, but given the pleiotropy of many phenotypes, it may also be that the biomarker is not always unique to squamous carcinomas.

      Suggestions/Nice to haves: - Maybe make it clearer inside the paper what visualisations come with CNSistent. Looking at the software documentation, there are obviously a lot of useful visualisations that come with it, and some of them you have used in Figure 3, for example. - Given there are more total CN callers, it may be good to mention somewhere how CNSistent would work for total CNs only. - You remove profiles that you say are uninformative; could you not include these and then just show how accuracy correlates with, e.g., the number of breakpoints? In some ways one might think that there could be useful information in profiles with few alterations, because those alterations might be more upstream/causal. - The aggregation step could maybe affect downstream analysis, i.e., taking the average could introduce CNs that were never called. Even using min/max implies a constant copy number in that region, which may lose information; e.g., if it is a functional region, having two different CNs across a gene might imply non-functionality. Did you explore the effect of the aggregation step? Perhaps taking a small enough resolution of segment types would account for this anyway.
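      The point about aggregation in the last suggestion can be made concrete with a small sketch. It does not use the CNSistent API: the helper `aggregate_bin`, the segment tuple layout, and the coordinates are assumptions made only to show how a length-weighted mean over a bin can yield a copy number that no caller reported, while min/max each hide the second state.

```python
from typing import Dict, List, Tuple

# Hypothetical segment layout: (start, end, copy_number), end exclusive.
Segment = Tuple[int, int, int]


def aggregate_bin(segments: List[Segment], bin_start: int, bin_end: int) -> Dict[str, float]:
    """Summarise the copy numbers of all segments overlapping one bin."""
    weighted_sum = 0
    covered = 0
    copy_numbers = []
    for start, end, cn in segments:
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap > 0:
            weighted_sum += overlap * cn  # weight each CN by overlapping length
            covered += overlap
            copy_numbers.append(cn)
    if covered == 0:
        raise ValueError("no segment overlaps the bin")
    return {
        "mean": weighted_sum / covered,
        "min": min(copy_numbers),
        "max": max(copy_numbers),
    }


# One bin spanning a breakpoint between CN 2 and CN 5.
segments = [(0, 600, 2), (600, 1000, 5)]
summary = aggregate_bin(segments, 0, 1000)
print(summary)
```

      Here the mean lands between the two called states, and the min and max each report only one of them, so a bin-level feature alone cannot tell that a breakpoint lies inside the region.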

    1. Abstract. Polyadenylation is a dynamic process which is important in cellular physiology. Oxford Nanopore Technologies direct RNA-sequencing provides a strategy for sequencing the full-length RNA molecule and analysis of the transcriptome and epi-transcriptome. There are currently several tools available for poly(A) tail-length estimation, including well-established tools such as tailfindr and nanopolish, as well as two more recent deep learning models: Dorado and BoostNano. However, there has been limited benchmarking of the accuracy of these tools against gold-standard datasets. In this paper we evaluate four poly(A) estimation tools using synthetic RNA standards (Sequins), which have known poly(A) tail-lengths and provide a valuable approach to measuring the accuracy of poly(A) tail-length estimation. All four tools generate mean tail-length estimates which lie within 12% of the correct value. Overall, Dorado is recommended as the preferred approach due to its relatively fast run times, low coefficient of variation and ease of use with integration with base-calling.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf098), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Christoph Dieterich

      In this manuscript, the authors present a benchmark to assess the performance of different tools designed for estimation of polyA tail length from Nanopore direct RNA-sequencing data. These tools include tailfindr, nanopolish, Dorado and Boost Nano. Benchmarks on tools and algorithms to analyze Nanopore data, both third party tools and official ONT releases, are of utmost importance for the field. The use of synthetic constructs with known ground truth is recommended as well. Consequently, this study has the potential to provide a significant contribution to the field.

      In its current form, however, I cannot recommend it for publication in GigaScience. My major concerns are: a) Use of only RNA002 data. This chemistry is outdated, and thus the benchmark is only relevant for old, possibly already published data. A comprehensive benchmark should also include RNA004 and the tools available for it (at least Dorado). b) The current data set contains only two polyA tail lengths, which are relatively short and do not cover the longer polyA tails that are common, e.g., in mammalian cells. A proper benchmark should show the performance of the analyzed tools over a range of polyA tail lengths.

      Minor comments: 1) Abstract: "All four tools generate mean tail-length estimates which lie within 13% of the correct value." The value of 13% is given in the abstract from the submission system, whereas the abstract in the main text says 12%. Which value is correct? 2) Background, first paragraph: the role of the polyA tail in RNA circularization, which is required for efficient translation of cellular mRNAs, is not mentioned. A reference is missing for "is increasingly recognised as a dynamic process which influences timing and degree of protein production." 3) Background, second paragraph: Chiron seems to be a relatively old basecaller (no models for new chemistries). It should be mentioned here that it is required for BoostNano. 4) Mis-priming of internal polyA sites may be an important confounding (and currently overlooked) source of errors in Nanopore sequencing. This should be quantified properly and analyzed in more detail (length of these stretches, influence of other nucleotides within the A-rich stretch, etc.). This should also be done on whole-transcriptome data with more possible mispriming sites. 5) Why do the authors think that the poly(T) stretch of the RTA might be truncated? This is composed of DNA oligos, which should be quite stable. 6) What are the parameters for filtering used by Dorado and BoostNano? Can the authors explain why the filtered reads differ? 7) Dorado seems to systematically underestimate polyA tail length. Is this also true for data generated with RNA004 chemistry and longer polyA tails?

    1. Abstract. The ability to differentiate between viable and dead microorganisms in metagenomic data is crucial for various microbial inferences, ranging from assessing ecosystem functions of environmental microbiomes to inferring the virulence of potential pathogens from metagenomic analysis. While established viability-resolved genomic approaches are labor-intensive as well as biased and lacking in sensitivity, we here introduce a new fully computational framework that leverages nanopore sequencing technology to assess microbial viability directly from freely available nanopore signal data. Our approach utilizes deep neural networks to learn features from such raw nanopore signal data that can distinguish DNA from viable and dead microorganisms in a controlled experimental setting of UV-induced Escherichia cell death. The application of explainable AI tools then allows us to pinpoint the signal patterns in the nanopore raw data that allow the model to make viability predictions at high accuracy. Using the model predictions as well as explainable AI, we show that our framework can be leveraged in a real-world application to estimate the viability of obligate intracellular Chlamydia, where traditional culture-based methods suffer from inherently high false negative rates. This application shows that our viability model captures predictive patterns in the nanopore signal that can be utilized to predict viability across taxonomic boundaries. We finally show the limits of our model’s generalizability through antibiotic exposure of a simple mock microbial community, where a new model specific to the killing method had to be trained to obtain accurate viability predictions.
While the potential of our computational framework’s generalizability and applicability to metagenomic studies needs to be assessed in more detail, we here demonstrate for the first time the analysis of freely available nanopore signal data to infer the viability of microorganisms, with many potential applications in environmental, veterinary, and clinical settings. Author summary: Metagenomics investigates the entirety of DNA isolated from an environment or a sample to holistically understand microbial diversity in terms of known and newly discovered microorganisms and their ecosystem functions. Unlike traditional culturing of microorganisms, genomic approaches are not able to differentiate between viable and dead microorganisms since DNA might persist under different environmental circumstances. The viability of microorganisms is, however, of importance when making inferences about a microorganism’s metabolic potential, a pathogen’s virulence, or an entire microbiome’s impact on its environment. As existing viability-resolved genomic approaches are labor-intensive, expensive, and lack sensitivity, we here investigate our hypothesis that freely available nanopore sequencing signal data that captures DNA molecule information beyond the DNA sequence might be leveraged to infer such viability. This hypothesis assumes that DNA from dead microorganisms accumulates certain damage signatures that reflect microbial viability and can be read from nanopore signal data using fully computational frameworks. We here show first evidence that such a computational framework might be feasible by training a deep model on controlled experimental data to predict viability at high accuracy, exploring what the model has learned, and using it in a real-world application by application to a bacterial species of veterinary relevance. We finally show that a specific model has to be trained to accurately predict viability after antibiotic exposure of a mock microbial community.
While the generalizability of our computational framework therefore needs to be assessed in much more detail, we here demonstrate that freely available data might be usable for relevant viability inferences in environmental, veterinary, and clinical settings.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf100), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Finlay Maguire

      In this paper the authors train a ResNet-based model to predict whether individual 10,000 sample chunks of nanopore signal data originate from live or killed bacterial isolate cultures. From live and UV-killed (at exponential phase) E. coli K-12 cultures DNA was extracted and sequenced using separate R10.4.1 flowcells on a MinION. Signal data from each read in the live and dead extractions were then processed by discarding the first 1,500 samples and dividing the remaining signals into 10,000 sample chunks. These were then split into a balanced 60:20:20 train, test, and validation datasets with the constraint that no two chunks from the same read would end up in the same dataset (e.g., chunk 1 and chunk 2 of 1st read in the killed culture would hypothetically be separated into train and test). During this they also explored/compared the impact of chunk size, model architecture, and performance of a sequence based model using the E. coli data. With a nicely performed class-activation map and masking approach they then identified the signal regions most strongly associated with dead-predictions (such as twisting/kinking/pore blockage of DNA around pyrimidine dimers). Finally, they applied their trained model to a live and heat-killed Chlamydia abortus culture and compared their results to stained microscopy and propidium monoazide PCR measures of viability. They found equivalent performance on the C. abortus data to their E. coli data (despite a different killing-method and taxa).
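      The chunking step summarised above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `chunk_signal` is hypothetical, the defaults (1,500 samples skipped, 10,000 samples per chunk) come from the summary, and dropping the trailing partial chunk is an assumption of this sketch.

```python
from typing import List, Sequence


def chunk_signal(signal: Sequence[float], skip: int = 1500,
                 chunk_size: int = 10000) -> List[Sequence[float]]:
    """Drop the first `skip` samples, then cut the rest into
    non-overlapping chunks of exactly `chunk_size` samples.

    Assumption: a trailing remainder shorter than `chunk_size` is discarded.
    """
    trimmed = signal[skip:]
    n_chunks = len(trimmed) // chunk_size
    return [trimmed[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]


# Toy "read" of 36,500 samples: 35,000 remain after trimming, giving 3 chunks.
signal = list(range(36500))
chunks = chunk_signal(signal)
print(len(chunks), [len(c) for c in chunks])
```

      Because one long read yields several chunks, longer reads contribute more training examples than short ones, which is relevant to the points about read length and chunk independence raised later in this review.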

      The manuscript is well written and the methods are clearly described (including well documented code and deposited data). The authors' explainability methodology is excellent, although it would have been nice to see a bit more in-depth interpretation of those results. The authors have also presented a convincing case that nanopore signal data does contain information that can be used to distinguish signal chunks from live and dead bacterial monocultures. This method has the potential to be useful in clinical and environmental genomics if it can be extended to more heterogeneous metagenomic samples. However, despite the title and framing of this manuscript (i.e., "metagenomics"), their analyses do not involve any metagenomic data and their results so far do not demonstrate whether this is feasible. Currently, the overall framing (and title) of the manuscript is not appropriate given the work performed at this point. Similarly, given that both E. coli and C. abortus "dead" cultures resulted in a median read length less than half that of the live cultures, the authors do not fully make the case that the signal and ResNet approach is actually required relative to simpler baseline models. Finally, although they did evaluate performance on a completely separate dataset, the authors should at least explore/quantify the correlation of live/dead prediction across chunks of the same read, given the default expectation of non-independence of signal chunks from the same read.

      Major - Although the title and framing of the paper suggest that the authors are classifying live and dead bacteria in metagenomic datasets, the actual experiments and method developed are entirely based around sequencing of cultured clonal bacterial isolates. Metagenomic datasets are going to have considerably more heterogeneity in viability, species composition, and DNA signal characteristics. Given this, the paper's title, introduction, and parts of the discussion are a bit of an oversell and inappropriate. This manuscript should be revised to more clearly reflect the work actually performed.

      • This paper doesn't establish whether a ResNet + Signal approach actually outperforms a much simpler baseline. For example, given there are clear extraction and median read-length differences between live and dead samples, it is possible that a much simpler logistic model using basic features such as read length and/or translocation could perform equivalently.

      • Although the C. abortus analysis demonstrates limited impact of leakage, I'm still a bit concerned about the potential non-independence of chunks from the same read (i.e., chunk 1 and chunk 3 of the same read are more likely to share similar live/dead signal characteristics than chunks 1 and 3 of different reads). By not having multiple chunks of the same read in the training, validation, or test datasets the authors may have avoided issues with longer reads being more represented in their datasets. However, this has the potential to introduce data leakage between the train and test sets (which may impact generalisability when they attempt to extend this method to metagenomics). I think this paper would be improved by some exploration of the correlation of live/dead prediction across chunks of the same read. How often do different chunks of the same read disagree? How does this impact the overall performance of the model? Does taking the average prediction across chunks of the same read improve or degrade performance? Would this problem be better suited to a multiple instance learning approach (i.e., a live/dead label applied to all chunks from a single read), especially in more heterogeneous datasets? To what degree do longer reads with more chunks contribute disproportionately to the overall performance in the C. abortus dataset?
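      The read-level questions in the last point (how often chunks of one read disagree, and what averaging across chunks would give) could be explored along the following lines. This is only a sketch under stated assumptions: the input format of `(read_id, p_dead)` pairs, the function name `read_level_summary`, the 0.5 decision threshold, and the toy predictions are all hypothetical and not taken from the manuscript.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def read_level_summary(chunk_preds: Iterable[Tuple[str, float]],
                       threshold: float = 0.5) -> Dict[str, dict]:
    """Group chunk-level 'dead' probabilities by read and summarise them."""
    by_read = defaultdict(list)
    for read_id, p_dead in chunk_preds:
        by_read[read_id].append(p_dead)

    summary = {}
    for read_id, probs in by_read.items():
        calls = [p >= threshold for p in probs]  # per-chunk dead/viable calls
        mean_p = sum(probs) / len(probs)
        summary[read_id] = {
            "n_chunks": len(probs),
            "mean_p_dead": mean_p,
            "read_call_dead": mean_p >= threshold,  # call from averaged probability
            "chunks_disagree": len(set(calls)) > 1,  # mixed calls within one read
        }
    return summary


# Hypothetical chunk-level model outputs.
preds = [("read_a", 0.9), ("read_a", 0.7), ("read_a", 0.2),
         ("read_b", 0.1), ("read_b", 0.3)]
summary = read_level_summary(preds)
for read_id, stats in summary.items():
    print(read_id, stats)
```

      Averaging is just one possible aggregation; a majority vote or a multiple instance learning model, as the reviewer suggests, would slot into the same per-read grouping.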

      Minor

      • SRA records don't seem to be live yet (https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=1123127)

      • Are the actual pod5 files available?

      • Read-level performance should be analysed and reported.

      • Figure 1B: the test subplot numbers are almost too small to read - they may benefit from being its own panel.

      • Plot axes labels are not always clear (e.g., Figure 3) percentage of what? Chunks? or Reads? It would be nice to see consistent capitalisation of labels and legends.

      • Predictions on viable E. coli and viable C. abortus seem surprisingly similar (91.44% vs 91.34% viable and 8.56% vs 8.66% dead) despite different taxa, potentially different underlying viable cell proportions, and different output probability densities. This would benefit from further discussion/analysis: do misclassified chunks have any common characteristics? Would you expect the E. coli to have a similar microscopy/PCR-measured viability percentage to the C. abortus?

      • It would be good to see a bit more discussion/exploration of the impact of mixed live/dead cells, given the ~37.6% viability measure in the C. abortus sample (e.g., how well do models perform with different ratios of live/dead reads?). This could potentially be achieved using in-silico spike-ins.

    1. There is a third kind of answer that, without competing with the previous two, demonstrates the value of philosophy, even (perhaps, especially) for students like our imagined protagonist: philosophy is the antidote to the uncritical acceptance of the world and ourselves as we are.

      I like the phrase "antidote to the uncritical acceptance" quite a lot. At first, you may think that an "uncritical acceptance" isn't necessarily a bad thing. However, thinking about it more, do you really want to just blindly accept the world around you? Looking critically at yourself and the world allows you to make changes and work to improve the lives of yourself and others, among many other things, simply because you dared to question.

    2. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way

      This is a very interesting way to think about justice. The author introduces this method to imagine a fair society with no bias. It works so well because not knowing what your position in society will be allows you to genuinely try your best to make society as fair and enjoyable as possible for every individual.

    3. Therefore, the first step in this kind of philosophical education is to shake students out of a complacent and uncritical acceptance of the world as it is.

      I think this is one of the most important reasons why we need to study philosophy. When we repeat our daily routines and become accustomed to them, we tend to overlook the injustices within them or we may not even recognize them as injustices. Philosophy enables us to think more critically about the society we live in, its institutions, and the impact they have on us.

    4. When students take this imaginative exercise seriously, they start to feel as discomfited as Descartes himself must have. The ground starts shaking under them. It is at this moment that philosophy starts its work.

      By asking so many bizarre questions that one does not normally consider on a day-to-day basis, philosophy pushes us outside of our comfort zone and forces us to take a step into the unknown. This encourages our brains to work in ways they normally would not, to ask questions beyond our usual scope of thinking, and to create new connections and ideas that we might not otherwise have considered. I think this emphasizes the importance of philosophy, because it teaches us how to react when we are pushed outside of our comfort zone and how to think beyond our normal flow of consciousness.

    5. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      We are so used to the life we live that, in some ways, we become comfortable in it. When imagining a different reality, one in which they may be less high up or wealthy, it becomes difficult for some to acknowledge just how much privilege they once had. The "Theory of Justice" gives people a different perspective on life and on how different each person's life is from another's.

    6. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way. The power of this exercise contributed in no small way to my becoming a philosopher. I have recreated a similar activity in various classes I have taught. The discussion it generates among students is reliably superb, but the best moment is when students discover their fate – whether they end up being a doctor or a garbage truck driver or a poor young mother – and have to reckon (at least for that class period) with their principles. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      Though it was a little difficult for me to picture this in real life, as it is not realistic that society is completely unaware of one's capabilities before choosing their position in the social hierarchy, I think that this is fascinating to imagine. We often forget that we may not be as secure in our social status or career as we think we are, so it is important to be aware of those of lower status around you and not take your position for granted.

    7. Now, ask yourself: what could philosophy do for you?

      I think this is a very interesting start to this article! It puts us into the shoes of someone in a difficult position, in which they must tirelessly work away to simply have a shot at a decent, livable lifestyle. I feel that this scenario they painted for us so vividly is really powerful when leading into this question, because I think people in the current climate of the world tend to underestimate the importance of philosophy, or don't really think about it at all. While maybe a lot of us don't completely relate to the situation of the young mother, a lot of us DO have our own struggles and might find ourselves lost in the grueling work that may come with everyday life. And when simply going through with our daily lives is hard enough, why should we bother with philosophy? Personally, I don't really think about the idea of philosophy at all, and I never really thought it would be relevant to me based on what I want to do in life. And when people don't think something is relevant, why bother with it, right? Life is busy enough as it is. But really, it probably has a lot more relevancy in my life than I think, and I believe that this idea is somewhat being conveyed in this part. That's just how I saw this paragraph, but I thought it was a strong opening!

    8. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way. The power of this exercise contributed in no small way to my becoming a philosopher. I have recreated a similar activity in various classes I have taught. The discussion it generates among students is reliably superb, but the best moment is when students discover their fate – whether they end up being a doctor or a garbage truck driver or a poor young mother – and have to reckon (at least for that class period) with their principles. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      This is a brilliant way to describe others' lived experiences and how what might not affect you could affect someone else. Using philosophical teachings can reveal the privileges of some and the shortcomings of others, and hopefully create a better understanding of everyone's blind spots in day-to-day life. Truly a very powerful and humbling exercise that can help create common ground, allow others to empathize with each other, and hopefully create a more just society.

    9. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way. The power of this exercise contributed in no small way to my becoming a philosopher. I have recreated a similar activity in various classes I have taught. The discussion it generates among students is reliably superb, but the best moment is when students discover their fate – whether they end up being a doctor or a garbage truck driver or a poor young mother – and have to reckon (at least for that class period) with their principles. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      This idea that we must get rid of the idea of "safety" within our lives and experiences can be imagined as a vision of the future that we, as people, don't want to imagine. Being a "poor mother" or a "garbage truck driver" can be thought of as a disappointing fate by many who attend college; it can even be a fate so poor in the minds of students that it serves as motivation: to not be like "them" is a phrase that sticks with many who hold themselves to a high idea of success. But I believe in and resonate with this idea of harnessing imagination, as it broadens our perspective on education and life, because no matter how safe we feel behind a wall of education or wealth, there can always be a force of society that challenges our goals.

    1. Abstract: Water buffalo is a cornerstone livestock species in many low- and middle-income countries, yet major gaps persist in its genomic characterization, complicated by the divergent karyotypes of its two sub-species (swamp and river). Such genomic complexity makes water buffalo a particularly good candidate for the use of graph genomics, which can capture variation missed by linear reference approaches. However, the utility of this approach to improve water buffalo has been largely unexplored. We present a comprehensive pangenome that integrates four newly generated, highly contiguous assemblies of Pakistani river buffalo with available assemblies from both sub-species. This doubles the number of accessible high-quality river buffalo genomes and provides the most contiguous assemblies for the sub-species to date. Using the pangenome to assay variation across 711 global samples, we uncovered extensive genomic diversity, including thousands of large structural variants absent from the reference genome, spanning over 140 Mb of additional sequence. We demonstrate the utility of these data by identifying putative functional indels and structural variants linked to selective sweeps in key genes involved in productivity and immune response across 26 populations. This study represents one of the first successful applications of graph genomics in water buffalo and offers valuable insights into how integrating assemblies can transform analyses of water buffalo and other species with complex evolutionary histories. We anticipate that these assemblies, and the pangenome and putative functional structural variants we have released, will accelerate efforts to unlock water buffalo’s genetic potential, improving productivity and resilience in this economically important species.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf099), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 4: Wai Yee Low

      Review of "A comprehensive water buffalo pangenome reveals extensive structural variation linked to population specific signatures of selection". This is an impressive work at the frontier of buffalo genomics. I truly enjoyed reading the work, and my questions/comments are aimed at improving it further. My detailed comments are below:
      Line 30: I think it would be better to include the actual number of publicly available assemblies used to create the pangenome graph.
      Line 71: There is now a swamp buffalo reference genome with annotation too (NCBI accession: PCC_UOA_SB_1v2). Perhaps consider citing the swamp buffalo reference (https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae053/7753516) and rewriting the sentence to say that a pangenome can be used for both swamp and river buffalo, but a single linear reference from either subspecies is not good enough for read mapping.
      Line 79: "highlighted"
      Line 82: What do you mean by "higher quality"? The assemblies have been discussed in this review: https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.629861/full
      Line 105: Technically, the graph method for bovine species, which includes water buffalo, is being investigated by the Bovine Pangenome Consortium (BPC). However, nothing useful has been published on the buffalo graph, but perhaps consider citing the BPC since your paper overlaps with it (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02975-0).
      Line 165: It would be good to add a bit more context on the PanGenie method here, as researchers in the buffalo community are not used to it. Additionally, it would be great if all code were made available on GitHub or as Supplementary Information.
      Line 170: To produce a phased pangenome graph, don't you need all input assemblies to be phased? Are all input assemblies phased? UOA_WB_1 is locally phased, not phased throughout the genome.
      Line 235: "a list of 403 unrelated individuals." What does this translate to in terms that geneticists can understand? Do you mean siblings have been removed? Or individuals sharing the same grandparents were removed?
      Line 246: Can you please explain how you got the coordinates to match between the GATK and PanGenie methods? You'll need matching coordinates for concordance analysis. As I understand it, GATK was based on UOA_WB_1?
      Line 254: Why these 3 chromosomes?
      Line 257: If you had not filtered for relatedness, how would it have impacted the selective sweep work? I think including some context will help the readers.
      Line 259: Do you mean at least six samples per group? If yes, are 6 samples enough?
      Line 261: Genotype quality less than 25 according to bcftools? Since you only used biallelic variants, please provide the breakdown between biallelic and multiallelic.
      Line 281: "… we first PacBio HiFi sequenced one female" Please rewrite this.
      Line 282: How common are these two breeds in percentage?
      Line 291: Is this already known? Perhaps cite the literature to show the agreement with previous studies?
      Fig 1D: This is a bit too small to see, especially the SV distribution at the bottom. I can hardly see the median.
      Line 310: Why did you choose UOA_WB_1 as the reference?
      Line 311: Do the ~32.8 million variants include SNPs as well?
      Fig 2: This is probably a panel of a figure but should not be the entire figure. The size of the circle indicates sample size, but there should be a legend on the plot giving the sizes, right? Should a darker colour be used to highlight the countries with samples instead of white? Maybe this could be a Supplementary figure too.
      Line 356: Should S Figures 4 and 5 be main figures? You will need to annotate the abbreviation of sample-country in the legend of S Figure 5.
      Line 360: "To enable reuse we have made this dataset available …" The dataset should be made available to reviewers?
      Line 368: "76% of SNVs were called by both callers" 76% seems low. Also, called does not mean concordant. What is the concordance among called SNVs in both? Did the pangenome approach call most of the variants found by GATK? If not, what might be the reasons?
      Fig 3B: It is not immediately clear what the difference is between non-repetitive and repetitive regions. The overlapping text on the x-axes makes it hard to read.
      Line 390: "Analyses such as the study of selective sweeps or genome-wide association studies where low frequency variants are often filtered out will benefit less from the advantages of GATK, particularly given its longer run time." From here on, in this paragraph, it's Discussion, not Results.
      Line 418: Why human? Could you use cattle?
      Line 427: I tried the browser and am not sure what I can learn from it. It would be helpful to have a README with some examples of what can be explored.
      Line 450: How large must a variant be before you consider it a larger variant? Does this ability to study larger variants still hold despite using only ~10 assemblies in the graph? Will the use of short reads for the selective sweep study still benefit from being able to incorporate these larger variants? As I understand it, the larger variants were found only from the graph, not from the short reads. As such, the selective sweep may not be associated with any larger variants?
      Line 470: Should Fig S8 be a main figure?
      Line 513: Instead of the UniProt link, perhaps consider including this as Supplementary Information or text. The information in the link may change in the future.
      Line 551: However, without scaffolding, the assemblies of Pakistani river buffalo may not be good enough to function as reference genomes for river buffalo?
      Line 552: When considering new bases, did you do this for each assembly independently, or were the new bases discovered cumulatively?
      Line 581: Some of my questions at Line 450 can be discussed here.
      Line 586: Perhaps consider discussing the limitations of the small number of assemblies used to create the graph. As such, many SVs are likely still missing and we are still unable to properly assess allele frequency of these larger SVs. Additionally, while some SVs may not be considered as large in this work, it does not mean they have no impact.
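
      The distinction raised in the Line 368 comment ("called does not mean concordant") can be illustrated with a small sketch. It is not the manuscript's pipeline: the function name, the toy calls, and the choice of the union of sites as the denominator for "called by both" are all assumptions made for illustration, and genotypes are assumed to be already normalized to the same representation and coordinates.

```python
def overlap_and_concordance(calls_a, calls_b):
    """Compare two call sets for one sample.

    calls_a, calls_b: dicts mapping a site key (chrom, pos, ref, alt)
    to a genotype string such as "0/1". Hypothetical input format.
    Returns (fraction of sites called by both, genotype concordance
    among the shared sites).
    """
    sites_a, sites_b = set(calls_a), set(calls_b)
    shared = sites_a & sites_b
    union = sites_a | sites_b
    # "Called by both": site overlap, here relative to the union (assumption).
    called_by_both = len(shared) / len(union) if union else 0.0
    # "Concordant": same genotype at a site that both callers report.
    matching = sum(calls_a[site] == calls_b[site] for site in shared)
    concordance = matching / len(shared) if shared else 0.0
    return called_by_both, concordance


# Toy, made-up calls for a single sample (not data from the manuscript).
caller_one = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 250, "C", "T"): "1/1",
    ("chr1", 400, "G", "A"): "0/1",
    ("chr2", 50, "T", "C"): "0/1",
}
caller_two = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 250, "C", "T"): "0/1",  # same site, different genotype
    ("chr1", 400, "G", "A"): "0/1",
    ("chr2", 75, "G", "C"): "1/1",
}

both, conc = overlap_and_concordance(caller_one, caller_two)
print(f"called by both: {both:.2f}, genotype concordance: {conc:.2f}")
```

      In this toy case the two numbers differ: a site can be reported by both callers and still carry different genotypes, which is why the reviewer asks for concordance to be reported separately from overlap.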

    1. Abstract. Background: Influenza A virus (IAV) poses a significant threat to animal health globally, with its ability to overcome species barriers and cause pandemics. Rapid and accurate prediction of IAV subtypes and host source is crucial for effective surveillance and pandemic preparedness. Deep learning has emerged as a powerful tool for analyzing viral genomic sequences, offering new ways to uncover hidden patterns associated with viral characteristics and host adaptation. Findings: We introduce WaveSeekerNet, a novel deep learning model for accurate and rapid prediction of IAV subtypes and host source. The model leverages attention-based mechanisms and efficient token mixing schemes, including the Fourier Transform and the Wavelet Transform, to capture intricate patterns within viral RNA and protein sequences. Extensive experiments on diverse datasets demonstrate WaveSeekerNet’s superior performance to existing models that use the traditional self-attention mechanism. Notably, WaveSeekerNet rivals VADR (Viral Annotation DefineR) in subtype prediction using the high-quality RNA sequences, achieving the maximum score of 1.0 on metrics including the Balanced Accuracy, F1-score (Macro Average), and Matthews Correlation Coefficient (MCC). Our approach to subtype and host source prediction also exceeds the pre-trained ESM-2 (Evolutionary Scale Modeling) models with respect to generalization performance and computational cost. Furthermore, WaveSeekerNet exhibits remarkable accuracy in distinguishing between human, avian, and other mammalian hosts. The ability of WaveSeekerNet to flag potential cross-species transmission events underscores its significant value for real-time surveillance and proactive pandemic preparedness efforts. Conclusions: WaveSeekerNet’s superior performance, efficiency, and ability to flag potential cross-species transmission events highlight its potential for real-time surveillance and pandemic preparedness. This model represents a significant advancement in applying deep learning for IAV classification and holds promise for future epidemiological, veterinary studies, and public health interventions.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf089), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Will Dampier

      The manuscript presented by Nguyen et al. is well written, well researched, and well executed. The use of this new "wavelet style" neural network shows both an increased training efficiency and improved accuracy at detecting influenza subtypes for surveillance. However, I think their comparison to a 'plain' Transformer model does not take advantage of the improvements in pre-training and transfer-learning that have become standard practice in deep-learning. I have also included some stylistic suggestions to improve the figures as presented. After addressing these comments, I believe that this will become a very strong manuscript.

      Major Comments:

      The authors present a comparison between their new wavelet architecture and a standard transformer architecture using a one-hot encoded vector of amino acids. I believe that this is the correct 'null model' to compare your wavelet architecture to; however, it does not represent the 'state of the art' in utilizing transformers for sequence analysis. As I'm sure the authors are aware, the disadvantage of transformers is that they take an extensive amount of training (they note the transformer-only models take 2-4X more training epochs to converge). However, the advantage they bring is that they can be extensively trained for one task and then transfer that learning to another related task. A number of models have been pre-trained on giant collections of proteins (Asgari et al., https://doi.org/10.1371/journal.pone.0141287; Rives et al., https://doi.org/10.1073/pnas.2016239118), which then allows one to transfer that knowledge to different domains with fewer examples, as demonstrated in Dampier et al. (https://doi.org/10.3389/fviro.2022.880618). It would be interesting to see whether your wavelet model defeats these pre-trained models with transfer learning. If you showed that, you could argue that there is no need for the extensive expense of 'foundational models'.

      The authors discuss that there is a significant imbalance in the training set and they used up-sampling and limiting to balance out the class representation. Since the classes are not equally represented, the model may not be equally able to predict each class. And the high metrics may only be a representation of its ability to predict the popular classes correctly. The authors should include an additional set of figures (supplemental is fine) that show the metrics broken out by Subtype. It would also be interesting to see a graph of the class-size (before up-sampling) vs F1-score (or another metric) on that class. This could provide lower-bounds for how many samples are needed to train the model.
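
      As a rough illustration of the per-class breakdown requested above, the following sketch computes the F1-score for each class alongside its class size, so the two can be tabulated or plotted against each other. It is not taken from the manuscript: the function name and the toy subtype labels are hypothetical, and only the Python standard library is used.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (support, f1)} for every label in y_true or y_pred.

    support is the number of true examples of the label (the class size
    in the evaluated set); f1 is 2*TP / (2*TP + FP + FN), set to 0.0
    when the denominator is zero.
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    results = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        results[label] = (support[label], 2 * tp / denom if denom else 0.0)
    return results


# Toy, made-up labels with an imbalanced class distribution.
y_true = ["H1N1"] * 6 + ["H3N2"] * 3 + ["H5N1"]
y_pred = ["H1N1"] * 6 + ["H3N2", "H3N2", "H1N1"] + ["H3N2"]

results = per_class_f1(y_true, y_pred)
for label, (n, f1) in results.items():
    print(f"{label}: class size = {n}, F1 = {f1:.3f}")

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"overall accuracy = {accuracy:.2f}")
```

      In this toy example the overall accuracy is high while the rarest class is never predicted correctly, which is the situation the comment is concerned about. The (class size, F1) pairs are what would go into the suggested class-size versus F1 plot.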

      Minor Comments:

      Figures 3, 4, and 5: These would benefit from a linked y-axis. It is hard to compare across A/B/C/D when the axes have different y-limits.

    1. Author response:

      We thank both reviewers for their valuable comments. We have prepared a point-by-point response below.

      Reviewer #1 (Public review):

      Weaknesses:

      (1) The conclusions regarding the links between neural and behavioral mechanisms are mostly well supported by the data. However, what is less convincing is the authors' argument that their study offers evidence of 'priming'. An important hallmark of priming, at least as is commonly understood by cognitive scientists, is that it is stimulus specific: i.e., a repeated stimulus facilitates response times (repetition priming), or a repeated but previously ignored stimulus increases response times (negative priming). That is, it is an effect on a subsequent repeated stimulus, not ANY subsequent stimulus. Because (prime or target) stimuli are not repeated in the current experiments, the conditions necessary for demonstrating priming effects are not present. Instead, a different phenomenon seems to be demonstrated here, and one that might be more akin to approach/avoidance behavior to a novel or salient stimulus following an appetitive/aversive stimulus, respectively.

      (2) On a similar note, the authors' claim that 'priming' per se has not been well studied in non-human animals is not quite correct and would need to be revised. Priming effects have been demonstrated in several animal types, although perhaps not always described as such. For example, the neural underpinnings of priming effects on behavior have been very well characterized in human and non-human primates, in studies more commonly described as investigations of 'response suppression'.

      We thank the reviewer for these critical comments. After careful consideration of both reviews, we agree that “priming” may not be the most accurate term to describe the behavioral phenomenon. We plan to revise our terminology throughout the manuscript accordingly to better capture the generalized nature of the effect we observe.

      (3) The outcome measure - i.e., difference scores between the two odors or odor and non-odor (i.e., the number of flies choosing to approach the novel odor versus the number approaching the non-odor (air)) - appears to be reasonable to account for a natural preference for odors in the mock-trained group. However, it does not provide sufficient clarification of the results. The findings would be more convincing if these relative scores were unpacked - that is, instead of analyzing difference scores, the results of the interaction between group and odor preference (e.g., novel or air) (or even within the pre- and post-training conditions with the same animals) would provide greater clarity. This more detailed account may also better support the argument that the results are not due to conditioning of the US with pure air.

      We use the PI score as a standard metric to quantify odor preference in all behavioral assays because it allows for robust comparison across different genetic or treatment groups under the same experimental setting. In the T-maze, real-time tracking of fly trajectories is technically difficult. With olfactory arenas, we showed some examples of fly distribution in quadrants over the entire odor choice test period (Figure 2—figure supplement 2) for both pre-trained and post-trained groups and discussed the trajectories in the Discussion. We will ensure this point is clarified in the revised text.

      Reviewer #2 (Public review):

      […] They finally recorded from different mushroom body output neurons, including the one (MBON-γ4γ5) likely affected by the increased activity of the corresponding γ4 reward dopaminergic neurons after shock preexposure. They recorded odour-evoked responses from these neurons before and after shock preexposure, but did not find any plasticity, while they found a logical effect during spaced cycles of aversive training.

      We thank the reviewer for the summary. We would like to clarify that we did, in fact, observe plasticity in MBON-γ4γ5 following shock exposure, as shown in Figure 4B.

      Overall, the study is very interesting with a substantial amount of behavioural analysis and in vivo 2-photon calcium imaging data, but some major (and some minor) issues have to be resolved to strengthen their conclusions.

      (1) According to neuropsychological work (Henson, Encyclopedia of Neuroscience (2009), vol. 7, pp. 1055-1063), "Priming refers to a change in behavioral response to a stimulus, following prior exposure to the same, or a related, stimulus. Examples include faster reaction times to make a decision about the stimulus, a bias to produce that stimulus when generating responses, or the more accurate identification of a degraded version of the stimulus". Or "Repetition priming refers to a change in behavioural response to a stimulus following re-exposure" (PMID: 18328508). I therefore do not think that the effects observed by the authors are really the investigation of the neural mechanisms of priming. To me, the effect they observed seems more related to sensitisation, especially for the activation of sweet-sensing neurons. For the shock effect, it could be a safety phenomenon, as in Jacob and Waddell, 2020, involving (as for sugar reward) different subsets for short-term and long-term safety.

      As noted in our response to Reviewer #1, we plan to revise our use of the term “priming” in the manuscript to more accurately interpret the behavioral phenomenon.

      (2) The authors missed the paper from Thomas Preat, The Journal of Neuroscience, October 15, 1998, 18(20):8534-8538 (Decreased Odor Avoidance after Electric Shock in Drosophila Mutants Biases Learning and Memory Tests). In this paper, one of the effects observed by the authors has already been described, and the molecular requirement of memory-related genes is investigated. This paper should be mentioned and discussed.

      We thank the reviewer for bringing this important reference to our attention. We will cite the Preat (1998) paper and discuss its relevant findings in relation to our own in the revised manuscript.

      (3) Overall, the bidirectional effect they observed is interesting; however, their results are not always clear, and the use of a delta PI is sometimes misleading. The authors have mentioned that shocks induced attraction to the novel odour, while they should stick to the increase or decrease in preference/avoidance.

      The ΔPI is calculated either as (trained PI – mock PI) for different animals or as (post PI – pre PI) for the same animals, with the specific calculation clarified in each figure legend. A positive ΔPI signifies an increase in preference for the odor, which is equivalent to a relative attraction or a decrease in avoidance.
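
      To make the two ΔPI definitions concrete, here is a minimal sketch. The PI formula used, (flies in odor − flies in air) / total, is the commonly used normalized form and is an assumption here, since the response does not spell it out; the function names and fly counts are invented for illustration.

```python
def preference_index(n_odor, n_air):
    """Preference index in [-1, 1]; positive means preference for the odor.

    Assumed form: (n_odor - n_air) / (n_odor + n_air).
    """
    total = n_odor + n_air
    if total == 0:
        raise ValueError("no flies counted")
    return (n_odor - n_air) / total


def delta_pi(pi_treated, pi_baseline):
    """ΔPI = treated PI minus baseline PI.

    Between groups: trained PI - mock PI.
    Within the same animals: post PI - pre PI.
    """
    return pi_treated - pi_baseline


# Between-group example (made-up counts): both groups avoid the odor,
# but the trained group avoids it less, so ΔPI is positive.
pi_mock = preference_index(n_odor=30, n_air=70)
pi_trained = preference_index(n_odor=45, n_air=55)
d_between = delta_pi(pi_trained, pi_mock)

# Within-animal example (made-up counts): the sign of PI flips from
# avoidance to attraction, and ΔPI is again positive.
pi_pre = preference_index(n_odor=40, n_air=60)
pi_post = preference_index(n_odor=60, n_air=40)
d_within = delta_pi(pi_post, pi_pre)

print(f"between groups: ΔPI = {d_between:+.2f}")
print(f"same animals:   ΔPI = {d_within:+.2f}")
```

      The first case shows a positive ΔPI with both PIs still negative (reduced avoidance), and the second a positive ΔPI with a reversal of valence, which matches the two readings of a positive ΔPI described above.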

      As not all experiments are done in parallel logic, it is not always easy to understand which protocol the authors are using. For example, only optogenetics is used in the appetitive preexposure. Does exposing flies to sugar or activating reward dopaminergic neurons also increase odour avoidance? Does the increased odour avoidance observed after optogenetic activation of sweet-sensing neurons involve reward (e.g., decreased response) and/or punishment (e.g., increased response)?

      We used different behavioral assays (T-maze or arena), stimuli (real shock or optogenetics), and protocols (different or same animal groups) to robustly demonstrate the phenomenon across platforms. We explained each protocol in the figures or text, and we will make them easier to follow in the revised version. We focused on activating a clean set of sugar-sensing neurons because this optogenetic stimulus is an effective and efficient substitute for real sugar. We agree that testing reward dopaminergic neuron activation is a logical extension and will consider adding these experiments in the revised work.

      The authors should always statistically test the fly behavioural performances against 0 to have an idea of random choice or a clear preference toward an odour.

      Our primary focus is on the change in preference induced by training, rather than the innate odor preference itself, which can be highly variable due to physiological and environmental factors. Statistical testing against 0 for innate preference scores is not standard practice in this specific paradigm, as the critical question is whether a treatment alters behavior relative to a control.

      On the appetitive side, the internal hunger state would play an important role. The authors should test it or at least discuss it.

      For appetitive experiments, we always starve the flies on 1% agar for two days prior to behavioral tests to standardize their hunger state. We will consider adding fed flies as control groups in the revised work.

      (4) The authors found a discrepancy between genetic backgrounds; sometimes the same odour can be attractive or aversive.

      We observed minor discrepancies in innate odor preferences across genetic backgrounds, which is a known and common occurrence. Different genotypes and temperatures can result in different baseline PI scores. However, the key finding is that the relative change in odor preference following an aversive stimulus is consistent: it increases the relative preference for an odor compared to air. This sometimes reverses valence (aversion to attraction) and other times simply reduces aversion. Our analysis focuses on this consistent, relative change.

      Different effects between the T-maze and the olfactory arena are found. The authors proposed that: "Punishment priming effect was still not detected, probably due to the insensitivity of the optogenetic arena". This is unclear to me, considering all prior work using this arena. The authors should discuss it more clearly.

      The punishment effect with CS+ present was reliably detected in the T-maze (Figure 1A) but was not significant in the olfactory arena (Figure 2—figure supplement 1B-C). We hypothesize that the olfactory arena assay is less sensitive than the T-maze for detecting such subtle behavioral changes. This is evidenced by the fact that even classical odor-shock conditioning yields lower PI in the arena (typically ~0.4) than in the T-maze (~0.8), likely due to the greater distance flies must explore and travel. The higher variance in the arena may therefore mask more modest effects. Here the effect under investigation was induced by optogenetically activating only a small subset of aversive dopaminergic neurons, a stimulus that is likely weaker than full electric shock. This reduced stimulus strength may have contributed to the challenge of detecting a significant effect in the less sensitive arena paradigm.

      They mentioned that flies could not be conditioned with air and electric shock. However, flies could be conditioned with the context + shock, which is changing in the T-maze and not in the optogenetic arena.

      While flies can be conditioned to context, during the optogenetic stimulation period in the arena, the light is delivered uniformly across all four quadrants. Therefore, any potential context conditioning would be equivalent across the entire chamber and should not bias the final distribution of flies between the odor and air quadrants during the test, nor affect the calculated PI score.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors revealed the cellular heterogeneity of companion cells (CCs) and demonstrated that the florigen gene FT is highly expressed in a specific subpopulation of these CCs in Arabidopsis. Through a thorough characterization of this subpopulation, they further identified NITRATE-INDUCIBLE GARP-TYPE TRANSCRIPTIONAL REPRESSOR 1 (NIGT1)-like transcription factors as potential new regulators of FT. Overall, these findings are intriguing and valuable, contributing significantly to our understanding of florigen and the photoperiodic flowering pathway. However, there is still room for improvement in the quality of the data and the depth of the analysis. I have several comments that may be beneficial for the authors. 

      Strengths: 

      The usage of snRNA-seq to characterize the FT-expressing companion cells (CCs) is very interesting and important. Two findings are novel: 1) Expression of FT in CCs is not uniform. Only a subcluster of CCs exhibits high expression level of FT. 2) Based on consensus binding motifs enriched in this subcluster, they further identify NITRATE-INDUCIBLE GARP-TYPE TRANSCRIPTIONAL REPRESSOR 1 (NIGT1)-like transcription factors as potential new regulators of FT. 

      We are pleased to hear that reviewer 1 noted the novelty and importance of our work. As reviewer 1 mentioned, we are also excited about the identification of a subcluster of companion cells with very high FT expression. We believe that this work is an initial step toward describing the molecular characteristics of these FT-expressing cells. We are also excited to share our new findings on NIGT1s as potential FT regulators. We believe this finding will attract a broader audience, as the molecular factor coordinating plant nutritional status with flowering time remains largely unknown, even though the phenomenon itself is well known.

      Weaknesses: 

      (1) Title: "A florigen-expressing subpopulation of companion cells". It is a bit misleading. The conclusion here is that only a subset of companion cells exhibit high expression of FT, but this does not imply that other companion cells do not express it at all. 

      We agree with this comment; it was not our intention to imply that FT is not produced in companion cells other than the subpopulation we identified. We revised the title to reflect this point more accurately. The new title is “Companion cells with high florigen production express other small proteins and reveal a nitrogen-sensitive FT repressor.”

      (2) Data quality: Authors opted for fluorescence-activated nuclei sorting (FANS) instead of traditional cell sorting method. What is the rationale behind this decision? Readers may wonder, especially given that RNA abundance in single nuclei is generally lower than that in single cells. This concern also applies to snRNA-seq data. Specifically, the number of genes captured was quite low, with a median of only 149 genes per nucleus. Additionally, the total number of nuclei analyzed was limited (1,173 for the pFT:NTF and 3,650 for the pSUC2:NTF). These factors suggest that the quality of the snRNA-seq data presented in this study is quite low. In this context, it becomes challenging for the reviewer to accurately assess whether this will impact the subsequent conclusions of the paper. Would it be possible to repeat this experiment and get more nuclei?

      We appreciate this comment; we realized that we had not clearly explained the rationale for using single-nucleus RNA sequencing (snRNA-seq) instead of single-cell RNA-seq (scRNA-seq). As reviewer 1 mentioned, RNA abundance in scRNA-seq is higher than in snRNA-seq. To conduct scRNA-seq using plant cells, protoplasting is a necessary step. However, in our study, protoplasting has many drawbacks for isolating our target cells from the phloem. First, it is technically challenging to efficiently isolate protoplasts from phloem companion cells, which are deeply embedded in plant tissues. Typically, at least several hours of enzymatic incubation are required to obtain protoplasts from companion cells (often using semi-isolated vasculatures), and the efficiency of protoplasting vasculature cells remains low. Second, preserving time-of-day information is also crucial for our analysis. Therefore, we employed a more rapid isolation method. In the revised manuscript, we added four new sentences to the Introduction section to explain our rationale for choosing snRNA-seq given these technical limitations.

      Reviewer 1 also raised a concern about the quality of our snRNA-seq data, referring to the relatively low read counts per nucleus. We believe that shallow reads do not necessarily indicate low quality, and we are confident in the accuracy of our snRNA-seq data, as supported by the detailed follow-up experiments (e.g., imaging analysis in Fig. 4B). Nevertheless, we agree that it is important to address this point in the revision and alleviate readers’ concerns regarding the data quality. 

      We believe the primary reason for the low read counts per nucleus is the small amount of RNA present in each Arabidopsis vascular cell nucleus that we isolated. For bulk nuclei RNA-seq, we collected 15,000 nuclei, yet the total RNA amount was approximately 3 ng. This indicates that each isolated nucleus contains a very limited amount of RNA (by a simple calculation, 3,000 pg / 15,000 nuclei = 0.2 pg/nucleus). It appears that cells and nuclei are still small in 2-week-old seedlings; thus, each nucleus may contain lower levels of RNA. During the optimization process, we also tried fixing the tissues in the hope of retaining nuclear RNA. Unfortunately, in our hands, fixation caused nuclei aggregation that hindered the sorting process, making the samples unsuitable for single-nucleus RNA-seq.
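
      The per-nucleus estimate above can be reproduced with a short calculation. This is only an illustrative sketch: the function name and parameter names are hypothetical, and the input figures (about 3 ng of total RNA from 15,000 sorted nuclei) are the ones stated in the text.

```python
def rna_per_nucleus_pg(total_rna_ng: float, n_nuclei: int) -> float:
    """Return the average RNA mass per nucleus in picograms.

    Hypothetical helper for illustration; it assumes RNA is spread
    evenly across all sorted nuclei, which is a simplification.
    """
    if n_nuclei <= 0:
        raise ValueError("n_nuclei must be positive")
    total_rna_pg = total_rna_ng * 1000.0  # 1 ng = 1,000 pg
    return total_rna_pg / n_nuclei


# Figures quoted in the response: ~3 ng total RNA from 15,000 nuclei.
estimate = rna_per_nucleus_pg(total_rna_ng=3.0, n_nuclei=15_000)
print(f"{estimate:.1f} pg per nucleus")
```

      The result is an average only; it says nothing about how RNA content varies between individual nuclei.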

      Reviewer 1 suggested that we repeat the same snRNA-seq experiment. We agree that having more cells increases the reliability of the data. However, to our knowledge, higher cell numbers enhance the confidence of clustering but not the read counts per cell. In our snRNA-seq data, our target, FT-expressing cells, were observed in cluster 7, which projected at a clear distance from the other cell clusters. Therefore, we think that having more nuclei would not significantly help in separating the high FT-expressing cluster 7 cells from other cell types, although we might obtain more DEGs from the cluster 7 cells. Considering the costs and time required for additional snRNA-seq experiments, we think that adding more follow-up molecular biology data is more practical. We clearly stated the limitations of our approach in the Discussion section: “A drawback of our snRNA-seq analysis was shallow reads per nucleus. It appears mainly due to the low abundance of mRNA in nuclei from 2-week-old leaves. Based on our calculation, the average mRNA level per nucleus is approximately 0.2 pg (3,000 pg mRNA from 15,000 sorted nuclei). Future technological advance is needed to improve the data quality.”

      In this revised version of the manuscript, we silenced FT expression using an amiRNA against FT driven by tissue-specific promoters [pROXY10, cluster 7; pSUC2, companion cells; pPIP2.6, cluster 4 (for the spatial expression pattern of PIP2.6, please see the new data shown in Fig. S8F); pGC1, guard cells]. Given that both FT and ROXY10 were highly expressed in cluster 7 of our snRNA-seq dataset, we anticipated a late-flowering phenotype for pROXY10:amiRNA-ft. As expected, pROXY10:amiR-ft but not pPIP2.6:amiR-ft lines showed delayed flowering (Fig. S14A), supporting the validity of our snRNA-seq approach. We are also now more confident in the resolution of our snRNA-seq analysis, since the amiRNA driven by the cluster 4-specific PIP2.6 promoter did not cause late flowering even though the basal expression of PIP2.6 is higher than that of ROXY10 (Fig. S14B).

      (3) Another disappointment is that the authors did not utilize reporter genes to identify the specific locations of the FT-high expressing cells (cluster 7 cells) within the CC population in vivo. Are there any discernible patterns that can be observed? 

      In the original manuscript, as we showed only limited spatial images of overlap between FT and other cluster 7 genes in Fig. 4B, this comment is totally understandable. To respond to it, we added whole leaf images showing the spatial expression of FT and other cluster 7 genes (Fig. S12). These data indicate that cluster 7 genes including FT are expressed highly in minor veins in the distal part of the leaf but weakly in the main vein. We also added enlarged images of spatial expression of FT and cluster 7 genes (FLP1 and ROXY10) to note that those genes do not overlap completely (Fig. S13).

      In contrast to cluster 7 genes, genes highly expressed in cluster 4, such as LTP1 and MLP28, are reportedly highly expressed in the main leaf vein. To confirm this further, we established a transgenic line that expresses a GFP-fusion protein controlled by the promoter of a cluster 4-specific gene, PIP2.6 (Fig. S8F). It also showed strong GFP signals in the main vein, consistent with previous observations of LTP1 and MLP28. In summary, FT-expressing cells (cluster 7 cells) are enriched in companion cells in the minor veins, and their expression patterns are clearly distinct from those of genes expressed in the main vein (e.g., cluster 4-specific genes). 

      (4) The final disappointment is that the authors only compared FT expression between the nigtQ mutants and the wild type. Does this imply that the mutant does not have a flowering time defect particularly under high nitrogen conditions? 

      We agree with reviewer 1 that more experiments are required to establish the role of NIGT1 in FT regulation, in addition to our Y1H data, the flowering time data of NIGT1 overexpressors, and FT expression in NIGT1 overexpressors and the nigtQ mutant.

      First, to test the direct regulation of NIGT1s on FT transcription, we conducted a transient luciferase (LUC) assay in tobacco leaves using effectors (p35S:NIGT1.2, p35S:NIGT1.4, and p35S:GFP) and reporters [pFT:LUC (FT promoter fused with LUC) and pFTm:LUC (the same FT promoter with mutations in NIGT1-binding sites fused with LUC)]. Our result showed that NIGT1.2 and NIGT1.4, but not GFP, decreased the activity of pFT:LUC but not pFTm:LUC (Fig. 5C). This indicates that NIGT1s directly repress the FT gene.

      Second, to address reviewer 1’s suggestion about the effect of the nigtQ mutation on flowering time, we grew WT and nigtQ plants on 20 mM and 2 mM NH<sub>4</sub>NO<sub>3</sub>. Under 20 mM NH<sub>4</sub>NO<sub>3</sub>, the nigtQ line bolted earlier than WT; under 2 mM NH<sub>4</sub>NO<sub>3</sub>, nigtQ and WT bolted at almost the same time (Fig. S17D and E). This result suggests that the nigtQ mutation affects flowering time depending on nitrogen status. However, leaf numbers at bolting did not differ between WT and nigtQ lines (Fig. S17E). Therefore, it appears that the nigtQ mutation accelerated the overall growth of plants rather than specifically promoting flowering. We also measured flowering time by counting the leaf numbers of nigtQ and WT plants at bolting on nitrogen-rich soil. The mutant generated slightly more leaves than WT at flowering (Fig. S17G). These results suggest that the NIGT1-dependent fine-tuning of FT regulation is conditional on higher nitrogen conditions. 

      Minor: 

      (1) Abstract: "Our bulk nuclei RNA-seq demonstrated that FT-expressing cells in cotyledons and in true leaves differed transcriptionally.". This sentence is not informative. What exactly is the difference in FT-expressing cells between cotyledons and true leaves? 

      We modified the sentence to clarify the differences between cotyledons and true leaves. “Our bulk nuclei RNA-seq demonstrated that FT-expressing cells in cotyledons and true leaves showed differences especially in FT repressor genes.”

      (2) As a standard practice, to support the direct regulation of FT by NIGT1, the authors should provide EMSA and ChIP-seq data. Ideally, they should also generate promoter constructs with deletions or mutations in the NIGT1 binding sites. 

      To test the direct interaction of NIGT1 with the FT promoter sequences, we performed a transient reporter assay using an FT promoter-driven luciferase reporter (Fig. 5C). NIGT1.2 and NIGT1.4 repressed the FT promoter activity; however, this repression was not observed when the NIGT1 binding sites were mutated, indicating that NIGT1 binds to the cis-elements in the FT promoter to repress its transcription.

      (3) Sorting: Did the authors fix the samples before preparing the nuclei suspension? If not, could this be the reason the authors observed the JA-responsive clusters (Fig. 2J)? Please provide more details related to nuclei sorting in the Methods section. 

      We added a new subsection in the Materials and Methods section to describe the details of the nuclei sorting procedure. We did not include a sample fixation step. We tried formaldehyde fixation; however, it caused nuclei to clump, which made the samples unsuitable for snRNA-seq. Moreover, fixation steps generally reduce the read counts of single-cell RNA-seq according to the 10X Genomics guidelines.

      We agree that JA responses were triggered during the FANS nuclei isolation. Therefore, we added the following sentence: “Since our FANS protocol did not include a sample fixation step to avoid clumping, these cells likely triggered wounding responses during the chopping and sorting process (Fig. S1B).”

      Reviewer #2 (Public review): 

      This manuscript submitted by Takagi et al. details the molecular characterization of the FT-expressing cell at a single-cell level. The authors examined what genes are expressed specifically in FT-expressing cells and other phloem companion cells by exploiting bulk nuclei and single-nuclei RNA-seq and transgenic analysis. The authors found the unique expression profile of FT-expressing cells at a single-cell level and identified new transcriptional repressors of FT such as NIGT1.2 and NIGT1.4. 

      Although previous researchers have known that FT is expressed in phloem companion cells, they have tended to neglect the molecular characterization of the FT-expressing phloem companion cells. To understand how FT, which is expressed in tiny amounts in phloem companion cells that make up a very small portion of the leaf, can be a key molecule in the regulation of the critical developmental step of floral transition, it is important to understand the molecular features of FT-expressing cells in detail. In this regard, this manuscript provides insight into the understanding of detailed molecular characteristics of the FT-expressing cell. This endeavor will contribute to the research field of flowering time. 

      We are grateful that reviewer 2 recognizes the importance of transcriptome profiling of FT-expressing cells at the single-cell level.

      Here are my comments on how to improve this manuscript. 

      (1) The most novel finding of this manuscript is the identification of NIGT1.2 as the upstream regulator of FT-expressing cluster 7 gene expression. The flowering phenotypes of the nigtQ mutant and the transgenic plants in which NIGT1.2 was expressed under the SUC2 gene promoter support that NIGT1.2 functions as a floral repressor upstream of the FT gene. Nevertheless, the expression patterns of NIGT1.2 genes do not appear to have much overlap with those of NIGT1.2-downstream genes in the cluster 7 (Figs S14 and F3). An explanation for this should be provided in the discussion section. 

      We agree with reviewer 2 that the spatial expression patterns of NIGT1.2 and cluster 7 genes do not overlap much, and some discussion should be provided in the manuscript. Although we do not have a concrete answer for this phenomenon, we obtained the new data showing that NIGT1.2 and NIGT1.4 directly repress the FT gene in planta (Fig. 5C).  As NIGT1.2/1.4 are negative regulators of FT, it is plausible that NIGT1.2/1.4 may suppress FT gene expression in non-cluster 7 cells to prevent the misexpression of FT. We added this point in the Results section.

      (2) To investigate gene expression in the nuclei of specific cell populations, the authors generated transgenic plants expressing a fusion gene encoding a Nuclear Targeting Fusion protein (NTF) under the control of various cell type-specific promoters. Since the public audience would not know about NTF without reading reference 16, some explanation of NTF is necessary in the manuscript. Please provide a schematic of constructs the authors used to make the transformants.

      As reviewer 2 pointed out, we lacked a clear explanation of why we used NTF in this study. NTF is the fusion protein that consists of a nuclear envelope targeting WPP domain, GFP, and a biotin acceptor peptide. It was initially designed for the INTACT (isolation of nuclei tagged in specific cell types) method, which enables us to isolate bulk nuclei from specific tissues. Although our original intention was to profile the bulk transcriptome of mRNAs that exist in nuclei of the FT-expressing cells using INTACT, we utilized our NTF transgenic lines for snRNA-seq analysis. To explain what NTF is to readers, we included a schematic diagram of NTF (Fig. S1A) and more explanation about NTF in the Results section.

      Again, we appreciate all reviewers’ careful and constructive comments. With these changes, we hope our revised manuscript is now satisfactory.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      Summary: 

      The study by Klug et al. investigated the pathway specificity of corticostriatal projections, focusing on two cortical regions. Using a G-deleted rabies system in D1-Cre and A2a-Cre mice to retrogradely deliver channelrhodopsin to cortical inputs, the authors found that M1 and MCC inputs to direct and indirect pathway spiny projection neurons (SPNs) are both partially segregated and asymmetrically overlapping. In general, corticostriatal inputs that target indirect pathway SPNs are likely to also target direct pathway SPNs, while inputs targeting direct pathway SPNs are less likely to also target indirect pathway SPNs. Such asymmetric overlap of corticostriatal inputs has important implications for how the cortex itself may determine striatal output. Indeed, the authors provide behavioral evidence that optogenetic activation of M1 or MCC cortical neurons that send axons to either direct or indirect pathway SPNs can have opposite effects on locomotion and different effects on action sequence execution. The conclusions of this study add to our understanding of how cortical activity may influence striatal output and offer important new clues about basal ganglia function. 

      The conceptual conclusions of the manuscript are supported by the data, but the details of the magnitude of afferent overlap and causal role of asymmetric corticostriatal inputs on some behavioral outcomes may be a bit overstated given technical limitations of the experiments. 

      For example, after virally labeling either direct pathway (D1) or indirect pathway (D2) SPNs to optogenetically tag pathway-specific cortical inputs, the authors report that a much larger number of "non-starter" D2-SPNs from D2-SPN labeled mice responded to optogenetic stimulation in slices than "non-starter" D1 SPNs from D1-SPN labeled mice did. Without knowing the relative number of D1 or D2 SPN starters used to label cortical inputs, it is difficult to interpret the exact meaning of the lower number of responsive D2-SPNs in D1 labeled mice (where only ~63% of D1-SPNs themselves respond) compared to the relatively higher number of responsive D1-SPNs (and D2-SPNs) in D2 labeled mice. While relative differences in connectivity certainly suggest that some amount of asymmetric overlap of inputs exists, differences in infection efficiency and ensuing differences in detection sensitivity in slice experiments make determining the degree of asymmetry problematic. 

      It is also unclear if retrograde labeling of D1-SPN- vs D2-SPN- targeting afferents labels the same densities of cortical neurons. This gets to the point of specificity in some of the behavioral experiments. If the target-based labeling strategies used to introduce channelrhodopsin into specific SPN afferents label significantly different numbers of cortical neurons, might the difference in the relative numbers of optogenetically activated cortical neurons itself lead to behavioral differences? 

      We thank the reviewer for the comments and for raising additional interpretations of our results. We agree that determining the relative number of D1- versus D2-SPN starter cells would allow a more accurate estimate of connectivity. However, due to current technical limitations, achieving this level of precision remains challenging. As the reviewer also noted, differences in the number of cortical neurons targeting D1- versus D2-SPNs could introduce additional complexity to the functional effects observed in the behavioral experiments. Moreover, functional heterogeneity is likely to exist not only among cortical neurons projecting to striatal D1- or D2-SPNs, but also within the striatal D1- and D2-SPN populations themselves. Addressing these questions at the single-neuron level will require more refined viral tools in combination with improved recording and manipulation techniques. Despite these limitations, our results suggest that a subpopulation of cortical neurons selectively targets striatal D1-SPNs, supporting a functional dichotomy of pathway-specific corticostriatal subcircuits in the control of behavior.   

      Reviewer #2 (Public review): 

      Summary: 

      Klug et al. use monosynaptic rabies tracing of inputs to D1- vs D2-SPNs in the striatum to study how separate populations of cortical neurons project to D1- and D2-SPNs. They use rabies to express ChR2, then patch D1-or D2-SPNs to measure synaptic input. They report that cortical neurons labeled as D1-SPN-projecting preferentially project to D1-SPNs over D2-SPNs. In contrast, cortical neurons labeled as D2-SPN-projecting project equally to D1- and D2-SPNs. They go on to conduct pathway-specific behavioral stimulation experiments. They compare direct optogenetic stimulation of D1- or D2-SPNs to stimulation of MCC inputs to DMS and M1 inputs to DLS. In three different behavioral assays (open field, intra-cranial self-stimulation, and a fixed ratio 8 task), they show that stimulating MCC or M1 cortical inputs to D1-SPNs is similar to D1-SPN stimulation, but that stimulating MCC or M1 cortical inputs to D2-SPNs does not recapitulate the effects of D2-SPN stimulation (presumably because both D1- and D2-SPNs are being activated by these cortical inputs). 

      Strengths: 

      Showing these same effects in three distinct behaviors is strong. Overall, the functional verification of the consequences of the anatomy is very nice to see. It is a good choice to patch only from mCherry-negative non-starter cells in the striatum. This study adds to our understanding of the logic of corticostriatal connections, suggesting a previously unappreciated structure. 

      Weaknesses: 

      One limitation is that all inputs to SPNs are expressing ChR2, so they cannot distinguish between different cortical subregions during patching experiments. Their results could arise because the same innervation patterns are repeated in many cortical subregions or because some subregions have preferential D1-SPN input while others do not. 

      Thank you for raising this thoughtful concern. It is indeed not feasible to restrict ChR2 expression to a specific cortical region using the first-generation rabies-ChR2 system alone. A more refined approach would involve injecting Cre-dependent TVA and RG into the striatum of D1- or A2A-Cre mice, followed by rabies-Flp infection. Subsequently, a Flp-dependent ChR2 virus could be injected into the MCC or M1 to selectively label D1- or D2-projecting cortical neurons. This strategy would allow for more precise targeting and address many of the current limitations.

      However, a significant challenge lies in the cytotoxicity associated with rabies virus infection. Neuronal health begins to deteriorate substantially around 10 days post-infection, which leaves an insufficient window for robust Flp-dependent ChR2 expression. We have tested several new rabies virus variants with extended survival times (Chatterjee et al., 2018; Jin et al., 2024), but unfortunately, they did not perform well in the corticostriatal systems we examined.

      In our experimental design, the aim is to delineate the connectivity probabilities from cortical neurons to D1- or D2-SPNs. Our hypothesis allows for the possibility that similar innervation patterns occur across multiple cortical subregions, that some subregions show preferential input to D1-SPNs while others do not, or a combination of both scenarios. This led us to perform a series of behavioral tests using optogenetic activation of the D1- or D2-projecting cortical populations to see which could be the case.

      In the cortical areas we examined, MCC and M1, the behavioral results are consistent with our electrophysiological results. Specifically, when we stimulated the D1-projecting cortical neurons in either MCC or M1, mice exhibited facilitated locomotion in the open field test, which is the same as the effect of activating D1-SPNs in the striatum alone (MCC: Fig 3C & D vs. I; M1: Fig 3F & G vs. L). Conversely, stimulation of D2-projecting MCC or M1 cortical neurons resulted in behavioral effects that appeared to combine characteristics of both D1- and D2-SPN activation in the striatum (MCC: Fig 3C & D vs. J; M1: Fig 3F & G vs. M). Similar results were observed in the ICSS test. Our interpretation of these results is that the activation of D1-projecting neurons in the cortex induces behavioral changes akin to D1 neuron activation, while activation of D2-projecting neurons in the cortex leads to a combined effect of both D1 and D2 neuron activation. This suggests that at least some cortical regions, namely the ones we tested, follow the hypothesis we proposed.

      There are also some caveats with respect to the efficacy of rabies tracing. Although they only patch non-starter cells in the striatum, only 63% of D1-SPNs receive input from D1-SPN-projecting cortical neurons. It's hard to say whether this is "high" or "low," but one question is how far from the starter cell region they are patching. Without this spatial indication of where the cells that are being patched are relative to the starter population, it is difficult to interpret if the cells being patched are receiving cortical inputs from the same neurons that are projecting to the starter population. The authors indicate they are patching from mCherry-negative neurons within the region of the mCherry-positive neurons, but since the mCherry population will include both true starter cells and monosynaptically connected cells, this is not perfectly precise. Convergence of cortical inputs onto SPNs may vary with distance from the starter cell region quite dramatically, as other mapping studies of corticostriatal inputs have shown specialized local input regions can be defined based on cortical input patterns (Hintiryan et al., Nat Neurosci, 2016, Hunnicutt et al., eLife 2016, Peters et al., Nature, 2021). 

      This is a valid concern regarding anatomical studies. Investigating cortico-striatal connectivity at the single-cell level remains technically challenging due to current methodological limitations. At present, we rely on rabies virus-mediated trans-synaptic retrograde tracing to identify D1- or D2-projecting cortical populations. This anatomical approach is coupled with ex vivo slice electrophysiology to assess the functional connectivity between these projection-defined cortical neurons and striatal SPNs. This enables us to quantify connection ratios, for example, the proportion of D1-projecting cortical neurons that functionally synapse onto non-starter D1-SPNs.

      To ensure the robustness of our conclusions, it is essential that both the starter cells and the recorded non-starter SPNs receive comparable topographical input from the cortex and other brain regions. Therefore, we carefully designed our experiments so that all recorded cells were located within the injection site, were mCherry-negative (i.e., non-starter cells), and were surrounded by ChR2-mCherry-positive neurons. This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry.

      These methodological details are also described in the section on ex vivo brain slice electrophysiology, specifically in the Methods section, lines 453–459:

      “D1-SPNs (eGFP-positive in D1-eGFP mice, or eGFP-negative in D2-eGFP mice) or D2-SPNs (eGFP-positive in D2-eGFP mice, or eGFP-negative in D1-eGFP mice) that were ChR2-mCherry-negative, but in the injection site and surrounded by cells expressing ChR2-mCherry were targeted for recording. This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry.”

      This experimental strategy was implemented to control for potential spatial biases and to enhance the interpretability of our connectivity measurements.

      A caveat for the optogenetic behavioral experiments is that these optogenetic experiments did not include fluorophore-only controls, although a different control (with light delivered in M1) is provided in Supplementary Figure 3. Another point of confusion is that other studies (Cui et al, J Neurosci, 2021) have reported that stimulation of D1-SPNs in DLS inhibits rather than promotes movement. This study may have given different results due to subtly different experimental parameters, including fiber optic placement and NA.

      We appreciate the reviewer’s thoughtful evaluation and comments. We have added a short discussion of Cui et al.’s study on optogenetic stimulation of D1-SPNs in the DLS (lines 341-343), which reports findings that contrast with ours and those of other studies.

      Reviewer #3 (Public review): 

      Review of resubmission: The authors provided a response to the reviews from myself and other reviewers. While some points were made satisfactorily, particularly in clarification of the innervation of cortex to striatum and the effects of input stimulation, many of my points remain unaddressed. In several cases, the authors chose to explain their rationale rather than address the issues at hand. A number of these issues (in fact, the majority) could be addressed simply by toning down the confidence in conclusions, so it was disappointing to see that the authors by and large did not do this. I repeat my concerns below and note whether I find them to have been satisfactorily addressed or not. 

      In the manuscript by Klug and colleagues, the investigators use a rabies virus-based methodology to explore potential differences in connectivity from cortical inputs to the dorsal striatum. They report that the connectivity from cortical inputs onto D1 and D2 MSNs differs in terms of their projections onto the opposing cell type, and use these data to infer that there are differences in cross-talk between cortical cells that project to D1 vs. D2 MSNs. Overall, this manuscript adds to the overall body of work indicating that there are differential functions of different striatal pathways which likely arise at least in part by differences in connectivity that have been difficult to resolve due to difficulty in isolating pathways within striatal connectivity, and several interesting and provocative observations were reported. Several different methodologies are used, with partially convergent results, to support their main points. 

      However, I have significant technical concerns about the manuscript as presented that make it difficult for me to interpret the results of the experiments. My comments are below. 

      Major: 

      There is generally a large caveat to the rabies studies performed here, which is that both TVA and the ChR2-expressing rabies virus have the same fluorophore. It is thus essentially impossible to determine how many starter cells there are, what the efficiency of tracing is, and which part of the striatum is being sampled in any given experiment. This is a major caveat given the spatial topography of the cortico-striatal projections. Furthermore, the authors make a point in the introduction about previous studies not having explored absolute numbers of inputs, yet this is not at all controlled in this study. It could be that their rabies virus simply replicates better in D1-MSNs than D2-MSNs. No quantifications are done, and these possibilities do not appear to have been considered. Without a greater standardization of the rabies experiments across conditions, it is difficult to interpret the results. 

      This is still an issue. The authors point out why they chose various vectors. I can understand why the authors chose the fluorophores etc. that they did, yet the issues I raised previously are still valid. The discussion should mention that this is a potential issue. It does not necessarily invalidate results, but it is an issue. Furthermore, it is possible (in all systems) that rabies replicates better/more efficiently in some cells than others. This is one possible interpretation that has not really been explored in any study. I don't suggest the authors attempt to do that, but it should be raised as a potential interpretation. If the rabies results could mean several different things, the authors owe it to the readership to state all possible interpretations of data.

      We thank the reviewer for the comments and suggestions. Because the same fluorophore (mCherry) was used in both TVA- and ChR2-expressing viruses, it was not possible to distinguish true starter SPNs from TVA-only SPNs or monosynaptically labeled SPNs. This limitation makes it difficult to precisely assess the efficiency of rabies labeling and retrograde tracing in our experimental setup. Moreover, differences in rabies replication efficiency between D1- and D2-SPNs could potentially lead to an apparent lower connection probability from D1-projecting cortical neurons to D2-SPNs than from D2-projecting cortical neurons to D1-SPNs. We have added this clarification to the Discussion (lines 280-297).

      The authors claim using a few current clamp optical stimulation experiments that the cortical cells are healthy, but this result was far from comprehensive. For example, membrane resistance, capacitance, general excitability curves, etc are not reported. In Figure S2, some of the conditions look quite different (e.g., S2B, input D2-record D2, the method used yields quite different results that the authors write off as not different). Furthermore, these experiments do not consider the likely sickness and death that occurs in starter cells, as has been reported elsewhere. Health of cells in the circuit is overall a substantial concern that alone could invalidate a large portion, if not all, of the behavioral results. This is a major confound given those neurons are thought to play critical roles in the behaviors being studied. This is a major reason why first-generation rabies viruses have not been used in combination with behavior, but this significant caveat does not appear to have been considered, and controls e.g., uninfected animals, infected with AAV helpers, etc, were not included. 

      This issue remains unaddressed. I did not request clarity about experimental design, but rather, raised issues about the potential effects of toxicity. I believe this to be a valid concern that needs to be discussed in the manuscript, especially given what look visually like potential differences in S2. 

      We understand and appreciate the reviewer’s concern regarding the potential cytotoxicity of rabies virus infection. Although we performed the in vivo optogenetic behavioral experiments during a period when rabies-infected cells are generally considered relatively healthy, some deficits in starter cells may still occur and could contribute to the observed effects of optogenetic cortical stimulation. We have added this clarification to the Discussion (lines 298-306).

      The overall purity (e.g., EnvA pseudotyping efficiency) of the RABV prep is not shown. If there was a virus that was not well EnvA-pseudotyped and thus could directly infect cortical (or other) inputs, it would degrade specificity. This issue has not been addressed. Viral strain is irrelevant. The quality of the specific preparations used is what matters.

      While most of the study focuses on the cortical inputs, in slice recordings, inputs from the thalamus are not considered, yet likely contribute to the observed results. Related to this, in in vivo optogenetic experiments, technically, if the thalamic or other inputs to the dorsal striatum project to the cortex, their method will not only target cortical neurons but also terminals of other excitatory inputs. If this cannot be ruled out, the statement that the authors are able to selectively activate the cortical inputs to one or the other population should be toned down.

      The authors added text to the discussion to address this point. While it largely does what is intended, based on the one study cited, I disagree with the authors' conclusions that it is "clear" that potential contamination from other sites does not play a role. The simplest interpretation is the one the authors state, and there is some supporting evidence to back up that assertion, but to me that falls short of making the point "clear" that there are no other interpretations. 

      The statements about specificity of connectivity are not well founded. It may be that in the specific case where they are assessing outside of the area of injections, their conclusions may hold (e.g., excitatory inputs onto D2s have more inputs onto D1s than vice versa). However, how this relates to the actual site of injection is not clear. At face value, if such a connectivity exists, it would suggest that D1-MSNs receive substantially more overall excitatory inputs than D2s. It is thus possible that this observation would not hold over other spatial intervals. This was not explored and thus the conclusions are over-generalized. e.g., the distance from the area of red cells in the striatum to recordings was not quantified, what constituted a high level of cortical labeling was not quantified, etc. Without more rigorous quantification of what was being done, it is difficult to interpret the results. 

      Again, the goal here would be to make a statement about this in the discussion to clarify limitations of the study. I don't expect the authors to re-do all of these experiments, but since they are discussing the corticostriatal circuits, which have multiple subdomains, this remains a relevant point. It has not been addressed. 

      The results in Figure 3 are not well controlled. The authors show contrasting effects of optogenetic stimulation of D1-MSNs and D2-MSNs in the DMS and DLS, results which are largely consistent with the canon of basal ganglia function. However, when stimulating cortical inputs, stimulating the inputs from D1-MSNs gives the expected results (increased locomotion) while stimulating putative inputs to D2-MSNs had no effect. This is not the same as showing a decrease in locomotion - showing no effect here is not possible to interpret. 

      I think that the caveat of showing no clear effects of stimulating inputs to D2-MSNs should be pointed out. Yes, I understand that the viruses appeared to express etc., but it remains possible that the results are driven by a lack of, e.g., sufficient ChR2 expression. Short of a full quantification of the number of cells expressing ChR2 and of the overlap between fiber placement and ChR2 expression (which I don't suggest), this possibility cannot be excluded and should be pointed out.

      In the light of their circuit model, the result showing that inputs to D2-MSNs drive ICSS is confusing. How can the authors account for the fact that these cells are not locomotor-activating, stimulation of their putative downstream cells (D2-MSNs) does not drive ICSS, yet the cortical inputs drive ICSS? Is the idea that these inputs somehow also drive D1s? If this is the case, how do D2s get activated, if all of the cortical inputs tested net activate D1s and not D2s? Same with the results in Figure 4 - the inputs and putative downstream cells do not have the same effects. Given potential caveats of differences in viral efficiency, spatial location of injections, and cellular toxicity, I cannot interpret these experiments. 

      The explanation the authors provide in their rebuttal makes sense, however this should be included in the discussion of the manuscript, as it is interesting and relevant. 

      We thank the reviewer for the valuable comments and suggestions. In line with the reviewer’s recommendation, we have incorporated these explanations into the Discussion (lines 242–279) to help interpret the complex behavioral outcomes of optogenetic stimulation of cortical neurons projecting to D1- or D2-SPNs.

      Reviewer #2 (Recommendations for the authors): 

      I appreciate the authors' responses, which helped clarify some experimental choices. I appreciate that the experiment in Fig S3 serves as a reasonable light control for optogenetics experiments. The careful comparison with methods in Cui et al. (2021) is useful, although not added to the main manuscript. Some of the other citations here don't really address the controversy, e.g. Kravitz et al. is in DMS, but perhaps fully addressing this issue is outside the scope of the current manuscript and awaits further experiments. I also appreciate the clarification for recording locations that "This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry." However, the statement in the reviewer response does not seem to be added to the manuscript's methods, which I think would be helpful. The criteria for choosing recorded cells are still a bit fuzzy without a map of recording locations and histology. There is also a problem that mCherry-positive cells could be starter cells or could be monosynaptically traced cells, so it is hard to know the area of the starter cell population in these experiments for sure. My evaluation of the manuscript remains largely the same as the original. However, I have adjusted my public review a bit to incorporate the authors' responses. I still think this paper has valuable information, suggesting an interesting and previously unappreciated structure of corticostriatal inputs that I hope this group and others will continue to investigate and incorporate into models of basal ganglia function.

      We thank the reviewer for the valuable suggestions. We have now included a comparison with Cui et al. in the Discussion. In addition, we have added the criteria for selecting recorded cells to the Methods section: ‘This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry.’

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary: 

      This paper applies methods for segmentation, annotation, and visualization of acoustic analysis to zebra finch song. The paper shows that these methods can be used to predict the stage of song development and to quantify acoustic similarity. The methods are solid and are likely to provide a useful tool for scientists aiming to label large datasets of zebra finch vocalizations. The paper has two main parts: 1) establishing a pipeline/ package for analyzing zebra finch birdsong and 2) a method for measuring song imitation. 

      Strengths: 

      It is useful to see existing methods for syllable segmentation compared to new datasets.

      It is useful, but not surprising, that these methods can be used to predict developmental stage, which is strongly associated with syllable temporal structure.

      It is useful to confirm that these methods can identify abnormalities in deafened and isolated songs. 

      Weaknesses: 

      For the first part, the implementation seems to be a wrapper on existing techniques. For instance, the first section talks about syllable segmentation; they made a comparison between WhisperSeg (Gu et al, 2024), TweetyNet (Cohen et al, 2022), and amplitude thresholding. They found that WhisperSeg performed the best, and they included it in the pipeline. They then used WhisperSeg to analyze syllable duration distributions and rhythm of birds of different ages and confirmed past findings on this developmental process (e.g. Aronov et al, 2011). Next, based on the segmentation, they assign labels by performing UMAP and HDBSCAN on the spectrogram (nothing new; that's what people have been doing). Then, based on the labels, they claimed they developed a 'new' visualization, the syntax raster (line 180). That was done by Sainburg et al. 2020 in Figure 12E and also in Cohen et al, 2020, so the claim to have developed 'a new song syntax visualization' is confusing. The rest of the paper is about analyzing the finch data based on AVN features (which are essentially acoustic features already in the classic literature).

      First, we would like to thank this reviewer for their kind comments and feedback on this manuscript. It is true that many of the components of this song analysis pipeline are not entirely novel in isolation. Our real contribution here is bringing them together in a way that allows other researchers to seamlessly apply automated syllable segmentation, clustering, and downstream analyses to their data. That said, our approach to training TweetyNet for syllable segmentation is novel. We trained TweetyNet to recognize vocalizations vs. silence across multiple birds, such that it can generalize to new individual birds, whereas TweetyNet had previously only been used to annotate song syllables from birds included in its training set. Our validation of TweetyNet and WhisperSeg in combination with UMAP and HDBSCAN clustering is also novel, providing valuable information about how these systems interact, and how reliable the completely automatically generated labels are for downstream analysis. We have added a couple of sentences to the introduction to emphasize the novelty of this approach and validation.

      Our syntax raster visualization does resemble Figure 12E in Sainburg et al. 2020; however, it differs in a few important ways, which we believe warrant its consideration as a novel visualization method. First, Sainburg et al. represent the labels across bouts in real time; their position along the x axis reflects the time at which each syllable is produced relative to the start of the bout. By contrast, our visualization considers only the index of syllables within a bout (i.e., first syllable vs. second syllable, etc.) without consideration of the true durations of each syllable or the silent gaps between them. This makes it much easier to detect syntax patterns across bouts, as the added variability of syllable timing is removed. Considering only the sequence of syllables rather than their timing also allows us to more easily align bouts according to the first syllable of a motif, further emphasizing the presence or absence of repeating syllable sequences without interference from the more variable introductory notes at the start of a motif. Finally, instead of plotting all bouts in the order in which they were produced, our visualization orders bouts such that bouts with the same sequence of syllables will be plotted together, which again serves to emphasize the most common syllable sequences that the bird produces. These additional processing steps mean that our syntax raster plot has much starker contrast between birds with stereotyped syntax and birds with more variable syntax, as compared to the more minimally processed visualization in Sainburg et al. 2020. There do not appear to be any similar visualizations in Cohen et al. 2020.
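
      As an illustration only (this is not the AVN implementation), the three processing steps described above can be sketched in a few lines of Python: keep only syllable order, align each bout to the first motif syllable, and group identical sequences together. The function name `build_syntax_raster`, the `motif_start` argument, and the choice to drop introductory notes rather than offset them are all assumptions made for this sketch.

```python
from collections import Counter


def build_syntax_raster(bouts, motif_start):
    """Return one row of syllable labels per bout, aligned and sorted.

    bouts: non-empty list of bouts, each a list of syllable labels in sung order.
    motif_start: label assumed to mark the first syllable of the motif.
    """
    aligned = []
    for bout in bouts:
        # Only the index of each syllable matters; durations and gaps are ignored.
        # Simplification: introductory notes before the motif are dropped.
        start = bout.index(motif_start) if motif_start in bout else 0
        aligned.append(tuple(bout[start:]))

    # Plot bouts with the same syllable sequence together, most common first.
    counts = Counter(aligned)
    aligned.sort(key=lambda seq: (-counts[seq], seq))

    # Pad shorter bouts with None so every row has the same width.
    width = max(len(seq) for seq in aligned)
    return [list(seq) + [None] * (width - len(seq)) for seq in aligned]
```

      Each row of the returned list can then be drawn as one line of a raster, with one color per label. A bird with stereotyped syntax yields large blocks of identical rows, whereas variable syntax yields many distinct rows.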

      The second part may be something new, but there are opportunities to improve the benchmarking. It is about the pupil-tutor imitation analysis. They introduce a convolutional neural network that takes triplets as input. Each triplet is essentially three images stacked together as (anchor, positive, negative): the anchor is a reference spectrogram from, say, finch A; the positive is a different spectrogram with the same label as the anchor from finch A; and the negative is a spectrogram that is either unrelated to A or carries a different syllable label from A. The network is then trained to produce a low-dimensional embedding by ensuring the embedding distance between anchor and positive is less than that between anchor and negative by a certain margin. Based on the embedding, they then made use of earth mover distance to quantify the similarity in the syllable distribution among finches. They then compared their approach's performance with that of Sound Analysis Pro (SAP) and a variant of SAP. A more natural comparison, which they didn't include, is with the VAE approach by Goffinet et al. In that paper (https://doi.org/10.7554/eLife.67855, Fig 7), they also attempted to perform an analysis on the tutor-pupil song.

      We thank the reviewer for this suggestion. We have included a comparison of our triplet loss embedding model to the VAE model proposed in Goffinet et al. 2021. We also included comparisons of similarity scoring using each of these embedding models combined with either earth mover’s distance (EMD) or maximum mean discrepancy (MMD) to calculate the similarity of the embeddings, as was done in Goffinet et al. 2021. As discussed in the updated results section of the paper and shown in the new Figure 6–figure supplement 1, the Triplet loss model with MMD performs best for evaluating song learning on new birds, not included in model training. We’ve updated the main text of the paper to reflect this switch from EMD to MMD for the primary similarity scoring approach.
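
      For readers unfamiliar with MMD, the following is a minimal sketch of a biased estimate of squared MMD between two sets of embedded syllables. The Gaussian kernel, the `bandwidth` value, and the function names are assumptions made for illustration; they are not taken from AVN or from Goffinet et al. 2021.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    # Within-sample similarity of each set minus twice the cross-sample similarity.
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

      With `xs` holding a tutor's syllable embeddings and `ys` a pupil's, a value near zero indicates closely matching syllable distributions and larger values indicate greater dissimilarity. The estimate depends on the kernel bandwidth, so scores are only comparable when the same bandwidth is used throughout.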

      Reviewer #2 (Public Review):

      Summary: 

      In this work, the authors present a new Python software package, Avian Vocalization Network (AVN) aimed at facilitating the analysis of birdsong, especially the song of the zebra finch, the most common songbird model in neuroscience. The package handles some of the most common (and some more advanced) song analyses, including segmentation, syllable classification, featurization of song, calculation of tutor-pupil similarity, and age prediction, with a view toward making the entire process friendlier to experimentalists working in the field.

      For many years, Sound Analysis Pro has served as a standard in the songbird field, the first package to extensively automate songbird analysis and facilitate the computation of acoustic features that have helped define the field. More recently, the increasing popularity of Python as a language, along with the emergence of new machine learning methods, has resulted in a number of new software tools, including the vocalpy ecosystem for audio processing, TweetyNet (for segmentation), t-SNE and UMAP (for visualization), and autoencoder-based approaches for embedding.

      Strengths: 

      The AVN package overlaps several of these earlier efforts, albeit with a focus on more traditional featurization that many experimentalists may find more interpretable than deep learning-based approaches. Among the strengths of the paper are its clarity in explaining the several analyses it facilitates, along with high-quality experiments across multiple public datasets collected from different research groups. As a software package, it is open source, installable via the pip Python package manager, and features high-quality documentation, as well as tutorials. For experimentalists who wish to replicate any of the analyses from the paper, the package is likely to be a useful time saver.

      Weaknesses: 

      I think the potential limitations of the work are predominantly on the software end, with one or two quibbles about the methods.

      First, the software: it's important to note that the package is trying to do many things, of which it is likely to do several well and few comprehensively. Rather than a package that presents a number of new analyses or a new analysis framework, it is more a codification of recipes, some of which are reimplementations of existing work (SAP features), some of which are essentially wrappers around other work (interfacing with WhisperSeg segmentations), and some of which are new (similarity scoring). All of this has value, but in my estimation, it has less value as part of a standalone package and potentially much more as part of an ecosystem like vocalpy that is undergoing continuous development and has long-term support. 

      We appreciate this reviewer’s comments and concerns about the structure of the AVN package and its long-term maintenance. We have considered incorporating AVN into the VocalPy ecosystem but have chosen not to for a few key reasons. (1) AVN was designed with ease of use for experimenters with limited coding experience top of mind. VocalPy provides excellent resources for researchers with some familiarity with object-oriented programming to manage and analyze their datasets; however, we believe it may be challenging for users without such experience to adopt VocalPy quickly. AVN’s ‘recipe’ approach, as you put it, is very easily accessible to new users, and allows users with intermediate coding experience to easily navigate the source code to gain a deeper understanding of the methodology. AVN also consistently outputs processed data in familiar formats (tables in .csv files which can be opened in Excel), in an effort to make it more accessible to new users, something which would be challenging to reconcile with VocalPy’s emphasis on their `dataset` classes. (2) AVN and VocalPy differ in their underlying goals and philosophies when it comes to flexibility vs. standardization of analysis pipelines. VocalPy is designed to facilitate mixing-and-matching of different spectrogram generation, segmentation, annotation etc. approaches, so that researchers can design and implement their own custom analysis pipelines. This flexibility is useful in many cases. For instance, it could allow researchers who have very different noise filtering and annotation needs, like those working with field recordings versus acoustic chamber recordings, to analyze their data using this platform. However, when it comes to comparisons across zebra finch research labs, this flexibility comes at the expense of direct comparison and integration of song features across research groups. This is the context in which AVN is most useful. It presents a single approach to song segmentation, labeling, and featurization that has been shown to generalize well across research groups, and which allows direct comparisons of the resulting features. AVN’s single, extensively validated, standard pipeline approach is fundamentally incompatible with VocalPy’s emphasis on flexibility. We are excited to see how VocalPy continues to evolve in the future, and recognize the value that both AVN and VocalPy bring to the songbird research community, each with their own distinct strengths, weaknesses, and ideal use cases.

      While the code is well-documented, including web-based documentation for both the core package and the GUI, the latter is available only on Windows, which might limit the scope of adoption. 

      We thank the reviewer for their kind words about AVN’s documentation. We recognize that the GUI’s exclusive availability on Windows is a limitation, and we would be happy to collaborate with other researchers and developers in the future to build a Mac compatible version, should the demand present itself. That said, the python package works on all operating systems, so non-Windows users still have the ability to use AVN that way.

      That is to say, whether AVN is adopted by the field in the medium term will have much more to do with the quality of its maintenance and responsiveness to users than any particular feature, but I believe that many of the analysis recipes that the authors have carefully worked out may find their way into other code and workflows. 

      Second, two notes about new analysis approaches:

      (1) The authors propose a new means of measuring tutor-pupil similarity based on first learning a latent space of syllables via a self-supervised learning (SSL) scheme and then using the earth mover's distance (EMD) to calculate transport costs between the distributions of tutors' and pupils' syllables. While to my knowledge this exact method has not previously been proposed in birdsong, I suspect it is unlikely to differ substantially from the approach of autoencoding followed by MMD used in the Goffinet et al. paper. That is, SSL, like the autoencoder, is a latent space learning approach, and EMD, like MMD, is an integral probability metric that measures discrepancies between two distributions. (Indeed, the two are very closely related: https://stats.stackexchange.com/questions/400180/earth-movers-distance-andmaximum-mean-discrepency.) Without further experiments, it is hard to tell whether these two approaches differ meaningfully. Likewise, while the authors have trained on a large corpus of syllables to define their latent space in a way that generalizes to new birds, it is unclear why such an approach would not work with other latent space learning methods.  

      We recognize the similarities between these approaches and have included comparisons of the VAE and MMD as in the Goffinet paper to our triplet loss model and EMD.  As discussed in the updated results section of the paper and shown in the new Figure 6–figure supplement 1, the Triplet loss model with MMD performs best for evaluating song learning on new birds, not included in model training. We’ve updated the main text of the paper to reflect this switch from EMD to MMD for the primary similarity scoring approach. 

      (2) The authors propose a new method for maturity scoring by training a model (a generalized additive model) to predict the age of the bird based on a selected subset of acoustic features. This is distinct from the "predicted age" approach of Brudner, Pearson, and Mooney, which predicts based on a latent representation rather than specific features, and the GAM nicely segregates the contribution of each. As such, this approach may be preferred by many users who appreciate its interpretability.  

      In summary, my view is that this is a nice paper detailing a well-executed piece of software whose future impact will be determined by the degree of support and maintenance it receives from others over the near and medium term.

      Reviewer #3 (Public Review):

      Summary: 

      The authors invent song and syllable discrimination tasks that they use to train deep networks. They then use these networks as a basis for routine song analysis and song evaluation tasks. For the analysis, they consider both data from their own colony and from another colony the network has not seen during training. They validate the analysis scores of the network against expert human annotators, achieving a correlation of 80-90%.

      Strengths: 

      (1) Robust Validation and Generalizability: The authors demonstrate a good performance of the AVN across various datasets, including individuals exhibiting deviant behavior. This extensive validation underscores the system's usefulness and broad applicability to zebra finch song analysis, establishing it as a potentially valuable tool for researchers in the field.

      (2) Comprehensive and Standardized Feature Analysis: AVN integrates a comprehensive set of interpretable features commonly used in the study of bird songs. By standardizing the feature extraction method, the AVN facilitates comparative research, allowing for consistent interpretation and comparison of vocal behavior across studies.

      (3) Automation and Ease of Use: By being fully automated, the method is straightforward to apply and should pose barely any barrier to adoption by other labs.

      (4) Human experts were recruited to perform extensive annotations (of vocal segments and of song similarity scores). These annotations released as public datasets are potentially very valuable. 

      Weaknesses: 

      (1) Poorly motivated tasks. The approach is poorly motivated and many assumptions come across as arbitrary. For example, the authors implicitly assume that the task of birdsong comparison is best achieved by a system that optimally discriminates between typical, deaf, and isolated songs. Similarly, the authors assume that song development is best tracked using a system that optimally estimates the age of a bird given its song. My issue is that these are fake tasks since clearly, researchers will know whether a bird is an isolated or a deaf bird, and they will also know the age of a bird, so no machine learning is needed to solve these tasks. Yet, the authors imagine that solving these placeholder tasks will somehow help with measuring important aspects of vocal behavior.  

      We appreciate this reviewer’s concerns and apologize for not providing sufficiently clear rationale for the inclusion of our phenotype classifier and age regression models in the original manuscript. These tasks are not intended to be taken as a final, ultimate culmination of the AVN pipeline. Rather, we consider the carefully engineered set of 55 interpretable features to be AVN’s final output, and these analyses serve merely as examples of how that feature set can be applied. That said, each of these models does have valid experimental use cases that we believe are important and would like to bring to the attention of the reviewer.

      For one, we showed how the LDA model that can discriminate between typical, deaf, and isolate birds’ songs not only allows us to evaluate which features are most important for discriminating between these groups, but also allows comparison of the FoxP1 knock-down (FP1 KD) birds to each of these phenotypes. Based on previous work (Garcia-Oscos et al. 2021), we hypothesized that FP1 KD in these birds specifically impaired tutor song memory formation while sparing a bird’s ability to refine their own vocalizations through auditory feedback. Thus, we would expect their songs to resemble those of isolate birds, who lack a tutor song memory, but not to resemble deaf birds who lack a tutor song memory and auditory feedback of their own vocalizations to guide learning. The LDA model allowed us to make this comparison quantitatively for the first time and confirm our hypothesis that FP1 KD birds’ songs are indeed most like isolates’. In the future, as more research groups publish their birds’ AVN feature sets, we hope to be able to make even more fine-grained comparisons between different groups of birds, either using LDA or other similar interpretable classifiers. 

      The age prediction model also has valid real-world use cases. For instance, one might imagine an experimental manipulation that is hypothesized to accelerate or slow song maturation in juvenile birds. This age prediction model could be applied to the AVN feature sets of birds having undergone such a manipulation to determine whether their predicted ages systematically lead or lag their true biological ages, and which song features are most responsible for this difference. We didn’t have access to data for any such birds for inclusion in this paper, but we hope that others in the future will be able to take inspiration from our methodology and use this or a similar age regression model with AVN features in their research. We have added a couple of lines to the ‘Comparing Song Disruptions with AVN Features’ and ‘Tracking Song Development with AVN Features’ sections of the results to make this clearer.

      Along similar lines, authors assume that a good measure of similarity is one that optimally performs repeated syllable detection (i.e. to discriminate same syllable pairs from different pairs). The authors need to explain why they think these placeholder tasks are good and why no better task can be defined that more closely captures what researchers want to measure. Note: the standard tasks for self-supervised learning are next word or masked word prediction, why are these not used here? 

      This reviewer appears to have misunderstood our similarity scoring embedding model and our rationale for using it. We will explain it in more depth here and have added a paragraph to the ‘Measuring Song Imitation’ section of the results explaining this rationale more briefly.

      First, nowhere are we training a model to discriminate between same and different syllable pairs. The triplet loss network is trained to embed syllables in an 8-dimensional space such that syllables with the same label are closer together than syllables with different labels. The loss function is related to the relative distance between embeddings of syllables with the same or different labels, not the classification of syllables as same or different. This approach was chosen because it has repeatedly been shown to be a useful data compression step (Schroff et al. 2015, Thakur et al. 2019) before further downstream tasks are applied on its output, particularly in contexts where there is little data per class (syllable label). For example, Schroff et al. 2015 trained a deep convolutional neural network with triplet loss to embed images of human faces from the same individual closer together than images of different individuals in a 128-dimensional space. They then used this model to compute 128-dimensional representations of additional face images, not included in training, which were used for individual facial recognition (this is a same vs. different category classifier), and facial clustering, achieving better performance than the previous state of the art. The triplet loss function results in a model that can generate useful embeddings of previously unseen categories, like new individuals’ faces, or new zebra finches’ syllables, which can then be used in downstream analyses. This meaningful, lower dimensional space allows comparisons of distributions of syllables across birds, as in Brainard and Mets 2008, and Goffinet et al. 2021.
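
      To make the distinction between an embedding objective and a same/different classifier concrete, here is a minimal sketch of a triplet loss on toy data. It is illustrative only and is not AVN's training code: the function names, the margin of 1.0, the use of unsquared Euclidean distance, and the 2-dimensional toy embeddings are all assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is; otherwise it is positive.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-D embeddings (the manuscript describes 8 dimensions).
anchor = [0.0, 0.0]    # a syllable labeled 'a'
positive = [0.0, 0.5]  # another rendition labeled 'a'
negative = [3.0, 4.0]  # a syllable labeled 'b'

# Same-label syllable is close, different-label syllable is far: no penalty.
well_separated = triplet_loss(anchor, positive, negative)

# Swapping the roles simulates a poor embedding: large penalty.
poorly_separated = triplet_loss(anchor, negative, positive)
```

      The loss depends only on relative distances between embeddings. No same/different decision is ever produced, which is why a classification accuracy is not a natural readout for this kind of model.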

      Next word and masked word prediction are indeed common self-supervised learning tasks for models working with text data, or other data with meaningful sequential organization. That is not the case for our zebra finch syllables, where every bird’s syllable sequence depends only on its tutor’s sequence, and there is no evidence for strong universal syllable sequencing rules (James et al. 2020). Rather, our embedding model is an example of a computer vision task, as it deals with sets of two-dimensional images (spectrograms), not sequences of categorical variables (like text). It is also not, strictly speaking, a self-supervised learning task, as it does require syllable labels to generate the triplets. A common self-supervised approach for dimensionality reduction in a computer vision task such as this one would be to train an autoencoder to compress images to a lower dimensional space, then faithfully reconstruct them from the compressed representation. This has been done using a variational autoencoder trained on zebra finch syllables in Goffinet et al. 2021. In keeping with the suggestions from reviewers #1 and #2, we have included a comparison of our triplet loss model with the Goffinet et al. VAE approach in the revised manuscript.

      (2) The machine learning methodology lacks rigor. The aims of the machine learning pipeline are extremely vague and keep changing like a moving target. Mainly, the deep networks are trained on some tasks but then authors evaluate their performance on different, disconnected tasks. For example, they train both the birdsong comparison method (L263+) and the song similarity method (L318+) on classification tasks. However, they evaluate the former method (LDA) on classification accuracy, but the latter (8-dim embeddings) using a contrast index. In machine learning, usually, a useful task is first defined, then the system is trained on it and then tested on a held-out dataset. If the sensitivity index is important, why does it not serve as a cost function for training?

      Again, this reviewer seems not to understand our similarity scoring methodology. Our similarity scoring model is not trained on a classification task, but rather on an embedding task. It learns to embed spectrograms of syllables in an 8-dimensional space such that syllables with the same label are closer together than syllables with different labels. We could report the loss values for this embedding task on our training and validation datasets, but these wouldn’t have any clear relevance to the downstream task of syllable distribution comparison where we are using the model’s embeddings. We report the contrast index as this has direct relevance to the actual application of the model and allows comparisons to other similarity scoring methods, something that the triplet loss values wouldn’t allow. 

      The triplet loss method was chosen because it has been shown to yield useful low-dimensional representations of data, even in cases where there is limited labeled training data (Thakur et al. 2019). While we have one of the largest manually annotated datasets of zebra finch songs, it is still quite small by industry deep learning standards, which is why we chose a method that would perform well given the size of our dataset. Training a model on a contrast index directly would be extremely computationally intensive and require many more pairs of birds with known relationships than we currently have access to. It could be an interesting approach to take in the future, but one that would be unlikely to perform well with a dataset size typical to songbird research. 

      Also, usually, in solid machine learning work, diverse methods are compared against each other to identify their relative strengths. The paper contains almost none of this, e.g. authors examined only one clustering method (HDBSCAN).  

      We did compare multiple methods for syllable segmentation (WhisperSeg, TweetyNet, and amplitude thresholding) as this hadn’t been done previously. We chose not to perform an extensive comparison of different clustering methods as Sainburg et al. 2020 already did so and we felt no need to duplicate this effort. We encourage this reviewer to refer to Sainburg et al.’s excellent work for comparisons of multiple clustering methods applied to zebra finch song syllables.

      (3) Performance issues. The authors want to 'simplify large-scale behavioral analysis' but it seems they want to do that at a high cost. (Gu et al 2023) achieved syllable scores above 0.99 for adults, which is much larger than the average score of 0.88 achieved here (L121). Similarly, the syllable scores in (Cohen et al 2022) are above 94% (their error rates are below 6%, albeit in Bengalese finches, not zebra finches), which is also better than here. Why is the performance of AVN so low? The low scores of AVN argue in favor of some human labeling and training on each bird.  

      Firstly, the syllable error rate scores reported in Cohen et al. 2022 are calculated very differently than the F1 scores we report here and are based on a model trained with data from the same bird as was used in testing, unlike our more general segmentation approach where the model was tested on different birds than were used in training. Thus, the scores reported in Cohen et al. and the F1 scores that we report cannot be compared. 

      The discrepancy between the F1<sub>seg</sub> scores reported in Gu et al. 2023 and the segmentation F1 scores that we report is likely due to differences in the underlying datasets. Our UTSW recordings tend to have higher levels of both stationary and non-stationary background noise, which make segmentation more challenging. The recordings from Rockefeller were less contaminated by background noise, and they resulted in slightly higher F1 scores. That said, we believe that the primary factor accounting for this difference in scores with Gu et al. 2023 is the granularity of our ‘ground truth’ syllable segments. In our case, whenever there was any ambiguity as to whether vocal elements should be segmented into two short syllables with a very short gap between them or merged into a single longer syllable, we chose to split them. WhisperSeg had a strong tendency to merge the vocal elements in ambiguous cases such as these. This results in a higher rate of false negative syllable onset detections, reflected in the low recall scores achieved by WhisperSeg (see Figure 2–figure supplement 1b), but still very high precision scores (Figure 2–figure supplement 1a). While WhisperSeg did frequently merge these syllables in a way that differed from our ground truth segmentation, it did so consistently, meaning it had little impact on downstream measures of syntax entropy (Figure 3c) or syllable duration entropy (Figure 3–figure supplement 2a). It is for that reason that, despite a lower F1 score, we still consider AVN’s automatically generated annotations to be sufficiently accurate for downstream analyses.
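
      The effect of merging on recall versus precision can be illustrated with a small worked example. The sketch below is not the scoring code used in the manuscript: the 10 ms tolerance, the greedy one-to-one matching rule, the function name, and the onset times are all hypothetical choices made for illustration.

```python
def onset_scores(true_onsets, predicted_onsets, tolerance=0.01):
    """Precision, recall, and F1 for syllable onset detection.

    A predicted onset counts as a true positive if it lies within
    `tolerance` seconds of a ground-truth onset that has not already
    been matched to another prediction.
    """
    unmatched = sorted(true_onsets)
    true_positives = 0
    for onset in sorted(predicted_onsets):
        match = next((t for t in unmatched if abs(t - onset) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    precision = true_positives / len(predicted_onsets) if predicted_onsets else 0.0
    recall = true_positives / len(true_onsets) if true_onsets else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1


# The ground truth splits two vocal elements separated by a very short gap
# (onsets at 0.30 s and 0.36 s); the hypothetical prediction merges them
# into one syllable and therefore misses the onset at 0.36 s.
true_onsets = [0.10, 0.30, 0.36, 0.60]
predicted_onsets = [0.10, 0.30, 0.60]

precision, recall, f1 = onset_scores(true_onsets, predicted_onsets)
```

      In this toy case every predicted onset is correct, so precision is unaffected, while the single merged syllable produces one false negative and lowers recall and F1.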

      Should researchers require a higher degree of accuracy and precision with their annotations (for example, to detect very subtle changes in song before and after an acute manipulation) we suggest they turn toward one of the existing tools for supervised song annotation, such as TweetyNet.

      (4) Texas bias. It is true that comparability across datasets is enhanced when everyone uses the same code. However, the authors' proposal essentially is to replace the bias between labs with a bias towards birds in Texas. The comparison with Rockefeller birds is nice, but it amounts to merely N=1. If birds in Japanese or European labs have evolved different song repertoires, the AVN might not capture the associated song features in these labs well.  

      We appreciate the reviewer’s concern about a bias toward birds from the UTSW colony. However, this paper shows that despite training (for the similarity scoring) and hyperparameter fitting (for the HDBSCAN clustering) on the UTSW birds, AVN performs as well, if not better, on birds from Rockefeller as on birds from UTSW. To our knowledge, there are no publicly available datasets of annotated zebra finch songs from labs in Europe or in Asia, but we would be happy to validate AVN on such datasets, should they become available. Furthermore, there is no evidence to suggest that there is dramatic drift in zebra finch vocal repertoire between continents which would necessitate such additional validation. While we didn’t have manual annotations for this dataset (which would allow validation of our segmentation and labeling methods), we did apply AVN to recordings shared with us by the Wada lab in Japan, where visual inspection of the resulting annotations suggested comparable accuracy to the UTSW and Rockefeller datasets.

      (5) The paper lacks an analysis of the balance between labor requirement, generalizability, and optimal performance. For tasks such as segmentation and labeling, fine-tuning for each new dataset could potentially enhance the model's accuracy and performance without compromising comparability. E.g. How many hours does it take to annotate hundred song motifs? How much would the performance of AVN increase if the network were to be retrained on these? The paper should be written in more neutral terms, letting researchers reach their own conclusions about how much manual labor they want to put into their data.  

      With standardization and ease of use in mind, we designed AVN specifically to perform fully automated syllable annotation and downstream feature calculations. We believe that we have demonstrated in this manuscript that our fully automated approach is sufficiently reliable for downstream analyses across multiple zebra finch colonies. That said, if researchers require an even higher degree of annotation precision and accuracy, they can turn toward one of the existing methods for supervised song annotation, such as TweetyNet. Incorporating human annotations for each bird processed by AVN is likely to improve its performance, but this would require significant changes to AVN’s methodology, and is outside the scope of our current efforts.

      (6) Full automation may not be everyone's wish. For example, given the highly stereotyped zebra finch songs, it is conceivable that some syllables are consistently mis-segmented or misclassified. Researchers may want to be able to correct such errors, which essentially amounts to fine-tuning AVN. Conceivably, researchers may want to retrain a network like the AVN on their own birds, to obtain a more fine-grained discriminative method.  

      Other methods exist for supervised or human-in-the-loop annotation of zebra finch songs, such as TweetyNet and DAN (Alam et al. 2023). We invite researchers who require a higher degree of accuracy than AVN can provide to explore these alternative approaches for song annotation. Incorporating human feedback into AVN was never the goal of our pipeline, would require significant changes to AVN’s design and is outside the scope of this manuscript.

      (7) The analysis is restricted to song syllables and fails to include calls. No rationale is given for the omission of calls. Also, it is not clear how the analysis deals with repeated syllables in a motif, whether they are treated as two-syllable types or one.  

      It is true that we don’t currently have any dedicated features to describe calls. This could be a useful addition to AVN in the future. 

      What a human expert inspecting a spectrogram would typically call ‘repeated syllables’ in a bout are almost always assigned the same syllable label by the UMAP+HDBSCAN clustering. The syntax analysis module includes features examining the rate of syllable repetitions across syllable types, as mentioned in lines 222-226 of the revised manuscript. See https://avn.readthedocs.io/en/latest/syntax_analysis_demo.html#Syllable-Repetitions for further details.

      (8) It seems not all human annotations have been released and the instruction sets given to experts (how to segment syllables and score songs) are not disclosed. It may well be that the differences in performance between (Gu et al 2023) and (Cohen et al 2022) are due to differences in segmentation tasks, which is why these tasks given to experts need to be clearly spelled out. Also, the downloadable files contain merely labels but no identifier of the expert. The data should be released in such a way that lets other labs adopt their labeling method and cross-check their own labeling accuracy.  

      All human annotations used in this manuscript have indeed been released as part of the accompanying dataset. Syllable annotations are not provided for all pupils and tutors used to validate the similarity scoring, as annotations are not necessary for similarity comparisons. We have expanded our description of our annotation guidelines in the methods section of the revised manuscript. All the annotations were generated by one of two annotators. The second annotator always consulted with the first annotator in cases of ambiguous syllable segmentation or labeling, to ensure that they had consistent annotation styles. Unfortunately, we haven’t retained records about which birds were annotated by which of the two annotators, so we cannot share this information along with the dataset. The data is currently available in a format that should allow other research groups to use our annotations either to train their own annotation systems or check the performance of their existing systems on our annotations.  

      (9) The failure modes are not described. What segmentation errors did they encounter, and what syllable classification errors? It is important to describe the errors to be expected when using the method. 

      As we discussed in our response to this reviewer’s point (3), WhisperSeg tends to merge syllables when the gap between them is very short, which explains its lower recall score compared to its precision on our dataset (Figure 2–figure supplement 1). In rare cases, WhisperSeg also fails to recognize syllables entirely, again impacting its recall score. TweetyNet hardly ever completely ignores syllables, but it does tend to occasionally merge syllables together or over-segment them. Whereas WhisperSeg does this very consistently for the same syllable types within the same bird, TweetyNet merges or splits syllables more inconsistently. This inconsistent merging and splitting has a larger effect on syllable labeling, as manifested in the lower clustering v-measure scores we obtain with TweetyNet compared to WhisperSeg segmentations. TweetyNet also has much lower precision than WhisperSeg, largely because TweetyNet often recognizes background noises (like wing flaps or hopping) as syllables whereas WhisperSeg hardly ever segments non-vocal sounds.

      Many errors in syllable labeling stem from differences in syllable segmentation. For example, if two syllables with labels ‘a’ and ‘b’ in the manual annotation are sometimes segmented as two syllables, but sometimes merged into a single syllable, the clustering is likely to find 3 different syllable types: one corresponding to ‘a’, one corresponding to ‘b’, and one corresponding to ‘ab’ merged. Because of how we align syllables across segmentation schemes for the v-measure calculation, this will look like syllable ‘b’ always has a consistent cluster label (or is missing a label entirely), but syllable ‘a’ can carry two different cluster labels, depending on the segmentation. In certain cases, even in the absence of segmentation errors, a group of syllables bearing the same manual annotation label may be split into 2 or 3 clusters (it is extremely rare for a single manual annotation group to be split into more than 3 clusters). In these cases, it is difficult to conclusively say whether the clustering represents an error, or if it actually captured some meaningful systematic difference between syllables that was missed by the annotator. Finally, sometimes rare syllable types with their own distinct labels in the manual annotation are merged into a single cluster. Most labeling errors can be explained by this kind of merging or splitting of groups relative to the manual annotation, not by occasional misclassifications of one manual label type as another.
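
      As a rough illustration of how a split cluster shows up in the v-measure, the sketch below computes homogeneity, completeness, and their harmonic mean from paired label lists using the standard entropy-based definitions. The toy labels are hypothetical, and the alignment of syllables across segmentation schemes described above is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given), computed from two paired label lists."""
    n = len(targets)
    total = 0.0
    for value, count in Counter(given).items():
        subset = [t for t, g in zip(targets, given) if g == value]
        total += (count / n) * entropy(subset)
    return total


def v_measure(manual, clusters):
    """Return (homogeneity, completeness, v-measure)."""
    h_manual, h_clusters = entropy(manual), entropy(clusters)
    homogeneity = 1.0 if h_manual == 0 else 1 - conditional_entropy(manual, clusters) / h_manual
    completeness = 1.0 if h_clusters == 0 else 1 - conditional_entropy(clusters, manual) / h_clusters
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Hypothetical bird: manual label 'a' is split across clusters 0 and 1,
# while manual label 'b' maps cleanly onto cluster 2.
manual = ["a", "a", "a", "a", "b", "b", "b", "b"]
clusters = [0, 0, 1, 1, 2, 2, 2, 2]

homogeneity, completeness, v = v_measure(manual, clusters)
```

      Here every cluster contains a single manual label, so homogeneity stays at 1, while the split of ‘a’ lowers completeness and therefore the v-measure. Merging two manual labels into one cluster would have the opposite signature.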

      For examples of these types of errors, we encourage this reviewer and readers to refer to the example confusion matrices in figure 2f and Figure 2–figure supplement 3b&e. We also added two paragraphs to the end of the ‘Accurate, fully unsupervised syllable labeling’ section of the Results in the revised manuscript. 

      (10) Usage of Different Dimensionality Reduction Methods: The pipeline uses two different dimensionality reduction techniques for labeling and similarity comparison - both based on the understanding of the distribution of data in lower-dimensional spaces. However, the reasons for choosing different methods for different tasks are not articulated, nor is there a comparison of their efficacy.  

      We apologize for not making this distinction sufficiently clear in the manuscript and have added a paragraph to the ‘Measuring Song Imitation’ section of the Results explaining the rationale for using an embedding model for similarity scoring.

      We chose to use UMAP for syllable labeling because it is a common embedding methodology to precede hierarchical clustering and has been shown to result in reliable syllable labels for birdsong in the past (Sainburg et al. 2020). However, it is not appropriate for similarity scoring, because comparing EMD or MMD scores between birds requires that all the birds’ syllable distributions exist within the same shared embedding space. This can be achieved by using the same triplet loss-trained neural network model to embed syllables from all birds. This cannot be achieved with UMAP because all birds whose scores are being compared would need to be embedded in the same UMAP space, as distances between points cannot be compared across UMAPs. In practice, this would mean that every time a new tutor-pupil pair needs to be scored, their syllables would need to be added to a matrix with all previously compared birds’ syllables, a new UMAP would need to be computed, and new EMD or MMD scores between all bird pairs would need to be calculated using their new UMAP embeddings. This is very computationally expensive and quickly becomes unfeasible without dedicated high-performance computing infrastructure. It also means that similarity scores couldn’t be compared across papers without recomputing everything each time, whereas EMD and MMD scores obtained with triplet loss embeddings can be compared, provided they use the same trained model (which we provide as part of AVN) to embed their syllables in a common latent space.
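
      The requirement for a shared embedding space can be seen in a minimal sketch of an MMD calculation between two birds' syllable embeddings. This is not AVN's implementation: the RBF kernel, the bandwidth of 1.0, the biased estimator, the function names, and the 2-dimensional toy embeddings are assumptions chosen for illustration.

```python
import math


def rbf_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two embedding vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2 * mean_kernel(xs, ys, bandwidth)
    )


# Hypothetical embeddings, all produced by the same fixed embedding model.
tutor = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
good_pupil = [[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]]
poor_pupil = [[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]

score_good = mmd_squared(tutor, good_pupil)  # small: distributions overlap
score_poor = mmd_squared(tutor, poor_pupil)  # large: distributions are far apart
```

      The scores are meaningful only because every vector lives in the same coordinate system. If each pair of birds were embedded with its own freshly fitted projection, the coordinates, and hence the kernel values, would not be comparable from one pair to the next.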

      (11) Reproducibility: are the measurements reproducible? Systems like UMAP always find a new embedding given some fixed input, so the output tends to fluctuate.

      There is indeed a stochastic element to UMAP embeddings which will result in different embeddings and therefore different syllable labels across repeated runs with the same input. We observed that v-measure scores were quite consistent within birds across repeated runs of the UMAP, and have added an additional supplementary figure to the revised manuscript showing this (Figure 2–figure supplement 4).

      Reviewer #1 (Recommendations For The Authors):

      (1) Benchmark their similarity score to the method used by Goffinet et al, 2021 from the Pearson group. Such a comparison would be really interesting and useful.  

      This has been added to the paper. 

      (2) Please clarify exactly what is new and what is applied from existing methods to help the reader see the novelty of the paper.  

      We have added more emphasis on the novel aspects of our pipeline to the paper’s introduction. 

      Minor:

      It's unclear if AVN is appropriate as the paper deals only with zebra finch song - the scope is more limited than advertised.

      We assume this is in reference to ‘Birdsong’ in the paper’s title and ‘Avian’ in Avian Vocalization Network. There is a brief discussion of how these methods are likely to perform on other commonly studied songbird species at the end of the discussion section.

      Reviewer #2 (Recommendations For The Authors):

      A few points for the authors to consider that might strengthen or inform the paper:

      (1) In the public review, I detailed some ways in which the SSL+EMD approach is unlikely to be appreciably distinct from the VAE+MMD approach -- in fact, one could mix and match here. It would strengthen the authors' claim if they showed via experiments that their method outperforms VAE+MMD, but in the absence of that, a discussion of the relation between the two is probably warranted.  

      This comparison has been added to the paper.

      (2) ll. 305-310: This loss of accuracy near the edge is expected on general Bayesian grounds. Any regression approach should learn to estimate the conditional mean of the age distribution given the data, so ages estimated from data will be pulled inward toward the location of most training data. This bias is somewhat mitigated in the Brudner paper by a more flexible model, but it's a general (and expected) feature of the approach.

      (3) While the online AVA documentation looks good, it might benefit from a page on design philosophy that lays out how the various modules fit together - something between the tutorials and the nitty-gritty API. That way, users would be able to get a sense of where they should look if they want to harness pieces of functionality beyond the tutorials.

      Thank you for this suggestion. We will add a page on AVN’s design philosophy to the online documentation. 

      (4) While the manuscript does compare AVN to packages like TweetyNet and AVA that share some functionality, it doesn't really mention what's been going on with the vocalpy ecosystem, where the maintainers have been doing a lot to standardize data processing, integrate tools, etc. I would suggest a few words about how AVN might integrate with these efforts.

      We thank the reviewer for this suggestion.

      (5) ll. 333-336: It would be helpful to provide a citation to some of the self-supervised learning literature this procedure is based on. Some citations are provided in methods, but the general approach is worth citing, in my opinion. 

      We have added a paragraph to the results section with more background on self-supervised learning for dimensionality reduction, particularly in the context of similarity scoring.

      (6) One software concern for medium-term maintenance: AVN docs say to use Python 3.8, and GitHub says the package is 3.9 compatible. I also saw in the toml file that 3.10 and above are not supported. It's worth noting that Python 3.9 reaches its end of life in October 2025, so some dependencies may have to be altered or changed for the package to be viable going forward.  

      Thank you for this comment. We will continue to maintain AVN and update its dependencies as needed.

      Minor points:

      (1) It might be good to note that WhisperSeg is a different install from AVN. May be hard for novice users, though there's a web interface that's available. 

      We’ve added a line to the methods section making this clear. 

      (2) Figure 6b: Some text in the y-axis labels is overlapping here. 

      This has been fixed. Thank you for bringing it to our attention. 

      (3) The name of the Python language is always capitalized.  

      We’ve fixed this capitalization error throughout the manuscript. Thank you.

      Reviewer #3 (Recommendations For The Authors):

      (1) I recommend that the authors improve the motivation of the chosen tasks and data or choose new tasks that more clearly speak to the optimizations they want to perform. 

      We have included more details about the motivation for our LDA classification analysis, age prediction model and embedding model for similarity scoring in the results of the revised manuscript, as discussed in more detail in the above responses to this reviewer. Thank you for these suggestions. 

      (2) They need to rigorously report the (classification) scores on the test datasets: these are the scores associated with the cost function used during training.  

      Based on this reviewer’s ‘Weaknesses: 3’ comment in the public reviews, we believe that they are referring to a classification score for the triplet loss model. As we explained in response to that comment, this is not a classification task, therefore there is no classification score to report. The loss function used to train the model was a triplet loss function. While we could report these values, they are not informative for how well this approach would perform in a similarity scoring context, as explained above. As such, we prefer to include contrast index and tutor contrast index scores to compare the models’ performance for similarity scoring, as these are directly relevant to the task and are established in the field for said task.

      (3) They need to explain the reasons for the poor performance (or report on the inconsistencies with previous work) and why they prefer a fully automated system rather than one that needs some fine-tuning on bird-specific data.

      We’ve addressed this comment in the public response to this reviewer’s weakness points 3, 5, and 6. 

      (4) They should consider applying their method to data from Japanese and European labs.  

      We’ve addressed this comment in the public response to this reviewer’s weakness point 4.

      (5) They need to document the failure modes and report all details about the human annotations.

      We’ve added additional description of the failure modes for our segmentation and labeling approaches in the results section of the revised manuscript.

      Details: 

      The introduction is very vague, it fails to make a clear case of what the problem is and what the approach is. It reads a bit like an advertisement for machine learning: we are given a hammer and are looking for a nail.  

      We thank the reviewer for this viewpoint; however, we disagree and have decided to keep our Introduction largely unchanged. 

      L46 That interpretability is needed to maximize the benefits of machine learning is wrong, see self-driving cars and chat GPT.  

      This line states that ‘To truly maximize the benefits of machine learning and deep learning methods for behavior analysis, their power must be balanced with interpretability and generalizability’. We firmly believe that interpretability is critically important when using machine learning tools to gain a deeper scientific understanding of data, including animal behavior data in a neuroscience context. We believe that the introduction and discussion of this paper already provide strong evidence for this claim. 

      L64 What about zebra finches that repeat a syllable in the motif, how are repetitions dealt with by AVN?  

      This is already described in the results section in lines 222-226, and in the methods in the ‘Syntax Features: Repetition Bouts’ section.

      L107 Say a bit more here, what exactly has been annotated?  

      We’ve added a sentence in the introduction to clarify this (lines 113-115).

      L112 Define spectrogram frames. Do these always fully or sometimes partially contain a vocalization? 

      Spectrogram frames are individual time bins used to compute the spectrogram using a short-time Fourier transform. As described in the ‘Methods: Labeling: UMAP Dimensionality Reduction’ section, our spectrograms are computed using ‘The short term Fourier transform of the normalized audio for each syllable […] with a window length of 512 samples and a hop length of 128 samples’. Given that the song files have a standard sampling rate of 44.1 kHz, this means each time bin represents 11.6 ms of song data, with successive frames advancing in time by 2.9 ms. Each frame therefore contains only a small fraction of a vocalization.
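
      The frame durations quoted here follow directly from the window length, hop length, and sampling rate. The short check below reproduces the arithmetic; the 100 ms syllable duration and the no-padding frame count are assumptions added purely for illustration.

```python
sample_rate_hz = 44_100  # sampling rate of the song files
window_length = 512      # samples per STFT window
hop_length = 128         # samples between successive windows

# Duration covered by one frame and the time step between frames.
window_ms = 1000 * window_length / sample_rate_hz
hop_ms = 1000 * hop_length / sample_rate_hz

# Hypothetical 100 ms syllable, framed without any padding.
syllable_samples = int(0.100 * sample_rate_hz)
n_frames = 1 + (syllable_samples - window_length) // hop_length

print(f"window: {window_ms:.1f} ms, hop: {hop_ms:.1f} ms, frames: {n_frames}")
```

      Under these assumptions a single 100 ms syllable spans a few dozen overlapping frames, which is why an individual frame captures only a small slice of any vocalization.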

      L122 The reported TweetyNet score of 0.824 is lower than the one reported in Figure 2a.  

      The center line in the box plot in Figure 2a represents the median of the distribution of TweetyNet v-measure scores. Given that there are a couple of outlying birds with very low scores, the mean (0.824 as reported in the text of the results section) is lower than the median. This is not an error.
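
      A toy example shows how a single low outlier pulls the mean below the median. The scores below are invented for illustration and are not the per-bird values from Figure 2a.

```python
import statistics

# Hypothetical per-bird v-measure scores with one low outlier.
scores = [0.90, 0.91, 0.89, 0.92, 0.88, 0.35]

mean_score = statistics.mean(scores)      # dragged down by the outlier
median_score = statistics.median(scores)  # robust to the outlier

print(f"mean = {mean_score:.3f}, median = {median_score:.3f}")
```

      A box plot's center line tracks the second quantity, while a mean reported in the text tracks the first, so the two can legitimately differ.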

      L155 Some of the differences in performance are very small, reporting of the P value might be necessary. 

      These methods are unlikely to statistically significantly differ in their validation scores. This doesn’t mean that we cannot use the mean/median values reported to justify favoring one method over another. This is why we’ve chosen not to report p-values here.

      L161 The authors have not really tested more than a single clustering method, failing to show a serious attempt to achieve good performance.  

      We’ve addressed this comment in the public response to this reviewer’s weakness point 2.

      L186 Did isolate birds produce stereotyped syllables that can be clustered? 

      Yes, they did. The validation for clustering of isolate bird songs can be found in Figure 2–figure supplement 4. 

      Fig. 3e: How were the multiple bouts aligned?

      This is described in lines 857-876 in the ‘Methods: Song Timing Features: Rhythm Spectrograms’ section of the paper.

      L199 There is a space missing in front of (n=8).  

      Thank you for bringing this to our attention. It’s been corrected in the updated manuscript. 

      L268 Define classification accuracy.  

      We’ve added a sentence in lines 953-954 of the methods section defining classification accuracy. 

      L325 How many motifs need to be identified, why does this need to be done manually? There are semiautomated methods that can allow scaling, these should be  cited here. Also, the mention of bias here should be removed in favor of a more extensive discussion on the experimenter bias (traditionally vs Texas bias (in this paper).  

      All of the methods cited in this line have graphical user interfaces that require users to select a file containing song and manually highlight the start and end of each motif to be compared. The exact number of motifs required varies depending on the specific context (e.g. more examples are needed to detect more subtle differences or changes in song similarity) but it is fairly standard for reviewers to score 30 – 100 pairs of motifs. 

      We’ve discussed the tradeoffs between full automation and supervised or human-in-the-loop methods in response to this reviewer’s public comment ‘weakness #5 and 6’. Briefly, AVN’s aim is to standardize song analysis, to allow direct comparisons between song features and similarity scores across research groups. We believe, as explained in the paper, that this can be best achieved by having different research groups use the same deep learning models, which perform consistently well across those groups. Introducing semi-automated methods would defeat this benefit of AVN. 

      We’ve also addressed the question of ‘Texas bias’ in response to this reviewer’s public comment ‘Weakness #4’. 

      L340 How is EMD applied? Syllables are points in 8-dim space, but now suddenly authors talk about distributions without explaining how they got from points to distributions. Same in L925.  

      We apologize for the confusion here. The syllable points in the 8-d space are collectively an empirical distribution, not a probability distribution. We referred to them simply as ‘distributions’ to limit technical jargon in the results of the paper, but have changed this to more precise language in the revised manuscript.
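      For readers unfamiliar with the term, the standard textbook way to treat a point cloud as an empirical distribution, and to compare two such clouds with the EMD, can be written as follows. This is a generic sketch that assumes uniform weights on the points and a Euclidean ground distance; the manuscript's exact formulation may differ.

```latex
% Empirical distributions of two sets of syllable embeddings
% x_1, ..., x_n and y_1, ..., y_m in R^8 (uniform weights assumed).
\hat{P}_X = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i},
\qquad
\hat{P}_Y = \frac{1}{m} \sum_{j=1}^{m} \delta_{y_j}

% Earth mover's distance: the cheapest transport plan \gamma that moves
% the mass of \hat{P}_X onto \hat{P}_Y (Euclidean ground distance assumed).
\mathrm{EMD}\bigl(\hat{P}_X, \hat{P}_Y\bigr)
  = \min_{\gamma_{ij} \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{m}
    \gamma_{ij} \, \lVert x_i - y_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{m} \gamma_{ij} = \frac{1}{n},
\qquad
\sum_{i=1}^{n} \gamma_{ij} = \frac{1}{m}
```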

      L351 Why do authors now use 'contrast index' to measure performance and no longer 'classification accuracy'?  

      We’ve addressed this comment in the public response to this reviewer’s weakness points 1 and 2.

      Figure 6 What is the confusion matrix, i.e. how well can the model identify pupil-pupil pairings from pupil-tutor and from pupil-unrelated pairings? I guess that would amount to something like classification accuracy.  

      There is no model classifying comparisons as pupil-pupil vs. pupil-tutor etc. These comparisons exist only to show the behavior of the similarity scoring approach, which consists of a dissimilarity measure (MMD or EMD) applied to low dimensional representations of syllables generated by the triplet loss model or VAE. This was clarified further in our public response to this reviewer’s weakness points 1 and 2. 

      L487 What are 'song files', and what do they contain?   

      ‘Song files’ are .wav files containing recordings of zebra finch song. They typically contain a single song bout, but they can include multiple song bouts if they are produced close together, or incomplete song bouts if the introductory notes were very soft or the bouts were very long (>30s from the start of the file). Details of these recordings are provided in the ‘Methods: Data Acquisition: UTSW Dataset’ section of the manuscript.

      L497 Calls were only labelled for tweetynet but not for other tasks.  

      That is correct. The rationale for this is provided in the ‘Methods: Manual Song Annotation’ section of the manuscript. 

      L637 There is a contradiction (can something be assigned to the 'own manual annotation category' when the same sentence states that this is done 'without manual annotation'?) 

      We believe there is confusion here between automated annotation and validation. Any bird can be automatically annotated without the need for any existing manual annotations for that individual bird. However, manual labels are required to compare automatically generated annotations against for validation of the method.

      L970 Spectrograms of what? (what is the beginning of a song bout, L972). 

      The beginning of a song bout is the first introductory note produced by a bird after a period without vocalizations. This is standard.

    1. Reviewer #3 (Public review):

      Summary:

      The aim of this study was to investigate the temporal progression of the neural response to event boundaries in relation to uncertainty and error. Specifically, the authors asked (1) how neural activity changes before and after event boundaries, (2) if uncertainty and error both contribute to explaining the occurrence of event boundaries, and (3) if uncertainty and error have unique contributions to explaining the temporal progression of neural activity.

      Strengths:

      One strength of this paper is that it builds on an already validated computational model. It relies on straightforward and interpretable analysis techniques to answer the main question, with a smart combination of pattern similarity metrics and FIR. This combination of methods may also be an inspiration to other researchers in the field working on similar questions. The paper is well written and easy to follow. The paper convincingly shows that (1) there is a temporal progression of neural activity change before and after an event boundary, and (2) event boundaries are predicted best by the combination of uncertainty and error signals.

      Weaknesses:

      Regarding question 3, I am less convinced by the results. They show that overlapping but somewhat distinct sets of brain regions relate to uncertainty and error boundaries over time. And that some regions show distinct patterns of temporal progressions in pattern change with both types of boundaries. However, most of the effects they observe in this analysis may still be driven by shared variance, as suggested by the results in Figure 6 and the high correlation between the two boundary time series. More specific comments are provided below.

      Impact:

      If these comments can be addressed sufficiently, I expect that this work will impact the field in its thinking on what drives event boundaries and spur interest in understanding the mechanisms behind the temporal progression of neural activity around these boundaries.

      Comments

      (1) The current analysis of the neural data does not convincingly show that uncertainty and prediction error both contribute to the neural responses. As both terms are modelled in separate FIR models, it may be that the responses we see for both are mostly driven by shared variance. Given that the correlation between the two is very high (r=0.49), this seems likely. The strong overlap in the neural responses elicited by both, as shown in Figure 6, also suggests that what we see may mainly be shared variance. To improve the interpretability of these effects, I think it is essential to know whether uncertainty and error explain similar or unique parts of the variance. The observation that they have distinct temporal profiles is suggestive of some dissociation, but not as convincing as adding them both to a single model.

      (2) The results for uncertainty and error show that uncertainty has strong effects before or at boundary onset, while error is related to more stabilization after boundary onset. This makes me wonder about the temporal contribution of each of these. Could it be the case that increases in uncertainty are early indicators of a boundary, and errors tend to occur later?

      (3) Given that there is a 24-second period during which the neural responses are shaped by event boundaries, it would be important to know more about the average distance between boundaries and the variability of this distance. This will help establish whether the FIR model can properly capture a return to baseline.

      (4) Given that there is an early onset and long-lasting response of the brain to these event boundaries, I wonder what causes this. Is it the case that uncertainty or errors already increase at 12 seconds before the boundaries occur? Or are there other markers in the movie that the brain can use to foreshadow an event boundary? And if uncertainty or errors do increase already 12 seconds before an event boundary, do you see a similar neural response at moments with similar levels of error or uncertainty, which are not followed by a boundary? This would reveal whether the neural activity patterns are specific to event boundaries or whether these are general markers of error and uncertainty.

      (5) It is known that different brain regions have different delays of their BOLD response. Could these delays contribute to the propagation of the neural activity across different brain areas in this study?

      (6) In the FIR plots, timepoints -12, 0, and 12 are shown. These long intervals preclude an understanding of the full temporal progression of these effects.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This study investigates how collective navigation improvements arise in homing pigeons. Building on the Sasaki & Biro (2017) experiment on homing pigeons, the authors use simulations to test seven candidate social learning strategies of varying cognitive complexity, ranging from simple route averaging to potentially cognitively demanding selective propagation of superior routes. They show that only the simplest strategy, equal route averaging, quantitatively matches the experimental data in both route efficiency and social weighting. More complex strategies, while potentially more effective, fail to align with the observed data. The authors also introduce the concept of "effective group size," showing that the chaining design leads to a strong dilution of earlier individuals' contributions. Overall, they conclude that cognitive simplicity rather than cumulative cultural evolution explains collective route improvements in pigeons.
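      The dilution effect mentioned in this summary can be illustrated with a toy calculation. The sketch below assumes a chain in which each generation's route is the equal average of the inherited route and one newcomer's route, and it measures effective group size as the inverse sum of squared weights. Both the chain model and that definition are illustrative assumptions, not necessarily the ones used in the manuscript.

```python
from fractions import Fraction


def chain_weights(n_birds: int) -> list:
    """Weight of each bird's own route in the final route of the chain.

    Assumption: generation 1 is a single bird; every later generation
    averages the inherited route with one newcomer's route (50/50).
    """
    weights = [Fraction(1)]
    for _ in range(n_birds - 1):
        # Averaging halves every earlier contribution and gives the
        # newcomer a weight of one half.
        weights = [w / 2 for w in weights] + [Fraction(1, 2)]
    return weights


def effective_group_size(weights: list) -> Fraction:
    """Inverse sum of squared weights (an assumed, illustrative definition)."""
    return 1 / sum(w * w for w in weights)


weights = chain_weights(5)
n_eff = effective_group_size(weights)

print("weights:", [str(w) for w in weights])
print("effective group size:", float(n_eff))
```

      Under these assumptions, the first bird in a five-bird chain contributes only 1/16 of the final route, so the chain behaves like a group of roughly three equally weighted birds rather than five.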

      Strengths:

      The manuscript addresses an important question and provides a compelling argument that a simpler hypothesis is necessary and sufficient to explain findings of a recent influential study on pigeon route improvements, via a rigorous systematic comparison of seven alternative hypotheses. The authors should be commended for their willingness to critically re-examine established interpretations. The introduction and discussion are broad and link pigeon navigation to general debates on social learning, wisdom of crowds, and CCE.

      We thank the reviewer for their positive comments.

      Weaknesses:

      The lack of availability of codes and data for this manuscript, especially given that it critically examines and proposes alternative hypotheses for an important published work.

      We thank the reviewer for their comment. The code and data for our manuscript are an important aspect of the study, and we had intended to make them publicly available upon publication. The link to our code and data on figshare can be found here: (https://doi.org/10.6084/m9.figshare.28950032.v1). We will further add this link to the Data Availability Statement of our revised version.  

      Reviewer #2 (Public review):

      Summary:

      The manuscript investigates which social navigation mechanisms, with different cognitive demands, can explain experimental data collected from homing pigeons. Interestingly, the results indicate that the simplest strategy - route averaging - aligns best with the experimental data, while the most demanding strategy - selectively propagating the best route - offers no advantage. Further, the results suggest that a mixed strategy of weighted averaging may provide significant improvements.

      The manuscript addresses the important problem of identifying possible mechanisms that could explain observed animal behavior by systematically comparing different candidate models. A core aspect of the study is the calculation of collective routes from individual bird routes using different models that were hypothesized to be employed by the animals, but which differ in their cognitive demands.

      The manuscript is well-written, with high-quality figures supporting both the description of the approach taken and the presentation of results. The results should be of interest to a broad community of researchers investigating (collective) animal behavior, ranging from experiment to theory. The general approach and mathematical methods appear reasonable and show no obvious flaws. The statistical methods also appear.

      Strengths:

      The main strength of the manuscript is the systematic comparison of different meta-mechanisms for social navigation by modeling social trajectories from solitary trajectories and directly comparing them with experimental results on social navigation. The results show that the experimentally observed behavior could, in principle, arise from simple route averaging without the need to identify "knowledgeable" individuals. Another strength of the work is the establishment of a connection between social navigation behavior and the broader literature on the wisdom of crowds through the concept of effective group size.

      We thank the reviewer for their positive comments.

      Weaknesses:

      However, there are two main weaknesses that should be addressed:

      (1) The first concerns the definition of "mechanism" as used by the authors, for example, when writing "navigation mechanism." Intuitively, one might assume that what is meant is a behavioral mechanism in the sense of how behavior is generated as a dynamic process. However, here it is used at a more abstract (meta) level, referring to high-level categories such as "averaging" versus "leader-follower" dynamics. It is not used in the sense of how an individual makes decisions while moving, where the actual route followed in a social context emerges from individuals navigating while simultaneously interacting with conspecifics in space and time. In the presented work, the approach is to directly combine (global) route data of solitary birds according to the considered "meta-mechanisms" to generate social trajectories. Of course, this is not how pigeon social navigation actually works: they do not sit together before the flight and say, "This is my route, this is your route, let's combine them in this way." A mechanistic modeling approach would instead be some form of agent-based model that describes how agents move and interact in space and time. Such a "bottom-up" approach, however, has its drawbacks, including many unknown parameters and often strongly simplifying (implicit) assumptions. I do not expect the authors to conduct agent-based modeling, but at the very least, they should clearly discuss what they mean by "mechanism" and clarify that while their approach has advantages (such as naturally accounting for the statistical features of solitary routes and allowing a direct comparison of different meta-mechanisms), it is also limited, as it does not address how behavior is actually generated. For example, the approach lacks any explicit modeling of errors, uncertainty, or stochasticity more broadly (e.g., due to environmental influences). Thus, while the presented study yields some interesting results, it can only be considered an intermediate step toward understanding actual behavioral mechanisms.

      We thank the reviewer for their comment and thoughtful suggestions. We agree that the inherent behavioral mechanisms and the biological basis of these mechanisms cannot be determined from the navigational data alone. For instance, it remains unexplored whether pigeons adapt their behavior based only on social cues from their partners or also use other navigational features such as landmarks or roads, the location of the sun, geomagnetic cues or previously learnt routes. However, we do agree (as also pointed out by the reviewer) that these behavioral rules generate an emergent ‘meta-mechanism’ where the bird pairs behave as if their preferred routes are averaged during a flight. It will be important in future work to explore the biological basis of these mechanisms, but our current approach allows us to describe the mechanisms with confidence only in a meta sense. Considering this, we believe that our analysis is a more top-down approach towards describing the outcomes of these underlying mechanisms in an abstract sense. We would also like to point the reviewer to Dalmaijer, 2024 [1], who used a bottom-up approach with naive agents and showed that cumulative route improvements emerged in the absence of any sophisticated communication in the same dataset, in agreement with our approach. Considering these points, we will make changes in our revised version to clearly elaborate on what the definition of ‘mechanism’ should include, in line with the reviewer’s feedback.

      (2) While the presented study raises important questions about the applicability and viability of cumulative cultural evolution (CCE) in explaining certain animal behaviors such as social navigation, I find that it falls short in discussing them. What are the implications regarding the applicability of CCE to animal data and to previously claimed experimental evidence for CCE? Should these experiments be re-analyzed or critically reassessed? If not, why? What are good examples from animal behavior where CCE should not be doubted? Furthermore, what about the cited definitions and criteria of CCE? Are they potentially too restrictive? Should they be revised-and if so, how? Conversely, if the definitions become too general, is CCE still a useful concept for studying certain classes of animal behavior? I think these are some of the very important questions that could be addressed or at least raised in the discussion to initiate a broader debate within the community.

      We thank the reviewer for their comments and interesting questions regarding our study. We agree with the reviewer that our study opens up new avenues for critically analysing the criteria previous studies have used for providing evidence of CCE in non-human animals. In our literature review, we found that the field has usually thought about CCE in a ‘process’-focused manner (Reindl et al. [2]), in which individuals are able to compare strategies and select those resulting in higher individual fitness. This preferential selection of strategies, termed innovations, allows for the stereotypical ratcheting effect seen in CCE. In our study, we propose that in the case of homing pigeons, the ratcheting effect is more of a statistical outcome than a deliberate individual judgement. We believe that this strategy is also amenable to certain task types (which in our study was homing route choice) and may change for others (for example, solving a puzzle box), and that the task also needs to be sufficiently complex for animals to benefit from the use of social information (Caldwell and Millen, 2008 [3]). Thus, we recommend that future work address which classes of problems fit well within the definition of “emergent” CCE and which do not. Keeping this framework in mind, studies should clearly state what definition of CCE they are using and should be critically evaluated for their underlying task type and cognitive mechanisms before they are deemed CCE. Considering these points, we will expand our discussion to highlight these key questions, which could be critical for future research.

      References:

      (1) Dalmaijer ES (2024) Cumulative route improvements spontaneously emerge in artificial navigators even in the absence of sophisticated communication or thought. PLoS Biol. 22:e3002644.

      (2) Reindl, E., Gwilliams, A.L., Dean, L.G. et al. (2020) Skills and motivations underlying children’s cumulative cultural learning: case not closed. Palgrave Commun 6, 106.

      (3) Caldwell CA, Millen AE (2008) Studying cumulative cultural evolution in the laboratory. Phil. Trans. R. Soc. B 363:3529-3539.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript by Lopez-Blanch and colleagues, 21 microexons are selected for a deep analysis of their impacts on behavior, development, and gene expression. The authors begin with a systematic analysis of microexon inclusion and conservation in zebrafish and use these data to select 21 microexons for further study. The behavioral, transcriptomic, and morphological data presented are for the most part convincing. Furthermore, the discussion of the potential explanations for the subtle impacts of individual microexon deletions versus loss-of-function in srrm3 and/or srrm4 is quite comprehensive and thoughtful. One major weakness: data presentation, methods, and jargon at times affect readability / might lead to overstated conclusions. However, overall this manuscript is well-written, easy to follow, and the results are of broad interest.

      We thank the Reviewer for their positive comments on our manuscript. In the revised version, we will try to improve readability, reduce jargon and avoid overstatements.  

      Strengths:

      (1) The study uses a wide variety of techniques to assess the impacts of microexon deletion, ranging from assays of protein function to regulation of behavior and development.

      (2) The authors provide comprehensive analyses of the molecular impact of their microexon deletions, including examining how host-gene and paralog expression is affected.

      Weaknesses:

      Major Points:

      (1) According to the methods, it seems that srrm3 social behavior is tested by pairing a 3mpf srrm3 mutant with a 30dpf srrm3 het. Is this correct? The methods seem to indicate that this decision was made to account for a slower growth rate of homozygous srrm3 mutant fish. However, the difference in age is potentially a major confound that could impact the way that srrm3 mutants interact with hets and the way that srrm3 mutants interact with one another (lower spread for the ratio of neighbour in front value, higher distance to neighbour value). This reviewer suggests testing het-het behavior at 3 months to provide age-matched comparisons for del-del, testing age-matched rather than size-matched het-del behavior, and also suggests mentioning this in the main text / within the figure itself so that readers are aware of the potential confound.

      Thank you for bringing up this point. For the tests shown in Figure 5, we indeed decided to match the pairs involving srrm3 mutant fish by fish size since we reasoned this would be more comparable to the other lines, both biologically and methodologically (in terms of video tracking, etc.). However, we are confident the results would be very similar if matched by age, since the differences in social interactions between the srrm3 homozygous mutants and their control siblings are very dramatic at any age. As an example, this can be appreciated, in line with the Reviewer's suggestion, in Videos S2 and S3, which show groups of five 5 mpf fish that are either srrm3 mutant or wild type. It can be observed that the behavior of 5 mpf WT fish (Video S3) is very similar to that of 1 mpf WT fish pairs, with very small interindividual distances, while the difference with respect to the srrm3 mutant group (Video S2) is dramatic. We nonetheless agree that this decision on the experimental design should be clearly stated in the main text and figure legend and we have done so in the revised version.

      (2) Referring to srrm3+/+; srrm4-/- controls for double mutant behavior as "WT for simplicity" is somewhat misleading. Why do the authors not refer to these as srrm4 single mutants?

      This comment applies to Figure 4 as well as the associated figure supplements. We reasoned that this made the understanding of plots easier, but the Reviewer is correct that it can be misleading. As a middle ground, we have now changed Figure 4 to follow the nomenclature of Figure 3D (WD, HD, DD), which is further explained in the legend, but kept the original format in the figure supplements for consistency with the (many) other plots in those figures.

      (3) It's not completely clear how "neurally regulated" microexons are defined / how they are different from "neural microexons"? Are these terms interchangeable?

      Yes, they are interchangeable. We have now double checked the wording to avoid confusion and for consistency.

      (4) Overexpression experiments driving srrm3 / srrm4 in HEK293 cells are not described in the methods.

      We apologize for this omission. We now describe the data and associated methods in more detail in the revised version; however, please note that the data were obtained from a previous publication (Torres-Mendez et al, 2019), where the detailed methodology is reported.

      (5) Suggest including more information on how neurite length was calculated. In representative images, it appears difficult to determine which neurites arise from which soma, as they cross extensively. How was this addressed in the quantification?

      We have added further details to the revised version. With regard to the specific question, we would like to mention that this was not a very common issue at the time points used in the manuscript (10 hap and 24 hap). At those stages, it was nearly always evident how to track each individual neurite. Dubious cases were simply ignored and not measured, as we aimed for 100 neurites per well. Of course, such complex cases become much more common at later time points (48 and 72 hap), which were not used in this study.

      Reviewer #2 (Public review):

      Summary:

      This manuscript explores in zebrafish the impact of genetic manipulation of individual microexons and two regulators of microexon inclusion (Srrm3 and Srrm4). The authors compare molecular, anatomical, and behavioral phenotypes in larvae and juvenile fish. The authors test the hypothesis that phenotypes resulting from Srrm3 and 4 mutations might in part be attributable to individual microexon deletions in target genes.

      The authors uncover substantial alterations in in vitro neurite growth, locomotion, and social behavior in Srrm mutants but not any of the individual microexon deletion mutants. The individual mutations are accompanied by broader transcript level changes which may resemble compensatory changes. Ultimately, the authors conclude that the severe Srrm3/4 phenotypes result from additive and/or synergistic effects due to the de-regulation of multiple microexons.

      Strengths:

      The work is carefully planned, well-described, and beautifully displayed in clear, intuitive figures. The overall scope is extensive with a large number of individual mutant strains examined. The analysis bridges from molecular to anatomical and behavioral read-outs. Analysis appears rigorous and most conclusions are well-supported by the data.

      Overall, addressing the function of microexons in an in vivo system is an important and timely question.

      Weaknesses:

      The main weakness of the work is the interpretation of the social behavior phenotypes in the Srrm mutants. It is difficult to conclude that the mutations indeed impact social behavior rather than sensory processing and/or vision which precipitates apparent social alterations as a secondary consequence. Interpreting the phenotypes as "autism-like" is not supported by the data presented.

      The Reviewer is absolutely right. It was not our intention to imply that these social defects should be interpreted simply as autistic-like. It is indeed very likely that the main reason for the social alterations displayed by the srrm3 mutants is their impaired vision. We have now added this discussion point explicitly in the revised version. 

      Reviewer #3 (Public review):

      Summary:

      Microexons are highly conserved alternative splice variants, the individual functions of which have thus far remained mostly elusive. The inclusion of microexons in mature mRNAs increases during development, specifically in neural tissues, and is regulated by SRRM proteins. Investigation of individual microexon function is a vital avenue of research since microexon inclusion is disrupted in diseases like autism. This study provides one of the first rigorous screens (using zebrafish larvae) of the functions of individual microexons in neurodevelopment and behavioural control. The authors precisely excise 21 microexons from the genome of zebrafish using CRISPR-Cas9 and assay the downstream impacts on neurite outgrowth, larvae motility, and sociality. A small number of mild phenotypes were observed, which contrasts with the more dramatic phenotypes observed when microexon master regulators SRRM3/4 are disrupted. Importantly, this study attempts to address the reasons why mild/few phenotypes are observed and identify transcriptomic changes in microexon mutants that suggest potential compensatory gene regulatory mechanisms.

      Strengths:

      (1) The manuscript is well written with excellent presentation of the data in the figures.

      (2) The experimental design is rigorous and explained in sufficient detail.

      (3) The identification of a potential microexon compensatory mechanism by transcriptional alterations represents a valued attempt to begin to explain complex genetic interactions.

      (4) Overall this is a study with a robust experimental design that addresses a gap in knowledge of the role of microexons in neurodevelopment.

      Thank you very much for your positive comments on our manuscript.

      Reviewer #1 (Recommendations for the authors):

      Minor Suggestions

      (1) Axes are often scaled differently even between panels in the same figure. For example in Figure 5 - supplement 10, the srrm3_17 y axis scales from 0-20, while the neighboring panels scale from ~1-2.5. This somewhat underrepresents the finding that srrm3 mutants have much larger inter-individual distances. Similarly, in the panel above (src_1), the y-axis is scaled to include a single point around 17cm. As a result, it appears at first glance that the src_1 trials resulted in much lower inter-individual distance. Suggest scaling all of these the same to improve readability.

      While the Reviewer is certainly correct, after careful consideration we decided to keep autoscaled axes, prioritizing comparisons within plots (i.e. among genotypes within an experiment) over comparisons across plots (i.e. among experiments and lines).

      (2) Attention to italicizing gene names.

      Thanks.

      (3) In many points in the methods, we are instructed to "see below." Suggest directing the reader to a particular section heading.

      We found only one such instance, and we directed the reader to the specific section, as suggested.

      (4) In Methods, remove "in the corpus callosum." This is not an accurate descriptor for the site at which Mauthner axons cross.

      This is absolutely correct, apologies for this mistake.

      Clarify:

      (1) In the results section, "tissue-specific regulation was validated..." - suggest mentioning that this was performed in adult tissues / describe dissection in the methods.

      Added.

      (2) In the results section, the meaning of "no event ortholog" is not clear. Does this mean that a microexon does not have a human homolog? If so, suggest stating more clearly.

      Correct. We have added additional information.

      (3) In the results, the authors state that 78% of microexons are affected by srrm3/4 loss-of-function. Suggest stating the method used here (e.g. RNA-seq in mutants as compared to siblings)

      Added.

      (4) It is not clear what "siblings for the main founders" means, for example in 3D. Is this effectively the analysis of microexon knockouts across multiple independent lines? Are the lines pooled for stats, for example in 3C?

      The main founder corresponds to the line listed as _1 and is the default for experiments in which only one founder is used. We now explicitly state this.

      For 3C, the lines are not pooled for stats; the stats correspond only to the main founder for each line. However, for each main founder line, multiple experiments are usually analyzed together and the stats are done taking their data structure into account (i.e. not simply pooling the values).

      (5) The purpose and a general description of NanoBRET assays should be included in the results.

      We added the main purpose of the NanoBRET assays (testing protein-protein interactions).

      (6) Specify that baseline behavior is analyzed in the light.

      Added.

      (7) In Figure 4A, adult fish are schematized being placed into a 96-well plate. Suggest using the larval diagram as in Figure 6 for accuracy.

      Done.

      (8) In Figure 4, plot titles could be made more accessible, especially in 4 F. Suggest removing extraneous information / italicizing gene names, etc. In G, suggest writing out Baseline, Dark, and Light to make it more accessible. Same in 4B.

      We have implemented some of the suggestions. In particular, italics were not used, since we are referring to the founder line, not the gene.

      (9) Figure 6 legend B - after (barplots), suggest inserting the word "and", to make clear that barplots indicate host gene *and* closely related paralogs are indicated by dots.

      Done.

      (10) In methods: "To better capture all microexons..." This sentence is difficult to understand. Suggested edit: "we excluded *from our calculation?* tissues with known or expected partial overlap... from comparison (for example, ...).

      Done.

      (11) In the methods, "which were defined with similar parameters but -min_rep 2." Suggest spelling this out, e.g. "with similar parameters, but requiring sufficient read coverage in at least n=2 samples per valid tissue group, whereas we only required one.".

      Done.

      (12) RNA was extracted for event and knockout validations. What does event mean here?

      Event refers to the validation of the exon regulatory pattern in WT tissues. We added this information.

      Provide definitions for abbreviations:

      (1) (Figure 6) Delta corrected VST Expression.

      Done.

      (2) "Mic-hosting genes" paralogs.

      Done.

      (3) In Figure 1F, "emic" is not defined.

      Done.

      Misspellings:

      All corrected.

      (1) Figure 6B (percentile is spelled percentil).

      (2) Figure 6B legend (bottom or top decile*).

      (3) Figure 6D - Schizophrenia* genes.

      (4) In Zebrafish husbandry and genotyping: suggest "srrm3 mutants grew more slowly.".

      (5) In results, "reduced body size at 90pdf" > 90dpf.

      Reviewer #2 (Recommendations for the authors):

      (1) Characterization of microexon mutants (Figure 2): The semi-quantitative PCR with flanking primers (Figure 2, supplement1) is well-suited to assess successful deletion of the exon and enables detection of potential mis-splicing around the alternative segment. However, it does not quantify the impact on total transcript levels. The authors should complement those experiments with qPCR measures of the transcript levels - otherwise, it is difficult to link mutant phenotypes to isoforms (as opposed to alterations in the level of gene expression). This point is somewhat addressed in Figure 6 by the RNA Seq analysis but it might help to add data specifically in Figure 2.

      As the Reviewer says, this point is explicitly addressed in Figure 6, where we show the change in the host gene's expression that follows the removal of some microexons. We prefer to keep this in Figure 6, for consistency, as we believe this is not a direct (regulatory) consequence of the removal, but more likely a compensation effect.

      (2) Social behavior alterations in juvenile fish: The authors report "increased leadership" in Srrm3 mutant fish. However, these fish have impaired vision. Thus, "increased leadership" may simply reflect the fact that they do not perceive their conspecifics and, thus, do not follow them. The heterozygous conspecific will then mostly follow the Srrm3 mutant which appears as the mutant exhibiting an increase in leadership. Figure 5D suggests that Srrm3 del and het fish have the same ratio of "neighbor in front" which would be consistent with the hypothesis that the change in this metric is a consequence of a loss of following behavior due to a loss of vision. The authors should either adjust the discussion of this point or assess with additional experiments whether this is indeed a "social phenotype" or rather a secondary consequence of a loss of vision.

      The Reviewer is absolutely correct, and we have thus modified the short discussion directly related to these patterns.

      (3) The discussion centers on potential reasons why only mild phenotypes are observed in the single microexon mutants. One caveat of the phenotypic analysis provided in the manuscript is that it does not very deeply explore the phenotypic space of neuronal morphologies or circuit function. The behavioral and anatomical read-outs are rather coarse. There are no experiments exploring fine-structure of neuronal projections in vivo or synapse number, morphology, or function. Moreover, no attempts are made to explore which cell types normally express the microexons to potentially focus the loss-of-function analysis to these specific cell types. Of course, such analysis would substantially expand the scope of a study that already covers a large number of mutant alleles. However, the authors may want to add a discussion of these limitations in the manuscript.

      The Reviewer is correct. We aimed at covering this when referring to "(i) we may not be assessing the traits that these microexons are impacting, (ii) we may not have the sensitivity to robustly measure the magnitude of the changes caused by microexon removal". We have now added some of the specific points raised by the Reviewer as examples.

      (4) Note typos in Figure 6D: "schizoFrenia", "WNT signIalling"

      Done.

      Reviewer #3 (Recommendations for the authors):

      I only have a few minor suggestions for the authors.

      (1) It is interesting that a not insignificant number of microexon deletions (3/21) result in cryptic inclusions of intron fragments, and perhaps alludes to an as yet unreported molecular function of microexons in the regulation of host gene expression. Is it possible that microexon inclusion in these 3 genes could be important for expression? I think this requires some further discussion, as (if I'm not mistaken) microexons have thus far only been hypothesised to act as modulators of protein function, not as gene regulatory units.

      While we see that microexon removal can impact expression of the host gene (Figure 6), this is likely a compensatory mechanism (or so we suggest). We do not think these three cases are related to a putative physiological regulation, since the cryptic exons appear only in the deletion line. On the contrary, we think these are "regulatory artifacts" that originate in the non-WT mutated context. That is, we removed the exon but some splicing signals remained in the intron; these are then recognized by the spliceosome, which incorrectly includes a different piece of the intron.

      (2) The flow of the text accompanying the molecular investigation of microexon function for evi5b and vav in Figure 3 could be improved. The text currently fades out with a speculative explanation for the lack of evi5b interaction phenotype. This final sentence could be moved to the discussion and replaced with a more general summary of the data.

      We have now swapped the order in which these results are described and leave out the discussion about evi5b's microexon function.

      (3) Is this a co-submission with Calhoun et al? If so, both papers should reference each other in the discussion and discuss the relative contributions of each.

      Done

      (4) "1 × 104 cells" in methods Nanobret paragraph should be superscript.

      Done

    1. Cyrus conquered Babylon bloodlessly and became a sort of patron of the Jews. This relationship may have enhanced the influence of Cyrus' religion, Zoroastrianism, on the development of Jewish monotheism, as we will discuss shortly. Cyrus also planned and began building infrastructure like the Royal Road.

      Cyrus is such a fascinating leader! He conquered Babylon without bloodshed, supported the Jews, and even started building amazing projects like the Royal Road. It’s wild to think how his actions might have even influenced the development of Jewish monotheism!

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Reviewer #1

      Evidence, reproducibility and clarity

      SUMMARY

      In this study, Fernandes and colleagues addressed the question of the role of micro-RNAs in regulating the coupling between organ growth and developmental timing. Using Drosophila, they identified the conserved micro-RNA miR-184 as a regulator of the developmental transition between juvenile larval stages and metamorphosis. This transition is under the control of the steroid hormone Ecdysone, and has been shown to be modulated in case of abnormal tissue growth to adjust the duration of larval growth in response to developmental perturbations. The relaxin-like hormone Dilp8 has been identified as a key secreted factor involved in this coupling. Here, the authors show that miR-184 is involved in the regulation of Dilp8 expression both in physiological conditions and upon growth perturbation. They propose that this function is carried out in imaginal tissues, where miR-184 levels are modulated by tissue stress. While several factors have already been implicated in triggering sharp dilp8 induction at the transcriptional level, this study adds another level of complexity to the regulation of Dilp8 by proposing that its expression is fine-tuned post-transcriptionally through repression by miR-184.

      MAJOR COMMENTS

      Overall, the manuscript is well organized, and the logic of the experimental plan is well presented. The results are clear, and I appreciate the quality of the pupariation curves. However, I believe that two main conclusions of the paper are not fully supported by the results presented in the figures: the direct regulation of dilp8 3'UTR by miR-184, and the specificity of this regulation in imaginal discs. Here I develop these two aspects in more detail.

      Comment 1) The strategy of the 3'UTR sensor is not fully optimized. Indeed, in most experiments, qRT-PCR is used to assess dilp8 expression levels, although it reflects both transcriptional and post-transcriptional regulation. Importantly, to show that post-transcriptional regulation is involved in the response to tissue damage, the levels of the 3'UTR sensor should be analyzed in discs expressing RAcs (showing at the same time that the response is cell-autonomous in the discs). The expected upregulation of the sensor should be prevented by simultaneous expression of miR-184. This approach would shed light on the relative contribution of transcriptional versus post-transcriptional regulation of dilp8 in response to growth perturbation.

      Response: We thank the reviewer for this comment. We agree that qRT-PCR does not distinguish between transcriptional and post-transcriptional changes in dilp8 levels in response to changes in miR-184 levels and tissue damage. In addition to the qRT-PCR data, we examined the dilp8-3’UTR-GFP reporter in response to overexpression of miR-184 in the wing disc using the patched-Gal4 driver, which shows downregulation of the GFP reporter in the ptc domain (Fig 4C-D’). This suggests that dilp8 mRNA is a direct target of miR-184 through post-transcriptional regulation via its 3’UTR. Further, to confirm the specificity of the effect of miR-184 on the dilp8-3’UTR, we generated a dilp8-3’UTR mutant in which the single target site for miR-184 was mutated. We show that the mutated dilp8-3’UTR reporter is not regulated in response to miR-184 overexpression in the ptc domain of the wing disc (Fig. 4E, E’, F, F’). This experiment confirms the specificity of the dilp8-3’UTR regulation by miR-184.

      As suggested by the reviewer, we analysed dilp8-3’UTR-GFP reporter expression while overexpressing RicinA using the ptcGAL4 driver in the wing imaginal disc (Fig. S6F-G’). We observed a slight but consistent increase in dilp8-3’UTR-GFP reporter expression, indicating post-transcriptional regulation of dilp8 expression in response to tissue damage. However, the increase in reporter GFP levels observed in this experiment (Fig. S6F-G’) is milder than expected based on the qRT-PCR results (Fig S6A and B). We have added this new data to the manuscript (Fig. S6F-G’).

      We propose the following reasons to explain this result:

      a) both transcriptional and post-transcriptional regulation of dilp8 mRNA in response to developmental perturbations

      b) the data on 3’UTR reporter GFP is specifically from the ptc domain expression of RicinA, whereas for dilp8 transcript levels we have expressed RicinA in all larval imaginal tissues, or in the entire wing imaginal disc, which could be one of the reasons for the stronger effect seen on dilp8 mRNA levels

      c) we are not certain whether the tubulin-promoter-driven dilp8-3’UTR GFP reporter reflects post-transcriptional regulation of dilp8 by miR-184 as efficiently as qRT-PCR. In particular, because the tubulin promoter drives the reporter-GFP-3’UTR at very high levels, a majority of the reporter-GFP mRNA may not be relieved from degradation by the moderate suppression of miR-184 in response to RicinA overexpression.

      Thus, our experiments suggest that dilp8 levels are regulated post-transcriptionally by miR-184, which contributes to pupariation delays in response to tissue damage. In support of this, we could rescue the pupariation delays and dilp8 induction caused by RicinA expression by overexpressing miR-184 (Figs 5B, C). Thus, we confirm that post-transcriptional regulation by miR-184 during developmental perturbations also contributes to dilp8 induction and pupariation delays. Unfortunately, due to experimental limitations we could not perform simultaneous expression of RicinA and miR-184 to evaluate the rescue of dilp8-3’UTR-GFP sensor expression. The levels of dilp8-3’UTR sensor GFP are reduced efficiently by miR-184 overexpression (Fig 4D), which prevented us from attempting the rescue of the moderate increase of dilp8-3’UTR GFP levels in response to RicinA.

      Comment 2) In my opinion, the use of a 3'UTR sensor is not sufficient to conclude that the regulation by miR-184 is direct, as miR-184 could also regulate an intermediate factor that acts on dilp8 post-transcriptional regulation. To solve this issue, a common strategy is to generate a 3'UTR sensor with mutated binding sites that should abolish the regulation by miR-184. This mutated 3'UTR might also respond differently to tissue damage, which would strongly support the conclusions of the study.

      Response: We couldn’t agree more with the reviewer; this comment is addressed in the response to comment 1. We have confirmed the specificity of regulation of the dilp8-3’UTR by miR-184 using a dilp8-3’UTR with a mutated target site (new figures added to the manuscript, Fig. 4E, E’, F, F’). We also tested whether the changes in dilp8 mRNA levels in response to tissue damage are post-transcriptional and mediated by miR-184. We observe a slight but consistent increase of dilp8-3’UTR GFP reporter levels in the ptc domain of the wing disc in response to RicinA expression, suggesting a role for miR-184-mediated post-transcriptional regulation of dilp8. However, we have not yet tested the mutated dilp8-3’UTR GFP reporter in response to tissue damage.

      Comment 3) Concerning the tissue-specific regulation of Dilp8 by miR-184, these results need to be strengthened. Indeed, this comes mostly from phenotypes observed with rn-GAL4. Although this is a classical tool for driving expression in imaginal discs, rn-GAL4 also drives strong expression in other tissues that could contribute to triggering a delay, such as the CNS and part of the gut (proventriculus). In our hands, some growth phenotypes in the wing obtained with rn-GAL4 could be fully reverted by blocking GAL4 in the CNS indicating that the phenotype was not wing-specific. Importantly, miR-184 seems to be highly expressed in the CNS according to FlyBase, reinforcing the possibility that it plays a role in this organ. Here I propose approaches to confirm that miR-184 mediated regulation of dilp8 and developmental timing indeed occur in the discs:

      - Another driver with less secondary expression sites could be used (pdmR11F02-GAL4), or rn-GAL4 could be combined with an elav-GAL80 to prevent expression in most neurons.

      - The authors could identify the source of Dilp8 upregulation in miR-184 mutants using tissue-specific qRT-PCR instead of whole larvae expression like in Fig 4A-B.

      - This tissue-specific upregulation could be functionally tested using a rescue experiment, in which the delay observed in miR-184 mutants could be rescued by disc-specific downregulation of Dilp8 (using pdm2-GAL4 for instance).

      Response: We are thankful to the reviewer, and agree that it is important to show that the effects that we see using rn-GAL4 are specific to imaginal discs, and not due to an effect in the CNS. We tested this by expressing the miR-184 sponge in the CNS. Though miR-184 is highly expressed in the larval CNS, pan-neuronal downregulation of miR-184 using elav-GAL4 had no effect on pupariation timing. We have added this as supplementary data (Figure S4). Therefore, we believe that the miR-184 downregulation phenotype in the rn-GAL4 background can be mainly attributed to its role in the imaginal discs. In addition, as suggested by the reviewer, we have also demonstrated that downregulation of miR-184 in the imaginal discs using the rn-GAL4 driver leads to an increase in dilp8 expression (Fig S5B). This confirms that dilp8 mRNA levels increase in the imaginal discs when miR-184 is blocked.

      OPTIONAL: Because it is known that dilp8 is strongly regulated at the transcriptional level, the relative input from post-transcriptional upregulation is an important question arising from this study. Although it might be a more long-term approach, I believe that generating a Dilp8 mutant lacking its 3'UTR or, even better, with mutated miR-184 binding sites, would shed light on the role of this regulation for the response to growth perturbation and/or developmental stability (fluctuating asymmetry).

      Response: We thank the reviewer for the suggestion. This would have been an interesting experiment to carry out especially in the context of fluctuating asymmetry.

      MINOR COMMENTS

      1. I think that a number of results could be moved to SI as they are either controls, or reproduce published data without bringing novelty. For instance, results in Fig 5A-D are similar to data published by Sanchez et al, as stated in the text. Fig6A as well.

      Response: We thank the reviewer for this suggestion. Fig. 5A-D and F have been moved to Fig. S6A-E. We have also moved data from Fig. 6 to Fig. 5; as a result, Fig. 6A-D has become Fig. 5B-D.

      2. Fig 6D is quite mysterious, as it suggests that basal JNK activation regulates miR-184, which is different from a context of tissue damage. I think that this result could be removed. Alternatively, if the authors want to dig in that direction, more experiments should be provided, such as bskDN expression in an RAcs context and the effects on miR-184 levels and the 3'UTR sensor (since transcript levels are already published).

      Response: We would like to clarify that our experiments suggest that endogenous JNK signalling negatively regulates miR-184, as blocking basal JNK signalling using bskDN increased the levels of miR-184 (changed to Fig 5D). Enhanced JNK signalling has been reported to be involved in tissue damage responses, and we propose that a RicinA-mediated increase in JNK signalling leads to the reduction of miR-184 (changed to Fig 5A, S6D-E). However, we do not make this claim strongly, as we did not co-express RicinA and bskDN to show that JNK signalling is responsible for the drop in miR-184 levels in response to tissue damage. We thank the reviewer for seeking this explanation; we have rewritten the results section to improve clarity.

      3. The references related to Dilp8 should be checked more in detail in the intro and discussion. About Dilp8 and developmental stability: remove the ref to Colombani et al 2012, instead put Boone et al 2016 and add Blanco-Obregon et al 2022 (in addition to Garelli et al 2012 who initially identified this phenotype). About Lgr3 as the receptor for Dilp8: add Colombani et al, Current Biology 2015, and cite here Vallejo et al 2015, Garelli et al 2015. Among the important transcriptional regulators of Dilp8, Xrp1 could be mentioned (Boulan et al 2019, Destefanis et al 2022) as it plays a complementary function to JNK depending on the type of tissue stress.

      Response: We are really sorry for the glaring errors in citing appropriate references. We thank the reviewer for correcting this for us. We have made the necessary changes to the text.

      Significance

      GENERAL ASSESSMENT This study provides convincing data showing that the conserved microRNA miR-184 plays a role in regulating developmental timing in Drosophila through modulating the levels of Dilp8, a key factor in the coupling between tissue growth and developmental transitions. The results are convincing, but the general conclusions of the paper need to be strengthened regarding the direct regulation of dilp8 by miR-184 and the tissue-specificity of this interaction.

      ADVANCE Dilp8 is a key factor that modulates growth and timing in response to developmental perturbations and contributes to developmental precision in physiological conditions. As such, its regulation has been studied by different groups in the last decade, leading to the identification of several inputs for its transcriptional regulation. Here, the authors uncover a post-transcriptional regulation by miR-184, adding another level of regulation of Dilp8 that contributes to ensuring proper regulation of developmental timing, and opening the possibility that miR-184 might play similar roles in other species.

      AUDIENCE This study is of interest for researchers in the field of basic science, with a focus on developmental timing, tissue damage and biological function of microRNAs.

      REVIEWER EXPERTISE Drosophila, growth control, developmental timing, Dilp8.

      Reviewer #2

      Evidence, reproducibility and clarity

      Drosophila has helped to characterize the mechanisms that coordinate tissue growth with developmental timing. The insulin/relaxin-like peptide Dilp8 has been identified as a key factor that communicates the abnormal growth status of larval imaginal discs to neuroendocrine neurons responsible for regulating the timing of metamorphosis. Dilp8, derived from imaginal discs, targets four Lgr3-positive neurons in the central nervous system, activating cyclic-AMP signaling in an Lgr3-dependent manner. This signaling pathway reduces the production of the molting hormone, ecdysone, delaying the onset of metamorphosis. Simultaneously, the growth rates of healthy imaginal tissues slow down, enabling the development of proportionate individuals.

      In this manuscript "miR-184 modulates dilp8 to control developmental timing during normal growth conditions and in response to developmental perturbations" by Dr. Varghese and colleagues, the authors identify a new post transcriptional regulator of Dilp8. The authors show that miR-184 plays a pivotal role in tissue damage responses by inducing dilp8 expression, which in turn delays pupariation to allow sufficient time for damage repair mechanisms to take effect.

      Major points:

      Comment 1) In most of the experiments for percentage of pupariation, the 50% pupariation in control is around 110 hours AED in figures 1, 2 and 3. In figures 5 and 6 using the UAS Ricin, the controls are more around 90 hours AED. Why this discrepancy?

      Response: We thank the reviewer for asking for this clarification. The former experiments (Figs 1-3) were carried out at 25 °C, while the latter experiments with a cold-sensitive version of RicinA (UAS-RAcs), Figs 5 and 6 (now changed to Figs. 5 and S6 as suggested by reviewer #1), were carried out at 29 °C (permissive temperature). This difference in temperature altered pupariation timing. We apologise for not having mentioned this in the text; we have now made the necessary corrections to the methods section, clearly indicating this.

      Comment 2) What is the mechanism behind the expression of miR-184 in stress conditions? Is miR-184 also implicated in other conditions giving rise to a developmental delay (X-rays irradiation or animal bearing rasV12, scrib-/- tumors)?

      Response: We thank the reviewer for these questions.

      a) In response to developmental perturbations by RicinA, we believe that activation of JNK signalling controls miR-184 expression. We propose this because our experiments show that imaginal disc damage leads to enhancement of JNK signalling and an increase in dilp8 mRNA levels (as reported earlier by Colombani et al 2012; Sánchez et al 2019), and a simultaneous reduction of miR-184 (Figs. S6A, D, E). We have also performed new experiments showing that, in response to RicinA expression in the wing disc, there is a moderate increase in dilp8-3’UTR-GFP sensor expression (Figs. S6F-G’), indicating post-transcriptional regulation of dilp8 expression in response to tissue stress. We also show that RicinA-induced dilp8 expression and pupariation delay can be rescued by increasing miR-184 levels (Fig 5B and C), suggesting that the reduction of miR-184 in response to tissue damage contributes to the damage responses. In a separate experiment we show that blocking the endogenous JNK pathway by the expression of bskDN enhances miR-184 levels, suggesting that miR-184 is under the regulation of JNK signalling (Fig 5D). Hence, we speculate that during tissue stress, activation of JNK signalling leads to a reduction of miR-184 levels, which contributes to regulating the levels of dilp8 post-transcriptionally and results in pupariation delays. The text has been modified to explain this better.

      b) In a previous paper by Shu et al., 2017 (https://doi.org/10.18632/oncotarget.22226), decreased expression of miR-184 was observed in a lglRNAi; RasV12 tumor background. Apart from this, various studies have shown that dilp8 levels increase in response to tumour, radiation stress, apoptosis, and tissue damage (Yeom et al 2021, Ray et al 2019, Demay et al 2014, Katsuyama et al 2015, Colombani et al 2012, Garelli et al 2012). Whether the regulation of dilp8 by miR-184 occurs in these backgrounds is yet to be tested. We have now discussed this possibility in the manuscript.

      Comment 3) dilp8 mutant animals have also been shown to be more resistant to starvation or desiccation (https://doi.org/10.3389/fendo.2020.00461). Is miR-184 implicated in this answer?

      Response: We thank the reviewer for this question. In our earlier experiments miR-184 has been demonstrated to be regulated by nutrition in the larval stages and lack of miR-184 led to enhanced larval death in response to diet restriction (Fernandes et al., 2022). miR-184 was also demonstrated to play a role in the insulin producing cells (IPCs) in regulating lifespan (Fernandes & Varghese., 2022). In the current work, we propose miR-184 to act upstream of dilp8 in response to stress stimuli. Hence, it is possible that miR-184 might be involved in responses to starvation and desiccation stress in the adult female flies, by regulating dilp8 levels post-transcriptionally. However, it has not been tested yet if the miR-184 regulation of dilp8 plays a role in resistance to starvation or desiccation in adult females, as this was not within the scope of the current study. We have now added this reference in the discussion section.

      Comment 4) dilp8 expression has been also shown to be regulated by Xrp1 in response to ribosome stress (https://doi.org/10.1016/j.devcel.2019.03.016). This paper should be included in the manuscript. Is it possible that the expression levels of miR184 are regulated by Xrp1?

      Response: We thank the reviewer for the suggestion and have incorporated the reference into the paper. During ribosome stress in the larval imaginal discs, the stress-response transcription factor Xrp1 acts through dilp8 in regulating systemic growth. We agree with the reviewer that expression of miR-184 may be regulated by Xrp1. We have not yet explored this possibility. We have now added this to the discussion section.

      Minor points:

      1. Does the overexpression of miR184 induce an increased fluctuating asymmetry?

      Response: We thank the reviewer for asking this question. The role of dilp8 in fluctuating asymmetry is only observed in the dilp8 hypomorphic mutant background. To replicate this we would have to overexpress miR-184 in either the whole larvae or in the wing discs. Unfortunately, overexpression of miR-184 in the wing discs (using rnGAL4) leads to pupal lethality, whereas overexpression of miR-184 in the whole larvae leads to embryonic lethality; therefore, we were not able to conclude from our experiments whether miR-184 overexpression induces increased fluctuating asymmetry.

      2. There are 2 references Colombani et al. (2012 for Dilp8 and 2015 for Lgr3). Can you double check that they are used accordingly?

      Response: We thank the reviewer for pointing these errors out and we have incorporated these changes into the paper.

      Significance

      Altogether, the paper presents compelling lines of evidence supporting the proposed model. The experiments are well designed and are convincing. The paper is interesting and relevant for a broad audience.

      Reviewer #3

      Evidence, reproducibility and clarity (Required):

      This is an interesting study demonstrating an interaction between miR-184 and the Drosophila insulin-like peptide 8 (dilp8) in the tissue damage response. The authors show that Dilp8 activity is negatively regulated by miR-184, apparently through direct interaction between miR-184 and the dilp8-3'UTR, which leads to lower dilp8 mRNA transcript levels, via an undetermined mechanism, supposedly its degradation? Furthermore, the authors show that during aberrant tissue growth, miR-184 levels are very slightly downregulated (see comment below), and based on other experiments, imply causation of this with the increased dilp8 mRNA levels that occur in these tissues, again via an unclear mechanism: upregulation or stabilization of dilp8 mRNA. The authors present evidence that the JNK pathway, which had been known to be critical for dilp8 mRNA upregulation upon tissue damage, does so via miR-184.

      Major Comments:

      Comment 1: The data showing the direct regulation of dilp8-3'UTR by miR-184 are not very strong and would require more controls to strengthen the claim, as described below.

      Response: We have performed new experiments to validate that dilp8-3’UTR is regulated by miR-184. Please see the detailed responses to comments 10-12 below.

      Comment 2: The miR-184 effects are also very small (less than 2-fold reduction with tissue damage; or less than 2-fold induction with JNK-pathway inhibition via bskDN). These two points are the weakest part of the manuscript and model.

      Response: We agree with the reviewers on this point. The reduction in miR-184 levels in response to RicinA expression is modest (25–30%), and the induction of miR-184 in response to bskDN expression is less than two-fold (Figs. 5A and D). In contrast, dilp8 transcript levels increase several-fold in response to RicinA expression (Fig. 5C, S6A and B). Since we measure dilp8 transcript levels by qPCR, we detect both transcriptional and post-transcriptional contributions to dilp8 regulation. In addition, we have performed a new experiment to check the post-transcriptional regulation of dilp8 in response to tissue damage. Though the change in the dilp8-3′UTR GFP reporter upon RicinA expression in the ptc domain of the wing disc is mild (Figs. S6F-G’), this strongly suggests a post-transcriptional outcome of the reduction of miR-184 levels on dilp8. Hence, we propose that tissue damage induces strong transcriptional activation of dilp8, while the reduction of miR-184, despite its smaller magnitude, contributes to dilp8 upregulation via post-transcriptional regulation. In support of this, our experiments demonstrate direct regulation of the dilp8-3′UTR by miR-184 (Figs. 4C-F’), and show strong dilp8 mRNA upregulation in miR-184 deficient conditions (Fig. 4A and B), suggesting a role for miR-184 in maintaining dilp8 levels. We also show that RicinA-induced effects on dilp8 and pupariation delay are reversed by co-expression of miR-184 (Fig. 5C). We do not claim that regulation by miR-184 is the sole mechanism driving dilp8 induction during tissue damage, but suggest that miR-184-mediated post-transcriptional regulation acts in a complementary manner to transcriptional responses. Furthermore, we believe that the mild effect of JNK signaling on miR-184 (as shown by the bskDN experiment) is sufficient for the moderate reduction of miR-184 in response to tissue damage.

      Comment 3: Regarding the expression levels, it does not help that the authors show bar graphs with standard errors of the mean instead of the actual data points to allow reliable appreciation of the data dispersion.

      Response: We have modified our figures and have performed statistical analysis according to the suggestions of the reviewers, please see responses to comments 1-9, and 13-19.

      Comment 4: It is difficult to understand how minute changes in miR-184 levels can lead to over an order of magnitude differences (in some cases) in dilp8 mRNA levels considering that it is a stoichiometric relationship. Maybe ?miR-184-Dicer1? complexes are highly stable and re-used for multiple dilp8 transcripts - the authors could discuss how they understand this occurring in their manuscript.

      Along the same lines, the discussion is also rather weak regarding the mechanism of control of dilp8 mRNA levels by miR-184. Please discuss, eg, the evidence for mRNA degradation induction by microRNAs with this UTR binding profile (imperfect UTR binding, Fig S4) and, if appropriate, how other possible regulatory models (direct and indirect) could explain the findings.

      Response: We accept the reviewers' comment that a 25-30% reduction of miR-184 is low in comparison to the many-fold increase in dilp8 levels. We believe that both post-transcriptional and transcriptional changes are responsible for the induction of dilp8 in response to tissue damage. However, our experiments suggest a role for post-transcriptional regulation by miR-184, as pupariation delay is rescued by miR-184 overexpression (please also see the response to the previous comment). We are not ruling out transcriptional regulation of dilp8 mRNA; rather, we are suggesting that both transcriptional and post-transcriptional means are responsible for changes in dilp8. Moreover, we have not performed absolute measurements of miR-184 in the imaginal discs (what we show is a comparison between control and RicinA expression). Hence, we do not have an exact estimate of how many miR-184 molecules are lost, or of whether that number is equal to or greater than the number of dilp8 mRNA molecules that are gained, since our dilp8 mRNA measurements are likewise relative rather than absolute. As the reviewer suggests, it is possible that miR-184-RISC is stable enough to act on multiple dilp8 molecules one after the other, so the relationship between miR-184 and dilp8 need not be 1:1. We have included this in the manuscript. It is also known that imperfect 3’UTR binding, as seen for most animal microRNAs, leads to translational repression and mRNA deadenylation, which eventually results in mRNA degradation.

      Comment 5: We suggest the authors carefully revise their citations to cite appropriate work that supports the claims, and also to avoid missing the seminal studies that report the claims they cite.

      Response: We apologize for the errors in citing the key references and are grateful to the reviewers for correcting them. We have made changes to the text to include and correct the references.

      We have the suggestions below which we hope will help the authors improve their manuscript. If the authors address these points raised above, we believe the manuscript should be a valuable contribution to the field, and help in the understanding of how tissues respond to growth aberrations and the regulation of transcript levels by microRNAs.

      Detailed Comments:

      Comment 1. Results 1st paragraph: please describe the screen in more detail. As written, one only discovers it was a miRNA loss-of-function screen when reading the legend of Table S1. Please show the original data of the screen - with dispersion if possible.

      Response: We thank the reviewers for these suggestions, we have now included the data from the screen with SEM, and p-values.

      Comment 2. Results 1st paragraph, Fourth line, "While several miRNAs caused delays in pupariation by 12 hours or more..". Please correct, as actually loss of miRNAs caused delays.

      Response: We thank the reviewer for pointing out this error, we have corrected the text accordingly.

      Comment 3. Results (Figure 1) - It says that data from three independent experiments are shown. However there is no dispersion in the data. Could the authors please explain this? Are the results of the three experiments summed and presented as one? or is this one of the three?

      Response: We thank the reviewers for these suggestions and have plotted data with the SEM values.

      Comment 4. It is reported in the legend of Figure S2 that LogRank test was performed to determine statistical significance. However, no statistical data is presented. Please show the results.

      Response: We thank the reviewers for these suggestions to improve the data presentation, we have incorporated the p-value as suggested.

      Comment 5. Fig2A and B. Please show the data points in the bar graphs (as in Figure 2C), or choose another data representation. Please consider redoing statistical analysis with a simple t-test. It is not clear to me why ANOVA was used to compare two samples. Please state that data are normalized also to control (tub-GAL4>UAS-scramble). Please state the h post-hatching from which the RNA samples were collected (as in Fig 2C for 20HE quantification).

      Response: We thank the reviewers for these suggestions to improve the data presentation, we have incorporated all changes as suggested. Similar changes have been incorporated to the rest of the figures of the manuscript as well. Hours post-hatching information for each figure is now added to the figure legends.
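As an illustrative aside on the statistical point raised in Comment 5: for exactly two groups, a one-way ANOVA and a pooled-variance (Student's) t-test are equivalent, because the ANOVA F statistic equals the square of the t statistic. The minimal Python sketch below checks this identity. The function names and the fold-change values are hypothetical and are not taken from the manuscript; this is not the authors' analysis code.

```python
from statistics import mean


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic using the pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical normalized expression values (illustration only).
control = [1.0, 1.1, 0.9, 1.05]
treated = [1.6, 1.8, 1.7, 1.75]

t_stat = pooled_t_statistic(control, treated)
f_stat = one_way_anova_f([control, treated])
print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
```

Under these assumptions the two tests give the same p-value for two groups, so the choice matters mainly for clarity of reporting. The identity does not hold for Welch's unequal-variance t-test.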

      Comment 6. Fig2C. Fig legend states the bar graphs are "absolute values". Please specify if the bar represents the average, median or something else.

      Response: We thank the reviewer for pointing this out, we have made the suggested changes.

      Comment 7. Throughout the manuscript: please use GAL4 in capital letters or at least standardize it throughout the ms. Currently there are GAL4s and Gal4s.. eg compare Fig 2 and 3 legends.

      Response: We thank the reviewer for pointing this out, we have incorporated all changes as recommended.

      Comment 8. FigS3A and B. Please revise as Fig2A and B above. and apply the same criteria in the respective figure legend.

      Response: We thank the reviewer for pointing this out, we have made the changes as recommended.

      Comment 9. Fig. 4 - please indicate on the figures what is whole larvae and what is wing imaginal discs. This will facilitate understanding of the figure.

      Response: We thank the reviewers for these suggestions and have included this information in all the figures.

      Comment 10. Fig 4 - Data - Authors do not show that rn-GAL4>miR-184-sponge causes up regulation of dilp8 mRNA levels, hence the model is weakened. Doing this experiment would significantly strengthen the study whatever the result is.

      Response: We thank the reviewer for pointing this out and we have included this in the manuscript (Fig S5B).

      Comment 11. The dilp8-3'UTR experiment is weak especially because its generation is not sufficiently well described in the manuscript. "The dilp8 3'UTR-GFP reporter line was created as described in (Vargheese & Cohen, 2007)" is not sufficient. Please describe the construct generation in sufficient detail so that the experiments can be reproduced by others.

      Response: We thank the reviewer for pointing this out and we have elaborated in the methods section on how we generated the dilp8 3'UTR-GFP reporter and dilp8 3'UTR mutant GFP reporter lines. The plasmid was originally created in Steve Cohen’s lab at EMBL by modifying the pCasper4 plasmid to introduce a tubulin promoter, EGFP and a multiple cloning site, which allows one to clone 3’UTRs of target genes into this plasmid. NotI and XhoI sites were used to clone the dilp8-3’UTR and mut-3’UTR. We hope this explains our strategy sufficiently.

      Comment 12. Making assumptions, if the construct is as described in Vargheese & Cohen, 2007 and contains all of the dilp8 3'UTR - it should be a Tubulin-driven GFP gene with a dilp8-3'UTR "Tub-GFP-(dilp8 3'UTR)". In this case the authors need to rule out the alternative interpretation of the result in Fig. 4D by showing that the expression of miR-184 does not down regulate Tub-GFP expression itself. The best scenario would be to have a mutated dilp8 3'UTR for the miR-184 recognition site. This experiment would significantly strengthen the study and model.

      Response: We thank the reviewer for pointing this out. We agree with the reviewers that this experiment is needed to prove direct regulation of the dilp8-3’UTR by miR-184. We have mutated the sequences complementary to the seed region of miR-184 in the dilp8-3’UTR, and demonstrated that overexpression of miR-184 does not regulate the mutated tub-GFP-(dilp8 3'UTR) expression. This confirms that the dilp8 gene is a direct target of miR-184. This data is added to the manuscript as Figs 4E-F’.

      Comment 13. Figure 4C-D please separate dilp8 from 3'UTR with a space or hyphen.

      Response: We thank the reviewer for pointing this out and have separated dilp8 from 3’UTR with a hyphen.

      Comment 14. Figure 4E. Please name the dilp8 allele as MI00727 as it is not a KO, but rather a hypomorphic mutation (fully WT dilp8 transcripts are still generated, albeit at a much lower level).

      Response: We thank the reviewer for pointing this out and we have made the necessary changes.

      Comment 15. Figure 6D: please add UAS to bskDN/+. All figures have rn-GAL4 alone or with UAS-GFP as control. This finding would be strengthened with this other control, especially because the size effect is small. This being said a general comment for all experiments is that hemi-controls are generally missing for all figures. eg, in Fig 3. One would typically include controls such as A. Phm>+ and +>miR.184; B. aug21>+ and +>miR.184; C. ptth>+ and +>miR.184; D. rn>+ and +>miR.184

      Response: We thank the reviewer for pointing this out. We have added UAS to bskDN (now Fig 5D) and have also added the rnGAL4/+ control. We have also performed various hemi-control experiments as suggested by the reviewer, to the best of our ability. We have added a separate graph with the hemi-controls as Reviewer Response Figure 1.

      Comment 16. Figure 7: Are IPCs necessary for the model? If not, I suggest removing them and placing the Lgr3 neuron cell bodies much more anterior in this scheme. Their cell bodies are as anterior and rostral as it gets, approximately where the IPCs are depicted in this type of view of the CNS.

      Response: We thank the reviewer for pointing this out and have removed IPCs from the figure, this figure is now labelled as Fig. 6.

      Comment 17. Table S1- It would be preferable to see the data of these experiments, but if the authors prefer to show this data in a table, please at least add the dispersion analyses (eg standard deviation.. OR median+-quartiles OR Confidence intervals..), N of animals analysed, and statistics against controls.

      Response: We thank the reviewer for pointing this out, we have added the number of larvae analysed, SEM values and statistics against the control condition.

      Comment 18. In all figures with pupariation time: please also indicate significant findings in the graphs (with an asterisk, for instance) and adjust figure legends accordingly. This could facilitate understanding the data.

      Response: Thanks for the suggestion. We have incorporated this information into figure legends.

      Comment 19. Please revise Figure legends for punctuation.

      Response: We have rectified all the errors in punctuation. We thank the reviewers for suggesting this.

      Comment 20.

      a) Abstract:

      Line 10: What is the evidence to call Dilp8 a "paracrine" factor?

      Response: We thank the reviewer for pointing this out, we have changed the text to ‘secreted factor’.

      b) Introduction:

      4th paragraph, 3rd sentence " Dilp8... buffers developmental noise and delays pupariation..." Buffering of developmental noise was first shown in Garelli et al., Science 2012, so this publication should be cited. 4th paragraph, 5th sentence: please include Jaszczak et al., Genetics 2016. This paper was published together with the 2015 papers, just a matter of timing that it got a 2016 date. Moreover, I do not think Katsuyama et al., 2015 is well cited to back up the statement in this sentence, hence I recommend removing that citation in this sentence.

      Response: We thank the reviewer for pointing this out and have made necessary changes.

      c) 6th paragraph: 5th line "targeting dilp8" : please specify if you mean the gene or the mRNA, or both. Same for line 7.

      Response: We thank the reviewer for pointing this out and have made necessary changes.

      d) Results Page 10, 1st paragraph, 1st sentence: the works cited are not the appropriate studies that demonstrated what is being stated. This was shown in Garelli et al., Science 2012 and Colombani et al., Science 2012. Results Page 10, 1st paragraph, line 11: Please also cite Colombani et al., Science 2012, who first showed that JNK is required for dilp8 regulation.

      Response: We thank the reviewer for pointing this out and are extremely apologetic for this oversight. We have made necessary changes to the manuscript.

      e) Discussion, 2nd paragraph, line 4: again, please indicate the rationale for using "paracrine" to describe Dilp8's activities. The current widely accepted model is that Dilp8 acts on interneurons in the brain (eg, reviewed in Juarez-Carreno et al., Cell Stress, 2018; Gontijo and Garelli, Mech Dev, 2018; Mirth and Shingleton, Front Cell Dev Biol, 2019; Texada et al., Genetics 2020; Boulan and Leopold, 2021). In order to reach the brain, Dilp8 has to be secreted from the discs and travel to the brain. This is as endocrine a mechanism as it gets for a small larva, considering that some discs can be on the opposite side of the larva (eg, genital discs). While this does not exclude that Dilp8 could also act paracrinally, the only evidence that I am aware of comes from other contexts such as transdetermination, where Dilp8 has been proposed to work in an autocrine or paracrine fashion via Drl in imaginal discs (Nemoto et al., Genes to Cells, 2023). However, this is not cited appropriately in this manuscript and is less related to the Lgr3-dependent pathway being studied here.

      Response: We fully agree with the reviewer and appreciate the clarification. We have made necessary changes to the text.

      f) Discussion Page 13, 1st paragraph, This claim is supported by data presented in Garelli et al., Science 2012, not the other two papers. Garelli et al., 2015 shows that the Lgr3 receptor also participates in buffering developmental noise. Other studies have corroborated the Garelli et al., 2012 finding (eg, Colombani et al., Curr Biol 2015; Boone et al., Nat Commun 2016; Blanco-Obregon et al., Nat Commun 2022). Many other studies have shown that Dilp8 promotes developmental stability under tissue stress and challenges.

      Discussion Page 12, 3rd paragraph, 2nd sentence: "The Lgr3 neurons directly interact with ... PTTH ...and insulin-producing neurons" Please cite Colombani et al., 2015 and Vallejo et al., Science 2015. Vallejo et al. propose that circuit with insulin-producing neurons. In the 3rd sentence, only Jaszczak et al., 2016 is cited, whereas this claim/model comes from many studies, such as Halme et al., Curr Biol, 2010; Hackney et al., PLoS One 2012; Garelli et al. Science 2012; Colombani et al., Science, 2012; and the Lgr3 papers from 2015. Jaszczak et al. actually propose that Lgr3 is also required in the ring gland in addition to neurons.

      Discussion page 14, last paragraph, line 10, "In Aedes aegypti ....regulates ilp8 (Ling et al., 2017)". As far as I understand mosquitoes do not have a dilp8 orthologue (see for instance Gontijo and Garelli, Mech Dev 2018; and Jan Veenstra's work). ilp nomenclature (numbering) does not follow that of Drosophila, so ilp8 is probably a typical Insulin/IGF-like peptide and is NOT an orthologue of Dilp8, a relaxin, so this citation needs to be removed or placed into the broader context of microRNA regulation of ilps.

      Response: We are really sorry for the numerous glaring errors in the references. We thank the reviewers for correcting this for us. We have made necessary changes to the text.

      Thank you for the opportunity to review your interesting work,

      Alisson Gontijo and Rebeca Zanini

      Reviewer #3 (Significance (Required)):

      If the authors address these points raised above, we believe the manuscript should be a valuable contribution to the field, and help in the understanding of how tissues respond to growth aberrations and the regulation of transcript levels by microRNAs.

      Authors' concluding response:

      We thank all the reviewers for the overall positive comments and suggestions that we believe have helped us to improve our manuscript. We have incorporated all the changes suggested, especially regarding errors in citing key references. We have performed most of the experimental suggestions. Also, we have modified the way in which graphs are presented, including statistical tests as suggested by the reviewers. Several controls have been performed to strengthen the manuscript further. We believe that this review process aided in significantly improving this manuscript.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      1. General Statements

      We thank the reviewers for their positive comments regarding the research article titled "The Ketogenic Diet Metabolite β-Hydroxybutyrate Promotes Mitochondrial Elongation via Deacetylation and Improves Autism-like Behaviour in Zebrafish" by Uddin GM and colleagues. We appreciate your input, and we address these comments below with specific responses to each point raised by the reviewers.

      The main changes in the updated manuscript are as follows:

      We have revised the introduction to now incorporate additional background information on mitochondria, NAD, and mitochondrial dynamics and function. This addition aims to provide readers with a broader understanding of the mitochondrial context in relation to our study.

      Furthermore, we recognize that previous studies have explored mitochondrial function in the context of the ketogenic diet. While our specific investigation centered on mitochondrial morphology, we acknowledge the importance of comprehensively investigating mitochondrial function. To this end, we have added new data showing how BHB impacts mitochondrial oxidative phosphorylation in HeLa cells (Sup Fig 2), and how both BHB and NMN impact oxygen consumption/glycolysis in zebrafish (Fig 7).

      We have also added new behaviour analysis of the zebrafish (Fig 6), and have re-framed the discussion around neurodevelopment generally, rather than ASD specifically.

      Finally, we have now included a section in our manuscript that discusses the limitations of our study. These limitations can be further investigated to explore and characterize the full mechanistic potential behind the effects of the ketogenic diet and/or NMN on mitochondrial dynamics.

      2. Point-by-point description of the revisions


      *Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Uddin GM and colleagues presented a research article entitled 'The Ketogenic Diet Metabolite β-Hydroxybutyrate Promotes Mitochondrial Elongation via Deacetylation and Improves Autism-like Behaviour in Zebrafish'. The roles of the ketogenic diet (KD) and NAD+ precursors in health promotion and longevity, as well as in the alleviation of a broad range of diseases, are evident. However, their roles in autism are not well studied, which is the novelty of the current study. Addressing the questions below will improve the quality of the paper.

      Major concerns 1. In the introduction section, a broad overview of the roles of ketogenic diet (KD) in neurodegenerative disease (and ageing, if possible) should be provided. E.g., the authors should summarize exciting progress on the use of KD to treat Alzheimer's disease in animal models (PMID: 23276384). *

      Response: Thank you for your valuable suggestion. While it is true that the KD appears to be beneficial in neurodegenerative (and other disease) models, our focus in this paper is looking at neurodevelopment, rather than all potential benefits of the KD. Nonetheless, we have addressed this comment by incorporating a brief overview of the roles of the KD in neurodegenerative diseases, including Alzheimer's disease (AD), in the introduction section of the manuscript. Specifically, we have summarized the exciting progress made in utilizing KD to treat AD in animal models, as highlighted in the suggested study. This addition helps to provide a better overview of the potential therapeutic effects of KD in neurodegenerative diseases and strengthens the introduction section of the manuscript.

      • The roles of high-fat diet in treating diseases could be extended to rare premature ageing diseases. In such a scenario, high-fat diet and NAD+ boosting share some joint mechanisms (PMID: 25440059). *

      Response: This information and the reference are now added to the discussion.

      *In the introduction, a more detailed introduction of NAD+ and its roles in mitochondrial homeostasis (especially mitophagy and the mitochondrial fusion-fission balance) should be included (PMID: 24813611; PMID: 30742114; PMID: 31577933). *

      Response: Although our paper focused primarily on mitochondrial fission and fusion, we have incorporated a new paragraph in the introduction to provide a more detailed introduction detailing NAD+ and its roles in mitochondrial homeostasis, specifically highlighting mitophagy. We have included the suggested references.

      • Regarding the statement that KD increases NAD+: was this due to increased generation (this could be checked via the protein levels and activities of the NAD+ synthetic enzymes, such as iNAMPT, NMNAT1-3, and NRK) and/or reduced consumption (in addition to reduced glycolysis, does KD inhibit the activities of CD38 and PARPs; in this paper, sirtuin activity is increased)? Detailed exploration of the activities of these proteins would reveal a clear molecular mechanism for how KD regulates NAD+. *

      Response: Thank you for the comment. We agree that exploring the detailed mechanism of how the ketogenic diet (KD) affects NAD+ is an interesting question that will have important implications once answered. However, fully elucidating the mechanism of action would require a more comprehensive investigation, which is beyond the scope of this current project. We have now added this as a future direction in the manuscript.

      *Fig. 1: in the NAD+ field, the NR/NMN concentrations normally used are high, typically 500 µM to 2-5 mM (as the NAD+ levels in cells are high). In addition to using 50 µM, the authors are strongly encouraged to perform a dose-dependent study (50 µM, 500 µM, 1, 2, 5 mM) and look for changes in mitochondrial function and parameters. In this condition, NAD+ levels should also be checked. *

      Response: We have added new supplemental data showing the initial dose response of the effects of BHB and NMN on mitochondrial morphology, which led us to choosing the relevant doses for the remainder of the paper. Our objective was not to investigate the broad impacts of different NMN concentrations on mitochondrial function and parameters, or NAD+ levels. As such, we have only focused on doses where we see effects on mitochondrial morphology.

      *Fig. 2: a comprehensive characterization of mitochondrial fusion-fission should be performed. In addition to the proteins evaluated, changes in other key fusion-fission proteins, like Bax, Bak, Mfn-1, Mfn-2, etc., should be assessed (PMID: 17035996; PMID: 24813611). *

      Response: We agree that looking at other key proteins involved in mediating mitochondrial fission and fusion could provide additional insight. Indeed, given the changes in global acetylation that we see, it is expected that some other proteins may also be regulated in this way. However, there are at least a dozen proteins involved in mediating mitochondrial fusion and fission, not to mention many more proteins that regulate these proteins. Unfortunately, it is not feasible to analyze all the proteins involved in mitochondrial fusion-fission. Moreover, looking only at protein levels doesn't necessarily inform about the activity of any protein. Instead, we concentrated in this paper on investigating known links between protein acetylation and mitochondrial dynamics, particularly focusing on the proteins that have known links to acetylation (i.e., DRP1, OPA1, MFNs). We have added a note in the discussion acknowledging that other means of regulation could also be occurring in parallel.

      *Figs. 1-5 were focused on mitochondrial morphology. Whether KD and NMN changed mitochondrial function should be explored, for example by using Seahorse to check ECAR and OCR. *

      Response: Although our question was focused on morphology, we agree that mitochondrial function is important. We have added new data showing that BHB increases basal oxygen consumption in HeLa cells (Sup Fig 2), as well as new data showing that BHB and NMN influence oxygen consumption and glycolysis in our zebrafish model (Fig 7).

      • Fig. 6: NR/NMN concentrations used in animal studies (via gavage or in drinking water in mice, and on plates for worms and flies) are normally high (e.g., in drinking water for mice they could be 4-12 mM; for worms and flies they are normally 1-5 mM). For zebrafish, which are swimming in water, this reviewer questions whether 50 µM of NMN was truly sufficient to show the benefit presented.*

      Response: Our data show that these doses are indeed sufficient. We did look at some higher doses for NMN, but these were toxic, leading to poor survival and were not studied further.

      *Minor concerns 1. Line 26: For 'a growing list of neurological disorders, including autism spectrum disorder (ASD)', please add AD in. *

      Response: Line 26 is part of the abstract, which we feel should be focused more on the main message of the paper, which does not involve AD. As addressed above, we have added AD as an example in the introduction.

      *Line 57: For 'with side effects such as gastrointestinal disturbances, nausea/vomiting, diarrhea, constipation, and hypertriglyceridemia being reported', frequencies should be provided if available. *

      Response: We have modified the statement to indicate the relative percent of patients suffering the various side effects.

      *Reviewer #1 (Significance (Required)):

      The novelty of the current study was to investigate effects of KD and NAD+ on autism. This investigation was not performed before and thus is the novelty.

      Weakness: the effects of KD and NAD+/NMN on mitochondrial function were not well investigated and should be examined. The introduction was not well done; much key information in the field was not provided, which may mislead readers into overestimating the novelty of the current study.*

      Response: As outlined above, we have edited the introduction to include additional information requested by the reviewer. Moreover, our focus in this manuscript was to look at the mechanisms underlying changes in mitochondrial morphology, not mitochondrial function per se, though this is clearly important and related. Nonetheless, as discussed above, we have also added new data showing how BHB impacts mitochondrial function.

      *My expertise lies in NAD+, mitochondria, and brain health.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The study examined the effect of beta-hydroxybutyrate and nicotinamide mononucleotide on mitochondrial morphology and the molecular pathways which mitigate this effect as well as the effect of these treatments on behavior in zebrafish. The study is well done and well written. The only thing I think could be improved is the bars in the graphs for some of the significant comparisons. It is sometimes difficult to see which groups are being compared.*

      Response: We're happy to adjust how the data is displayed in the relevant bar graphs, but it is not clear exactly what changes the reviewer would like. To some degree this will depend on the specific guideline of the final journal where we hope the manuscript will be published. As such, we have not made changes at this point.

      ***Referees cross-commenting**

      The other reviewers do have some fair comments. Multiple doses would be helpful and showing bioenergetic data would complement the morphological measurements. Additionally, behavioral assays showing changes in social behavior in the Zebrafish would provide a stronger link to ASD. *

      Response: As discussed above, we have added new information on doses and mitochondrial bioenergetics. With respect to behaviour, we have added thigmotaxis data and reworked the discussion around behaviour and neurodevelopment so that it is less specific to ASD.

      *Reviewer #2 (Significance (Required)):

      As beta-hydroxybutyrate is an important substrate for the ketogenic diet, this study helps explain the potential mechanisms in which the ketogenic diet may enhance mitochondrial function.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      In this paper, Uddin and colleagues have investigated components of the ketogenic diet to understand changes in both mitochondrial morphology and protein expression, and zebrafish locomotor behaviour. They investigate whether beta-hydroxybutyrate (BHB) or nicotinamide mononucleotide (NMN) application can alter human mitochondria in HeLa cell lines, and also rescue a locomotion defect in shank3b+/- zebrafish larvae that have previously been proposed as a model for autism. This study is strengthened by showing data from two species; however, the link between the HeLa cell line data and larval zebrafish is not strong. The study would be improved by assessing zebrafish mitochondrial changes after drug application, and testing more than one concentration of BHB and NMN in the behavioural assay. This is an interesting study, and it is nicely written and presented. I have made some comments to strengthen the study below.

      Major comments My expertise is in modelling some aspects of autism in zebrafish. To this end I have focussed on the zebrafish part of this manuscript more fully. I have several comments related to the zebrafish experiments. 1. The changes in mitochondrial morphology, peroxisome number and mitochondrial protein levels were measured in HeLA cells and not comparable data is shown for zebrafish. The same experiments should be repeated using larval zebrafish or a zebrafish cell line. *

      Response: We chose to use HeLa cells for the mechanistic studies due to practical reasons. Cell lines offer a controlled and well-established system for investigating cellular processes and molecular mechanisms. Measuring these parameters in tissues is significantly more challenging and requires different reagents (e.g., antibodies) and methodology (electron microscopy) that are not feasible in the current study.

      On the other hand, zebrafish larvae were employed for the behavior studies, which cannot be conducted using cell lines. By utilizing zebrafish, we were able to examine the effects of beta-hydroxybutyrate (BHB) and nicotinamide mononucleotide (NMN) on locomotor behavior, providing valuable insights into potential therapeutic implications for autism.

      While we acknowledge the limitations of not directly measuring mitochondrial morphology, peroxisome number, and mitochondrial protein levels in zebrafish, we believe that our study provides significant contributions to understanding the effects of BHB and NMN in zebrafish behavior. Future studies could certainly consider incorporating zebrafish-specific experiments to complement the findings in HeLa cells.

      • How did you choose the concentration of BHB and NMN to use in behavioural experiments? And the timing of application - I don't really understand why you waited 3 days after drug application to measure locomotion. *

      Response: These doses were chosen initially as they were similar to the doses that induced mitochondrial elongation in HeLa cells and were tolerated by the fish larvae. As we saw promising effects at these initial doses, we decided to explore them in more detail. While we agree that it would be worth comparing the effects of additional doses, as well as looking at their effects at other timepoints, such work would be a major endeavour and is beyond the scope of our initial investigations, which we feel are worth reporting in their current state.

      With respect to the treatment paradigm, fish larvae were treated 10-48 hours post fertilization, as this is a critical neurogenic developmental timepoint that is often used for exposure studies. Fish do not fully hatch until 3-4 days post fertilization, and display only minimal movement before 5 days, which is why we waited until 5 days to look at movement.

      • Do the shank3b+/- larvae show any morphological deficits? Their decrease in locomotion is striking. Is the morphology also rescued by drug application? Can you tie this to the mitochondrial changes that you observed in HeLA cells?*

      Response: We do not observe any gross changes in fish morphology that might explain a decrease in locomotion. Unfortunately, it is not feasible to look at mitochondrial morphology in the fish at this time. However, based on previously published work showing that the ketogenic diet promotes mitochondrial elongation in mouse brains (PMID: 32380723), we would expect mitochondrial morphology also to be changed in the fish. Nonetheless, as we have not examined this directly in fish, we are not making this specific claim in this manuscript.

      • In figure 6A you use time spent swimming as a readout of distance. This doesn't really make sense, because without also showing speed of swimming it is not possible to know whether time and distance correlate in the same way across genotypes. This figure could be improved by showing more detail - speed of swimming, time spent immobile etc. This can easily be extracted from the films that you have already made using the ViewPoint software. *

      Response: As requested, we have reanalyzed the zebrafish movement data for a more refined analysis. In the revised version (Fig 6), we include analysis of both speed and distance travelled within a defined time. Importantly, these findings still support differences between WT and shank3b+/- fish that are restored by BHB and NMN to varying degrees.
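
      As a rough illustration of the kind of readouts mentioned above (distance travelled, swimming speed, and time spent immobile), the sketch below derives all three from a time-stamped position track. It is not the authors' analysis pipeline or the ViewPoint software's export format: the function name, the `(t, x, y)` track layout, the units, and the immobility threshold are all assumptions made for the example.

```python
import math

def locomotion_metrics(track, immobile_speed=0.5):
    """Summarise one larva's movement (hypothetical helper).

    track: list of (t_seconds, x_mm, y_mm) samples in time order (assumed layout).
    immobile_speed: speed in mm/s below which a step counts as immobile (assumed).
    Returns (total distance in mm, mean speed in mm/s, time immobile in s).
    """
    distance = 0.0
    immobile = 0.0
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        step = math.hypot(x1 - x0, y1 - y0)  # straight-line distance for this step
        dt = t1 - t0
        distance += step
        if dt > 0 and step / dt < immobile_speed:
            immobile += dt
    duration = track[-1][0] - track[0][0]
    mean_speed = distance / duration if duration > 0 else 0.0
    return distance, mean_speed, immobile


# Invented example track: move, pause for one second, move again.
example = [(0, 0, 0), (1, 3, 4), (2, 3, 4), (3, 6, 8)]
distance, mean_speed, immobile = locomotion_metrics(example)
print(distance, mean_speed, immobile)
```

      Reporting distance and speed together, rather than time spent swimming alone, is what lets a reader check whether two genotypes that swim for the same duration actually cover the same ground.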

      • Showing a change in locomotion is not enough to claim that a model is autism-like. At a minimum I think that you need to show changes in social behaviour - likely using older fish (more than three weeks) that interact with each other. Changes in locomotion can be caused by so many factors, many of which are not indicative of autism. It is important that as a field we do not simply claim that locomotion can be used as a proxy for more complex disease phenotypes. This recent review may help you with this point:* https://www.frontiersin.org/articles/10.3389/fnmol.2020.575575/full.

      Response: The reviewer makes an important point that the movement behaviour phenotypes that we see do not necessarily represent classic ASD phenotypes (i.e., repetitive behaviour, reduced sociability, and reduced communication). To begin to address this issue, we analyzed thigmotaxis, which can be a measure of anxiety. Notably, we also see differences that are reversed by BHB and NMN. However, we cannot model all ASD behaviours in a fish model, and we are not set up to look at social behaviour, especially in the young fish that we were studying. As such, even though Shank3 is a recognized ASD gene, and the shank3b+/- model we are studying is a validated ASD model (PMID: 29619162), we have re-phrased the manuscript in the context of neurodevelopment generally, rather than with respect to ASD specifically. As such, we ascribe the movement and thigmotaxis phenotypes as neurodevelopmental phenotypes that are improved by BHB and NMN.
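
      Thigmotaxis is commonly summarised as the share of time an animal spends near the wall of its arena. The sketch below shows one simple way such a score could be computed for a circular well; the zone definition (an inner zone covering half the well radius), the function name, and the coordinates are assumptions for illustration and are not taken from the manuscript.

```python
import math

def thigmotaxis_fraction(positions, center, well_radius, inner_fraction=0.5):
    """Fraction of samples spent in the outer zone of a circular well.

    positions: equally spaced (x, y) samples for one larva (assumed layout).
    center, well_radius: geometry of the well, in the same units as positions.
    inner_fraction: share of the radius treated as the inner zone (assumed 0.5).
    """
    if not positions:
        raise ValueError("positions must not be empty")
    cx, cy = center
    inner_radius = well_radius * inner_fraction
    outer = sum(1 for x, y in positions
                if math.hypot(x - cx, y - cy) > inner_radius)
    return outer / len(positions)


# Invented example: two samples near the centre, two near the wall.
samples = [(0, 0), (1, 0), (6, 0), (0, 7)]
print(thigmotaxis_fraction(samples, center=(0, 0), well_radius=8))
```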

      *For the statistics, as far as I can tell, all of the data should be analysed by ANOVA or the non-parametric equivalent followed by a post-hoc test. Please check this and add information about normality in. *

      Response: As requested, we have clarified our statistical methodology throughout the manuscript.

      For the mechanistic data, we used t-tests for direct comparisons between two groups (e.g., vehicle vs. treatment). While multiple conditions such as vehicle, NMN, BHB, or etomoxir were tested, statistical comparisons were only conducted between the vehicle and each treatment group individually. As we are not also making comparisons between treatments, this is not a multiple comparison, and ANOVA is not applicable in this context. We have clarified this rationale in the manuscript to avoid any confusion.

      For the zebrafish study, where multiple factors were involved (e.g., treatments across different time points or conditions), we performed a two-way ANOVA followed by Tukey's post-hoc test to identify specific group differences. This approach was appropriate for analyzing these datasets and ensures robust conclusions.

      With respect to normality testing, all datasets were assessed for normality using the Shapiro-Wilk test, and no violations of normality were observed. The updated text now includes these details.
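
      For readers less familiar with the vehicle-versus-treatment comparison described above, the sketch below computes a two-sample t statistic and its degrees of freedom using only the Python standard library. The response does not say whether Student's or Welch's variant was used; Welch's (unequal variances) is assumed here, the function name is hypothetical, and the numbers are invented. Converting the statistic to a p-value requires the t distribution, which is left to a statistics package.

```python
import math
from statistics import mean, variance

def welch_t(treatment, vehicle):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom.

    Both inputs are lists of measurements with at least two values each.
    Assumption: unequal variances (Welch), which the source does not specify.
    """
    va = variance(treatment) / len(treatment)  # squared standard error, group a
    vb = variance(vehicle) / len(vehicle)      # squared standard error, group b
    t = (mean(treatment) - mean(vehicle)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(treatment) - 1) + vb ** 2 / (len(vehicle) - 1)
    )
    return t, df


# Invented values, one call per treatment, each compared against the same vehicle group.
vehicle = [2.0, 4.0, 6.0]
treated = [1.0, 2.0, 3.0]
print(welch_t(treated, vehicle))
```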

      *Minor comments

      1. Make sure that you refer to the fish line as shank3b+/- throughout - see abstract.*

      This has been corrected.

      • Please add a space between all numbers and units (e.g. 5 Mm). *

      This has been corrected.

      • There is a spelling error on line 340 page 16: finings instead of findings. *

      This has been corrected.

      • In figure 1, if each dot represents a different sample, then there appear to be many fewer samples analysed in 1D compared to 1B. Can you comment upon this please*

      Response: A total of 80-150 cells were counted per condition, and the analyses were performed on 3 independent replicates with 2 independent technical replicates for each treatment condition. The mean mitochondrial branch length in Figure 1B was quantified using ImageJ and the MiNA plugin. The measurements were taken from three independent replicates using a standard region of interest (ROI) and randomly selected areas from each image.

      In Figure 1D, NAD+ levels were measured 24 hours after treatment of vehicle, βHB, NMN, or Eto+βHB in HeLa cells (n=3-6/group). Each sample lysate represents an independent experimental dish from which coverslips were collected for image analysis.

      The difference in sample numbers between Figure 1B and 1D arises because image analysis involves individual cells fixed and stained on coverslips, whereas the NAD assay requires the whole lysate from the entire cell culture dish. Therefore, the higher cell count in Figure 1B represents the number of cells analyzed on coverslips, while Figure 1D represents NAD levels from the lysate normalized to the protein concentration.

      *Reviewer #3 (Significance (Required)):

      I think that this will be interesting to autism researchers and it could lead to more investigation of the ketogenic diet. Some more work is needed, likely in other model organisms, before this research can be translated to human patients. *

      Response: We agree that the findings of our study could be of interest to autism researchers and have implications for further investigation of the ketogenic diet (KD). It is important to note that further work, including studies in other model organisms, would be beneficial before translating this research to human patients.

      Our study aimed to provide mechanistic insights into the effects of the KD on mitochondrial morphology and behavior. We recognize that the translation of research findings to human patients requires rigorous investigation, including preclinical and clinical studies. Our study contributes to the understanding of the underlying mechanisms involved in the KD's effects, laying the groundwork for future research and potential therapeutic avenues.

      We appreciate your perspective and emphasize that our intention is to provide valuable insights into the mechanisms underlying the KD's effects rather than suggesting immediate translation to human patients. Further investigation and validation in diverse models and clinical settings will be necessary before considering clinical applications.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #3

      Evidence, reproducibility and clarity

      In this paper, Uddin and colleagues have investigated components of the ketogenic diet to understand changes in both mitochondrial morphology and protein expression, and zebrafish locomotor behaviour. They investigate whether beta-hydroxybutyrate (BHB) or nicotinamide nucleotide (NMN) application can later human mitochondria in HeLA cell lines, and also recue a locomotion defect in shank3b+/- zebrafish larvae that have previously been proposed as a model for autism. This study is strengthened by showing data from two species; however the link between the HeLA cell line data and larval zebrafish is not strong. The study would be improved by assessing zebrafish mitochondrial changes after drug application, and testing more than one concentration of BH and NMN in the behavioural assay.

      This is an interesting study, and it is nicely written and presented. I have made some comments to strengthen the study below.

      Major comments

      My expertise is in modelling some aspects of autism in zebrafish. To this end I have focussed on the zebrafish part of this manuscript more fully. I have several comments related to the zebrafish experiments.

      1. The changes in mitochondrial morphology, peroxisome number and mitochondrial protein levels were measured in HeLA cells and not comparable data is shown for zebrafish. The same experiments should be repeated using larval zebrafish or a zebrafish cell line.
      2. How did you choose the concentration of BHB and NMN to use in behavioural experiments? And the timing of application - I don't really understand why you waited 3 days after drug application to measure locomotion.
      3. Do the shank3b+/- larvae show any morphological deficits? Their decrease in locomotion is striking. Is the morphology also rescued by drug application? Can you tie this to the mitochondrial changes that you observed in HeLA cells?
      4. In figure 6A you use time spent swimming as a readout of distance. This doesn't really make sense, because without also showing speed of swimming it is not possible to know whether time and distance correlate in the same way across genotypes. This figure could be improved by showing more detail - speed of swimming, time spent immobile etc. This can easily be extracted from the films that you have already made using the ViewPoint software.
      5. Showing a change in locomotion is not enough to claim that a model is autism-like. At a minimum I think that you need to show changes in social behaviour - likely using older fish (more than three weeks) that interact with each other. Changes in locomotion can be caused by so many factors, many of which are not indicative of autism. It is important that as a field we do not simply claim that locomotion can be used as a proxy for more complex disease phenotypes. This recent review may help you with this point: https://www.frontiersin.org/articles/10.3389/fnmol.2020.575575/full.
      6. For the statistics, as far as I can tell, all of the data should be analysed by ANOVA or the non-parametric equivalent followed by a post-hoc test. Please check this and add information about normality in.

      Minor comments

      1. Make sure that you refer to the fish line as shank3b+/- throughout - see abstract.
      2. Please add a space between all numbers and units (e.g. 5 Mm).
      3. There is a spelling error on line 340 page 16: finings instead of findings.
      4. In figure 1, if each dot represents a different sample, then there appear to be many fewer samples analysed in 1D compared to 1B. Can you comment upon this please?

      Significance

      I think that this will be interesting to autism researchers and it could lead to more investigation of the ketogenic diet. Some more work is needed, likely in other model organisms, before this research can be translated to human patients.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review)

      The weaknesses are in the clarity and resolution of the data that forms the basis of the model. In addition to whole embryo morphology that is used as evidence for convergent extension (CE) defects, two forms of data are presented, co-expression and IP, as well as a strong reliance on IF of exogenously expressed proteins. Thus, it is critical that both forms of evidence be very strong and clear, and this is where there are deficiencies; 1) For vast majority of experiments general morphology and LWR was used as evidence of effects on convergent extension movements rather than Keller explants or actual cell movements in the embryo. 2) The study would benefit from high or super resolution microscopy, since in many cases the differences in protein localization are not very pronounced. 3) The IP and Western analysis data often show subtle differences, and not apparent in some cases. 4) It is not clear how many biological repeats were performed or how and whether statistical analyses were performed. 

      (1) To more objectively assess the convergent extension phenotypes, we developed a Fiji macro to automatically quantify the LWR in various injected Xenopus embryos, as detailed in the Methods section. We acknowledge that a limitation in the current manuscript is how to link our mechanistic model at the molecular level with the actual cellular behavior during convergent extension, and we plan to perform cell biological studies in the future to elucidate the link;

      (2) We have repeated some of the imaging experiments in DMZ explants using a Zeiss LSM 900 confocal equipped with Airyscan2 detector that can increase the resolution to ~100 nm. The new data are in Suppl. Fig. 4, 9, 11, 16;

      (3) We have repeated all IP and western blots at least three times and provided quantification and statistical analyses;

      (4) We have added the information on biological repeats and statistical analyses in all figures and figure legends.
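
      To make the length-to-width ratio (LWR) readout in point (1) concrete, the sketch below shows one generic way an LWR can be computed from the pixel coordinates of a segmented embryo: find the major axis from the second moments of the shape, then compare the extent along that axis with the extent across it. This is not the authors' Fiji macro, whose details are given only in their Methods; the function name and the input format are assumptions.

```python
import math

def length_width_ratio(points):
    """Length-to-width ratio of a shape given as (x, y) pixel coordinates.

    Assumed approach: orient the shape by its principal (major) axis,
    then divide the extent along that axis by the extent across it.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # major-axis orientation
    c, s = math.cos(theta), math.sin(theta)
    along = [(x - mx) * c + (y - my) * s for x, y in points]
    across = [-(x - mx) * s + (y - my) * c for x, y in points]
    length = max(along) - min(along)
    width = max(across) - min(across)
    if width == 0:
        raise ValueError("shape has zero width")
    return length / width


# Invented example: a 40 x 10 pixel rectangle standing in for an embryo mask.
mask = [(x, y) for x in range(41) for y in range(11)]
print(length_width_ratio(mask))
```

      Because the major axis is estimated from the shape itself, the ratio does not depend on how the embryo happens to be oriented in the image.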

      Reviewer #2 (Public Review):

      The protein localization experiments in animal cap assays are for the most part convincing, but with the caveat that the authors assume that the proteins are acting within the same cell. As Fzd and Vangl2 are thought to localize to opposite cell ends in many contexts, can the authors be sure that the effects they observe are not due to trans interactions? 

      In our previous publication, we provided evidence that Vangl is necessary and sufficient to recruit Dvl to the plasma membrane within the same cell (Figure 3 in 10.1093/hmg/ddx095). In a more recent publication (10.1038/s41467-025-57658-0), we further elucidated a mechanism through which Dvl oligomerization switches its binding from Vangl to Fz, and determined that Dvl binding to Vangl and Fz is differentially mediated by its PDZ and DEP domains, respectively. In the current manuscript, we also performed co-IP experiments under various conditions to demonstrate binding between Dvl and Vangl. We feel that this evidence, taken together, provides a strong argument for our model where Vangl2 acts within the same cell to sequester Dvl from Fz.

      With regard to the Dvl patches induced by Wnt11 (Fig. 3 and Suppl. Fig. 9), we performed separate injections of EGFP- and mSc-tagged Dvl into adjacent blastomeres, and demonstrated that the Wnt11-induced patches arise from symmetrical accumulation of Dvl at the contact between two neighboring cells (Suppl. Fig. 9a-c’). This scenario is different from epithelial PCP, where Fz/Dvl and Vangl/Pk are asymmetrically accumulated at the contact between two adjacent cells.

      The authors propose a model whereby Vangl2 acts as an adaptor between Dvl and Ror, to first prevent ectopic activation of signaling, and then to relay Dvl to Fzd upon Wnt stimulation. This is based on the observation that Ror2 can be co-IPed with Vangl2 but not Dvl; and secondly that the distribution of Ror2 in membrane patches after Wnt11 stimulation is broader than that of Fzd7/Dvl, while Vangl2 localizes to the edges of these patches. The data for both these points is not wholly convincing. The co-IP of Ror2 and Vangl2 is very weak, and the input of Dvl into the same experiment is very low, so any direct interaction could have been missed. Secondly, the broader distribution of Ror2 in membrane patches is very subtle, and further analysis would be needed to firm up this conclusion. 

      (1) We repeated the co-IP experiment with Myc-tagged Vangl or Dvl. Using the same anti-Myc antibody and experimental conditions (including the expression levels of Vangl, Dvl and Ror2), we still found that Ror2 could be pulled down by Vangl but not Dvl (Suppl. Fig. 15b). Although these data confirm our previous conclusion, we acknowledge that a negative result does not fully exclude the possibility of direct binding between Ror and Dvl.

      (2) We re-analyzed the signal intensity of Dvl and Ror in Wnt11-induced patches. By quantifying the intensity ratio between Ror and Dvl along the patches, we found a more than two-fold increase at the border of the patches (Fig. 7j, bottom panel). We interpret these data to suggest that Ror accumulates to a higher level than Dvl at the patch borders.
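
      The ratio analysis in point (2) can be pictured as follows: take the two fluorescence intensity profiles sampled along a patch, divide one by the other position by position, and compare the ratio at the patch ends with the ratio in the middle. The sketch below is a minimal version of that idea, not the authors' quantification; the function names, the background handling, the border width, and the intensity values are all assumptions.

```python
def ratio_profile(ror, dvl, background=0.0):
    """Per-position Ror/Dvl intensity ratio along a patch (hypothetical helper).

    ror, dvl: equal-length intensity profiles sampled along the same line.
    background: value subtracted from both channels (assumed uniform).
    """
    if len(ror) != len(dvl):
        raise ValueError("profiles must have the same length")
    ratios = []
    for r, d in zip(ror, dvl):
        if d <= background:
            raise ValueError("Dvl signal must exceed background at every position")
        ratios.append((r - background) / (d - background))
    return ratios


def border_enrichment(ratios, border_width):
    """Mean ratio at the two patch ends divided by the mean ratio in the centre."""
    if border_width < 1 or 2 * border_width >= len(ratios):
        raise ValueError("border_width leaves no border or no centre")
    border = ratios[:border_width] + ratios[-border_width:]
    center = ratios[border_width:-border_width]
    return (sum(border) / len(border)) / (sum(center) / len(center))


# Invented profiles: Ror stays high at the patch ends where Dvl drops off.
ror = [40, 30, 30, 30, 40]
dvl = [10, 15, 15, 15, 10]
print(border_enrichment(ratio_profile(ror, dvl), border_width=1))
```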

      A final caveat to these experiments is that in the animal cap assays, loss of function and gain of function both cause convergence and extension defects, so any genetic interactions need to be treated with caution i.e. two injected factors enhancing a phenotype does not imply they act in the same direction in a pathway, in particular as there are both cis/trans and positive/negative feedbacks between the PCP proteins. 

      We agree with the reviewer that a difficulty in studying PCP/ non-canonical signaling is that both loss and gain of function of any of its components can cause convergence and extension defects. Genetic interactions, especially synergistic interactions, should be interpreted with caution. But we do want to point out that, in a number of cases, we were also able to demonstrate epistasis. For instance, we found that Dvl2 over-expression induced CE defects can be rescued by Pk over-expression (Fig. 1e and f), whereas Vangl/ Pk co-injection induced severe CE defects can be reciprocally rescued by Dvl2 over-expression (Fig. 1g). Likewise, we showed that Fz2/ Dvl2 co-injection induced CE defects can be rescued by wild-type Vangl2 but not the Vangl2 RH mutant (Suppl. Fig. 6b), and Ror2 can rescue the Vangl2 overexpression induced CE defect (Suppl. Fig. 14). Collectively, these functional interaction data consistently demonstrate an antagonism between Dvl/ Fz/ Ror2 and Vangl2/ Pk, which correlates with our imaging and biochemical studies.

      As you can see from the reviews, the referees generally agree that your paper is a potentially valuable contribution to the field. Your observations are important because of the novel model based on the inhibitory feedback regulation between planar cell polarity (PCP) protein complexes. However, the reviewers also stated that the model is only partly supported by data because of insufficient clarity and missing controls in several experiments supporting the proposed model. The paper would be significantly improved if your conclusions are backed up by additional experimentation. Specifically, the referees wanted to see the reproducibility of the results shown in Figures 3, 4, 8, S3, S7, S12. 

      We hope that you are able to revise the paper along the lines suggested by the referees to increase the impact of your study on the current understanding of PCP signaling mechanisms. 

      We thank the reviewers for their careful reading of our manuscript and for their constructive critiques and suggestions. We have repeated the animal cap studies in original Figures 3, 4, 8 and S3 with DMZ explants, and the new data are in Supplementary Fig. 9, 11, 16 and 4, respectively. We also repeated the biochemical studies in original Figures S7 and S12, and the new data are in Supplementary Fig. 8 and 15.

      Reviewer #1 (Recommendations For The Authors):

      Major points: (1) The author conducted an analysis of the subcellular localization of PCP core proteins, including Vangl2, Pk, Fz, and Dvl, within animal cap explants (ectodermal explants). To validate the model proposing that 'non-canonical Wnt induces Dvl to transition from Vangl to Fz, while PK inhibits this transition, and they function synergistically with Vangl to suppress Dvl during Convergent Extension (CE),' it is crucial to assess the subcellular localization of PCP core proteins in dorsal marginal zone (DMZ) cells, which are known to undergo CE. Notably, the overexpression of Wnt11 alone, as employed by the author, does not induce animal cap elongation. Therefore, the use of animal cap explants may not be sufficient to substantiate the model during Convergent Extension (CE). Indeed, previous knowledge indicates that Vangl2 and Pk localize to the anterior region in DMZ explants. However, the results presented in this manuscript appear to differ from this established understanding. Consequently, to provide more robust support for the proposed model, it is advisable to replicate the key experiments (Figures 3, 4, 8, and Figure S3) using DMZ explants.

      We repeated the experiments in Figures 3, 4, 8 and Figure S3 with DMZ explants, and the new data are in new Supplementary Fig. 9, 11, 16 and 4, respectively. In regards to “previous knowledge indicates that Vangl2 and Pk localize to the anterior region in DMZ explants”, we are aware of Vangl/ Pk localization to the anterior cell cortex in neural epithelium from the studies by the Sokol and Wallingford labs, but are not aware of similar reports in DMZ explants. When we examined the localization of a small amount of injected EGFP-mPk2 (0.1 ng mRNA) in DMZ explants, we saw a somewhat uniform distribution on the plasma membrane (Suppl. Fig. 4). In addition, in a related recent publication, we examined endogenous XVangl2 protein localization in activin-induced animal cap explants that do undergo CE. What we observed was that whereas low-level injected Dvl2 and Fz form clusters on the plasma membrane, endogenous XVangl2 remains uniformly distributed on the plasma membrane (Suppl. Fig. 3S-Z in 10.1038/s41467-025-57658-0). These observations may suggest potential differences in PCP protein localization during neural vs. mesodermal convergence and extension.

      (2) The author suggests that 'Vangl2 and Pk together synergistically disrupt Fz7-Dvl2 patches.' As shown in Figure 4 (panels J' to I'), it is evident that the co-expression of Pk and Vangl2 increases Fz7 endocytosis. Nevertheless, a significant amount of Fz7 still co-localizes with Dvl2. To strengthen the author's hypothesis, additional clear assay is required such as Fluorescence resonance energy transfer (FRET) assay. 

      We appreciate this valuable advice. Since none of the tagged Fz/ Dvl/ Vangl proteins we had were suitable for FRET, we made proteins tagged with mClover and mRuby2, which were reported as optimized FRET pairs. But in our hands mRuby2 seems to require a very long time (~2 days) to mature and become detectable at room temperature, and is not suitable for our Xenopus experiments. We are in the process of establishing a luciferase-based NanoBiT system to detect Fz-Dvl and Dvl-Vangl interactions in live cells and cell lysates, and will use it in future studies to investigate their interaction dynamics.

      For the current manuscript, we reason that a substantial reduction of Fz7-Dvl2 clusters with Vangl2/ Pk co-injection would still support our idea that Vangl2 and Pk act synergistically to sequester Dvl from Fz to prevent their clustering in response to non-canonical Wnt ligands.

      (3) The IP data is less clear and evident. A couple of examples are: a) Fig 2g where the authors report that the Vangl2 R177H variant reduced Vangl2 interaction with Pk and recruitment of Pk to the plasma membrane, but it appears that the variant interacts slightly better than WT Vangl2 with Pk. In Fig. S7a, the authors state that Pk overexpression can indeed significantly reduce Wnt11-induced dissociation of EGFP-Vangl2 and Flag-Dvl2 in the DMZ. However, there is a minimal impact when compared to the Wnt11 absent control. Based on the results presented in Fig S12a the authors indicate that Wnt11 reduces the association between Vangl2 and Dvl2, which can be discerned, but loss of Ror2 does not change this in any obvious way - but the authors indicate it does. In S12b, the authors have suggested that Ror and Dvl do not form a direct binding interaction. However, the interpretation of Figure S12b is not entirely convincing due to several issues. Notably, the expression levels of each protein appear inconsistent, the bands are not sufficiently clear, and there is the detection of three different tag proteins on a single blot. To strengthen the validity of these findings, it is advisable to repeat this experiment with improved quality. 

      We repeated all the co-IP and western blot analyses pointed out by the reviewer, and performed quantification and statistical analyses.

      Fig 2g had a mistake in the labeling and is replaced with new Figure 2g;

      Fig. S7a is replaced by new data in Supplementary Figure 8a and b;

      Fig. S12a and S12b are replaced by new data in Supplementary Figure 15a, a’ and b, respectively. In 15a and a’, we noticed a consistent decrease of Dvl2-Vangl2 co-IP in Xror2 morphants. The reason for this is not yet clear and will need further study in the future.

      Minor points: (1) In all the whole embryo injection assays examining morphology, no Western analysis is performed to show roughly equivalent and appropriate levels of the various proteins are being expressed. Differences will affect the data. 

      Although we did not do western analyses to examine the protein levels in various functional interaction assays, we did examine how co-expression of Vangl2, mPk2 or Dvl2 may impact each other’s protein levels in Supplementary Fig. 2, which did not reveal any significant change when co-injected in different combinations.

      (2) The author's prior publication (Bimodal regulation of Dishevelled function by Vangl2 during morphogenesis, Hum Mol Genet. 2017) presented clear evidence of Vangl2 overexpression inducing Dvl2 membrane localization. However, Figure S4 in the current manuscript did not provide clear evidence of membrane localization. To strengthen the hypothesis that Vangl2-RH mutant also induces Dvl2 membrane localization, further comprehensive imaging analysis is needed. 

      We re-analyzed the imaging data and replaced old Figure S4 with a new Supplementary Fig. 5.

      (3) In Supplementary Figure 9, the authors propose that the overexpression of Vangl2/Pk induces Fz7 endocytosis, as indicated by its co-localization with FM4-64. However, it raises a question: how does the Fz7-GFP protein internalize into the cells without endocytosis, as seen in Figures S9a-c'? To enhance readers' understanding, a discussion addressing this point should be included. 

      We think that this might be a technical issue. As detailed in the Methods section, we incubated the embryos only transiently with FM4-64 for 30 minutes, and the embryos were subsequently washed and dissected in 0.1X MMR without the dye. Therefore, only the Fz7-GFP protein endocytosed during the 30-minute incubation would be labeled by FM4-64, whereas protein endocytosed before or after the incubation would not. Alternatively, the very few Fz7-GFP puncta occasionally observed in the absence of Vangl2/Pk overexpression could be vesicles trafficking to the plasma membrane.

      (4) Statistical analyses are absent for several results, including those in Figure 2f, Figure S4d, and Figure S7b. 

      We repeated these experiments and included statistical analyses. The new data are in Figure 2f, Supplementary Fig. 5d and Supplementary Fig. 8b.

      (5) This manuscript lacks any results regarding Ck1. Therefore, it is advisable to consider removing the discussion or mention of CK1. 

      We agree. We have toned down the discussion of CK1 and removed CK1 from our model in Fig. 9.

      Reviewer #2 (Recommendations For The Authors):

      (1) In all the convergence and extension assays, the authors should report n numbers (i.e. number of animals), what statistical test is used, and what the error bars show. Ideally dot-plots would be used instead of bar charts as they give a better insight into the data distribution. It might be useful to give a section on the statistical analyses used in the M&M, including e.g. any power calculations carried out, as now required by many journals. 

      We have followed the advice to use dot-plots for all the quantification analyses in the manuscript. The figure legends now state the statistical test used and what the error bars show. The number of embryos analyzed is included in each panel in the figures. We also provide more details in the Methods section on how the LWR quantification was carried out.

      (2) I think Figure 2g is wrongly labelled? FLAG bands are in all three lanes in the western blot, but not labelled as such in the schematic. 

      We corrected the schematic labeling in Figure 2g, and thank the reviewer for catching this mistake.

      (3) In Figure S7, the authors show that co-IP of Dvl and Vangl2 is reduced by Wnt11 and the effects of Wnt are blocked by Pk. Does Pk have any effect in the absence of Wnt? 

      We examined the effect of Pk over-expression on Dvl2-Vangl2 co-IP as advised, and did not see a significant impact in the absence of Wnt11 co-injection. The data is included in the new Supplementary Figure 8a. We interpret the data to suggest that “at least under the condition of our co-IP experiment, Pk may not directly impact the steady-state binding between Vangl and Dvl”.

      (4) In Figure 3, the authors show (as published previously) that Wnt11 induces patches of Dvl at the plasma membrane. It would be useful to see Dvl in the absence of Wnt and Vangl2/Dvl in the absence of Wnt. 

      Dvl is widely known as a cytoplasmic protein and its localization has been published by many labs over the past 20-30 years. In our recent publication (10.1038/s41467-025-57658-0), we also re-examined Dvl localization when injected at various dosages. We therefore did not feel it was necessary to show its localization in the absence of Wnt11 again, but included a reference to our prior publication. With regard to Vangl/Dvl distribution in the absence of Wnt11, the readers can see Suppl. Fig. 5b as an example, in addition to our previous publications referenced in the manuscript.

      (5) In the review figures, the difference in Fz7-GFP patch formation in d' and e' (vs e.g. a') is not very clear. Could the images be improved or (better) quantified in some way? 

      We assume that “review figures” refer to Figure 3 or 4? If so, we felt that Fz7-GFP patch formation was clear in Fig. 3d’, e’ or Fig. 4d’, e’. Nevertheless, we repeated these experiments in DMZ explants as advised by Reviewer 1, and additional examples of Fz7-EGFP patch formation can be seen in the new Suppl. Fig. 9d-f’ and Suppl. Fig. 11d-f’.

      (6) In Figure 6d, I'm concerned that the loss of flag-Dvl2 might occur via dephosphorylation in the IP reaction. Also the M&M don't include methodological details about buffers and whether phosphatase inhibitors were used. A compelling control would be anti-FLAG pulldown showing retention of phosphorylation. Also Figure 6f shows a reduced ratio of fast-to-slow migrating bands of Dvl with Vangl2/Pk - unless I have misunderstood, is this ratio the wrong way round? 

      We added co-IP buffer and protease inhibitor information in Methods.

      We agree that the concern about dephosphorylation during the IP reaction is valid, and that direct pull down of Dvl to show the phosphorylated form is a compelling control. We therefore note that in Suppl. Fig. 8a and 15b, direct pull down of Flag-Dvl or Myc-Dvl (with anti-Flag or anti-Myc) did show the slower migrating, phosphorylated form. Additional examples in which Vangl co-IPs only the faster migrating, unphosphorylated Dvl include Suppl. Fig. 15a and Fig. 3R and R’ of a related paper we published recently (10.1038/s41467-025-57658-0).

      Finally, we did wrongly label Figure 6f in the last submission, and the ratio should have been “slow/fast”. We have made the correction, and thank the reviewer for the meticulous reading of our manuscript.

      (7) In Figure 7, what does Ror2 look like in the absence of Wnt11? 

      We included new Figure 7a-c to show that without Wnt11 co-injection, Ror2 is uniformly distributed on the plasma membrane.

      (8) Also in Figure 7, Ror2 patches are said to be slightly wider than Dvl2 patches "reminiscent of Vangl2" - I wouldn't describe them as being similar. Vangl2 shows a distinct dip in the center of the Dvl patches, Ror2 does not show a dip, and is only (at best) in a slightly wider patch, and I would want to see further examples to be convinced that the localization domain is reproducibly wider. The merge of many samples in 7d may actually be making the distribution harder to see and if the Xror2 and Dvl2 intensities were normalized I'm not sure how different the curves would appear. (i.e. the Xror2 curve looks like a flattened version of the Dvl2 curve). 

      We have added a panel in the new Figure 7j that compares the Ror2/Dvl2 intensity ratio along the patches; this analysis reveals a more than two-fold increase of the ratio at the border region. This quantification may make a more convincing argument that, at the patch border region, Dvl is diminished whereas Ror2 accumulates with Vangl2.
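
      As a rough illustration of the kind of ratio analysis described above, the following minimal sketch computes a point-by-point ratio of two line-scan intensity profiles and compares the patch border with the patch center. The function names, the choice of two border positions per side, and the intensity values are all hypothetical and are not taken from the manuscript.

```python
def ratio_profile(ror2, dvl2, eps=1e-9):
    """Point-by-point Ror2/Dvl2 intensity ratio along one line scan."""
    if len(ror2) != len(dvl2):
        raise ValueError("profiles must have the same length")
    # eps only guards against division by zero in empty pixels
    return [r / (d + eps) for r, d in zip(ror2, dvl2)]


def border_to_center_fold(ratios, n_border=2):
    """Mean ratio at the outermost positions divided by the mean ratio in the center."""
    if len(ratios) <= 2 * n_border:
        raise ValueError("profile too short for the chosen border width")
    border = ratios[:n_border] + ratios[-n_border:]
    center = ratios[n_border:-n_border]
    return (sum(border) / len(border)) / (sum(center) / len(center))


# Made-up line scans across a single patch (arbitrary units), for illustration only
ror2 = [40, 45, 50, 52, 50, 45, 40]
dvl2 = [20, 30, 60, 65, 60, 30, 20]

ratios = ratio_profile(ror2, dvl2)
fold = border_to_center_fold(ratios)
print(f"border/center fold change of the ratio: {fold:.2f}")
```

      Working with the ratio rather than the two raw curves removes the dependence on the absolute brightness of either channel, which is the normalization concern raised by the reviewer.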

      (9) In Figure S12a, the authors suggest Wnt11 induced dissociation of Dvl from Vangl2 (by co-IP), and this is reduced after Ror2 MO. This would be more convincing with replicates and quantitation. 

      We have repeated this experiment with Vangl2 pull down and added quantification. The data is in the new Suppl. Fig. 15a.

      (10) In Figure S12b, the authors suggest Ror2 can co-IP Vangl2 but not Dvl. This is not very convincing, as the Dvl input band is very weak, and the Vangl2 co-IP band is very weak. 

      We repeated the co-IP experiment with Myc-tagged Vangl or Dvl. Using the same anti-Myc antibody and experimental condition (including the expression level of Vangl, Dvl and Ror2), we still found that Ror2 could be pulled down by Vangl but not Dvl (Suppl. Fig. 15b).

      (11) "Prickle" spelled "Prickel" in the abstract (and abbreviated to "PK" not "Pk" at one place in the abstract and several places in text) 

      We have corrected these typos.

      (12) Quite a lot of interesting observations are in supplemental figures. Normally it might be expected that extra data supporting a conclusion would be in supplemental, but here some of the supplemental data feels like it is more than simply additional evidence. For instance supplemental Figures 2 and 3 feel more than just supplemental (and Supplemental Figure 3 if merged with Figure 2 would make it easier for the reader). Moreover, for example, the description of the results in Figure 2 is punctuated by references to supplemental Figures 4 and 5 that contain key data to support the conclusions, which means the reader has to flick backwards and forwards from place to place in the manuscript to follow the argument. It is of course up to the authors, but in some cases putting supplemental data back into the main figures (for which there is no size or number limit) would increase clarity. 

      These are excellent points; in the resubmitted manuscript we have a total of 24 data figures, and we used 8 as main figures since we felt that they provide the most relevant and conclusive evidence to our model. We will consult the copy editors at eLife on how to arrange the rest as main vs. supporting figures when requesting publication as version of record.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We thank the reviewers for their thoughtful comments and overall very supportive feedback.

      Reviewer #1 writes: "The study is very thorough and the experiments contain the appropriate controls. (...) The findings of the study can have relevance for human conditions involving disrupted mitochondrial dynamics, caused for example by mutations in mitofusins." Reviewer #2 writes: "The dataset is rich and the time-resolved approach strong." Reviewer #3 writes: "I admire the philosophy of the research, acknowledging an attempt to control for the many possible confounding influences. (...) This is a powerful and thoughtful study that provides a collection of new mechanistic insights into the link between physical and genetic properties of mitochondria in yeast."

      We address all points below. We have not yet updated our text and figures since we expect substantial additions from new experiments. But we have included Figure R1 with some additional analyses of existing data at the bottom of the manuscript.

      Reviewer1

      1.1. Statistical comparisons are missing throughout the manuscript (with the exception of Fig. 2c). Appropriate statistical tests, along with p-values, should be used and reported where different groups are compared, for example (but not limited to) Fig. 3d and most panels of Fig. 4.

      We initially decided not to add too many extra labels to the already very busy plots, given that the magnitude of change mostly speaks for itself. However, we will try to find meaningful statistical tests together with a sensible graphical representation for all of the figures. For one example see Figure R1A.

      1.2. I do not agree with the use of Atp6 protein as a direct read-out of mtDNA content. While Atp6 protein levels will decrease with decreasing mtDNA content, the inverse is not necessarily true: decreased Atp6 protein levels do not necessarily indicate decreased mtDNA levels, because they could alternatively or additionally be caused by decreased transcription and/or translation. Therefore, please do not equate Atp6 protein levels to mtDNA levels, and instead rephrase the text referencing the Atp6 experiments in the Results and Discussion sections to measure "mtDNA expression" or "mt-encoded protein" or similar. For example, on p. 14 line 431 should read "mtDNA expression" rather than "decreased synthesis of mtDNA", and line 440 on the same page "mean mtDNA levels" should be "mtDNA expression" or similar.

      All three reviewers agree that using Atp6-NG as a direct proxy for mtDNA requires more validation, or at least rephrasing of the text. We agree that this is the most important point to address. We had previously tried using the mtDNA LacO array (Osman et al. 2015) to directly assess the number of nucleoids per cell. However, the altered mitochondrial morphology of the Fzo1-depleted cells, combined with LacI-GFP that remains in mitochondria even when mtDNA is gone, increases the noise to a point where we cannot interpret the signal. While this manuscript was in the submission process, the Schmoller lab (co-authors #2 and #7) adapted the HI-NESS system to label mtDNA in live yeast cells (Deng et al. 2025). This system promises a much better signal-to-noise ratio, and we expect that it will allow us to address all concerns regarding the actual count of nucleoids per cell. Should this unexpectedly fail for technical reasons, we will try to calibrate the Atp6 levels with DAPI staining at defined time points and will rephrase the text as the reviewer suggests.

      1.3. In Fig. 3, the authors use the fluorescence intensity of a mitochondrially-targeted mCardinal as a read-out of mitochondrial mass. Please provide evidence that this is not affected by MMP, either with relevant references or by control experiments (e.g. comparing it to N-acridine orange or other MMP-independent dyes or methods).

      Whether or not the import of any mitochondrial protein depends on the MMP is largely determined by the signal sequence. The preSu9 signal sequence was previously characterized as largely independent of the MMP compared to other presequences (Martin, Mahlke, and Pfanner 1991), which is why Vowinckel (Vowinckel et al. 2015) and others (Di Bartolomeo et al. 2020; Perić et al. 2016; Ebert et al. 2025) have previously used it as a neutral reference to the strongly MMP-dependent pre-Cox4 signal to estimate MMP. As one control in our own data, we note that the population-averaged mitochondrial fluorescent signal (Figure S3C) stays constant in the first few hours, in agreement with the total averaged mitochondrial proteome (Fig R1E). As additional controls, we plan to compare the signal to an MMP-independent dye as the reviewer suggests.

      1.4. In Fig. 2e-f, the authors use a promoter reporter with Neongreen to answer whether the reduced levels of the nuclear-encoded mitochondrial proteins Mrps5 and Qcr7 are due to decreased expression or to protein degradation, and find no evidence of degradation of the Neongreen reporter protein. However, subcellular localization might affect the availability of the protein to proteases. Although not absolutely required, it would be relevant to know if the Neongreen fusion protein is found in the same subcellular compartment as Mrps5 and Qcr7 at 0h and 9h after Fzo1 depletion.

      Here, it seems we need to explain the set-up and interpretation of the data better. The key point we are trying to make with the promoter-Neongreen construct is that the regulation is not mainly at the level of transcription. We are showing that the reduction in the levels of the actual protein (orange bars) is not (mainly) explained by a reduction in expression, since the promoter is similarly active at 0 and at 9 hours (grey bars). If expression from the promoter were strongly reduced, the Neongreen would be diluted with growth and would also decrease, but this is not the case. The fluorophore itself is just floating around in the cytosol and is not subject to the same post-translational regulation as Mrps5 and Qcr7, so there is no reason to expect degradation.
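
      The dilution argument above can be made concrete with a short back-of-the-envelope sketch. It assumes a perfectly stable reporter, a complete stop of synthesis at t = 0, and exponential growth; the 1.5 h doubling time is an illustrative assumption and not a value from the manuscript.

```python
def remaining_fraction(hours, doubling_time_h):
    """Fraction of a stable reporter left per cell if synthesis stops at t = 0.

    With no new synthesis and no degradation, the per-cell amount halves
    with every cell doubling: c(t) = c(0) * 2 ** (-t / T).
    """
    return 0.5 ** (hours / doubling_time_h)


# Assumed doubling time of 1.5 h (hypothetical value, for illustration only)
for t in (0, 3, 6, 9):
    print(f"{t} h: {remaining_fraction(t, 1.5):.4f} of the initial reporter level")
```

      Under these assumptions, even a stable reporter would fall to a few percent of its starting level within 9 hours if the promoter were switched off, so a reporter signal that stays roughly constant argues for continued promoter activity.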

      1.5. Fzo1 depletion leads to a very rapid drop in MMP during the first hour of depletion. In the Discussion, can the authors speculate on the possible mechanism of this rapid MMP drop that occurs well before mtDNA or mt-encoded proteins are decreased in level?

      This is indeed an interesting point. We think there are likely three reasons for this initial drop. First, because of the fragmentation, the mixing of mitochondrial content is disturbed and smaller fragments may have suboptimal stoichiometry of components (see also Khan et al. 2024, who look at this in detail, including the Fzo1 deletion). Second, already fairly early, some mitochondrial fragments may not contain any mtDNA and will therefore be unable to synthesize ETC proteins. Third, altered morphological features such as changes in the surface-to-volume ratio may play a role. Unfortunately, following up on this mechanistically is not possible with the tools available to us and is therefore outside the scope of this manuscript, but we are happy to include these speculations in our discussion.

      1.6. In Fig. 2a, the mtDNA copy number of Fzo1-depleted cells is ca 1.3-fold of the control cells at the 0h timepoint. Why might this be? Is it an impact of one of the inducers? If so, we might be looking at the combination of two different processes when measuring copy number: one that is an induction caused by the inducer(s), and the other a consequence of Fzo1 depletion itself.

      We believe that this 30% increase is within the noise of the experiment rather than an effect of the induction. Since we normalize to t=0 uninduced, the first black data point does not have error bars, which emphasizes this difference. None of the protein data suggests that there is an increase in mtDNA-encoded proteins (see e.g. 2B, or the Atp6 fluorescence data). In the planned HI-NESS experiment, we will see in our single-cell data whether there is an actual increase in mtDNA upon TIR induction. Additionally, we will run a qPCR to carefully determine mtDNA levels of untreated wild-type cells, tetracycline-treated wild-type cells and tetracycline-induced TIR-expressing cells to exclude effects of tetracycline as well as of TIR expression on mtDNA.
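
      The point about the missing error bars on the reference sample can be illustrated with a small sketch. It assumes that each replicate is divided by its own uninduced t=0 value; the replicate numbers and dictionary keys are invented for illustration and are not measurements from the manuscript.

```python
from statistics import mean, stdev

# Made-up raw copy-number readings from three replicates (arbitrary units)
replicates = [
    {"control_0h": 100.0, "depleted_0h": 150.0},
    {"control_0h": 80.0, "depleted_0h": 88.0},
    {"control_0h": 120.0, "depleted_0h": 144.0},
]


def normalize_to_reference(reps, ref_key):
    """Divide every value in a replicate by that replicate's own reference value."""
    return [{k: v / rep[ref_key] for k, v in rep.items()} for rep in reps]


normalized = normalize_to_reference(replicates, "control_0h")

ctrl = [rep["control_0h"] for rep in normalized]
dep = [rep["depleted_0h"] for rep in normalized]

ctrl_mean, ctrl_sd = mean(ctrl), stdev(ctrl)
dep_mean, dep_sd = mean(dep), stdev(dep)

print(f"control 0 h:  mean={ctrl_mean:.2f}, sd={ctrl_sd:.2f}")
print(f"depleted 0 h: mean={dep_mean:.2f}, sd={dep_sd:.2f}")
```

      Because the reference is 1 by construction in every replicate, all of the replicate-to-replicate variation is shifted onto the other samples, so an apparent offset next to a reference point without error bars can look more definite than it is.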

      Minor comments:

      1.7. p. 3, line 71: "ten thousands of dividing cells.." should be "tens of thousands of dividing cells".

      Thank you, will correct.

      1.8.-p.4, line 116: please be even more clear with what the "depleted" cells and controls are treated with: are depleted cells treated with both inducers, and controls with neither?

      We will make this clearer. Depleted cells are treated with both inducers; the control cells are treated with neither. In addition, Figure 1A and Figure S1 contain controls showing that inducing TIR per se or adding aTC per se does not change growth rate or mitochondrial morphology. We will state this explicitly in the text.

      1.9. -p.5, lines 147-148: the authors write "the rate with which the abundance of Cox2 and Var1 proteins decreases was similar to the rate of mtDNA loss" though the actual rate is not shown. Please calculate and show rates for these processes side by side to make comparison possible, or alternatively rephrase the statement.

      Indeed, this was not phrased well. We will refer to dynamics rather than rates.

      1.10. -Fig. 2d: changing the y-axis numbering to match those in panels a and b would facilitate comparisons.

      Makes sense, we will change this.

      1.11. Fig. 2e: it is recommended to label the western blot panels to indicate what protein is being imaged in each (Neongreen, Mrps5, Qcr7).

      We will adapt the labelling to make it more clear.

      1.12. -p.9, line 262: I suggest referencing Fig. 4e at the end of the first sentence for clarity.

      We will modify the sentence as suggested.

      1.13. -In the sections related to Fig. 3a and Fig. 5a as well as the connected supplemental data, the authors discuss both the median and the mean of mitochondrial mass and Atp6 protein, respectively. For purposes of clarity, I suggest decreasing the focus on the mean (that is provided only in the supplemental data) and focusing the text mainly on the median. The two show differing trends and it is very good that both are shown, but the clarity of the text can be improved by focusing more on the median where possible.

      We will check the phrasing and simplify.

      1.14. -p. 14, line 435: the statement that mt mass is maintained over the first 9h of depletion is only true for the mean mt mass, not for the median. Please make this clear or rephrase.

      We will check the phrasing, make it clearer, and also point out the extended proteomics data (see Fig R1), which correspond to the mean of the populations.

      1.15.-p.14, line 452: "mitofusions" should be "mitofusins".

      Thanks for catching this.

      Reviewer 2:

      2.1. While inducible TIR is used to reduce background, the manuscript should rigorously exclude auxin/TIR off-targets (growth, mitochondrial phenotypes, gene expression). Please include full matched controls: ±auxin, ±TIR, epitope tag alone, and a degron control on an unrelated mitochondrial membrane protein.

      We agree that rigorous controls are crucial for the interpretation of the results. However, we think we have already included most of the controls the reviewer is asking for, although we may not have pointed this out clearly enough. For example, in Fig 1A we could add labels indicating to which samples we added aTC, which is currently described only in the figure legend.

      Here is a list of all the controls:

      • Each depletion experiment is always matched with an experiment of the same strain without induction. So the genetic background as well as effects such as light exposure, time spent in the microfluidics systems, etc are controlled for.
      • Figure S1D shows that the growth rate is wild-type-like in a strain containing either the AID tag or the TIR protein AND upon addition of both chemicals. It also shows that the final genetic background (AID tag and TIR) grows like wild type if the inducers are not added. This conclusively shows that neither the tags/constructs nor the chemicals per se affect growth rate.
      • In Figure S1C we show the mitochondrial morphology of the same controls. We will make sure to label them more consistently to match panel D, and include an actual wild type and a FLAG-AID-Fzo1 strain without TIR treated with both aTC and 5-Ph-IAA as a direct comparison.
      • In Figure 1A we compare the Fzo1 protein levels of a strain with and without TIR. We show that in the absence of TIR, adding either aTC or auxin does not change Fzo1 levels, and that the levels are comparable in the strain that is able to deplete Fzo1 directly before addition of 5-Ph-IAA (after 2 h of induction of TIR through addition of tetracycline).
      • Additionally, in Figure S2C we show that two hours after adding aTC, the entire proteome does not change significantly apart from a strong induction of TIR. We can also make this clearer in the figure legend.
      • Additionally, we will run a qPCR to carefully determine mtDNA levels of untreated wild-type cells, tetracycline-treated wild-type cells and tetracycline-induced TIR-expressing cells to exclude effects of tetracycline as well as of TIR expression on mtDNA (also in response to 1.6).

      In summary, we think we have controlled sufficiently for all confounding parameters and, most importantly, have shown that addition of either aTC or auxin, as well as the FLAG-AID tag per se, does not disturb mitochondria or cell growth. We do not see what a degron control on an unrelated protein would tell us. Depending on the nature of the protein, it may or may not have a phenotype that may or may not be related to morphology changes.

      2.2. The Mitoloc preSu9 vs Cox4 import ratio is only a proxy of mitochondrial membrane potential (ΔΨm) and itself depends on mitochondrial mass, protein expression, matrix ATP, and import saturation. The authors need to calibrate ΔΨm with orthogonal dyes (TMRE/TMRM) and pharmacologic titrations (FCCP/antimycin/oligomycin) to generate a response curve; show that Mitoloc tracks dye-based ΔΨm across the relevant range and corrects for mass/photobleaching. Report single-cell ΔΨm vs mass residuals.

      We completely agree that the MitoLoc system is only a rough proxy for the actual membrane potential. That is why we make no quantitative claims about the absolute value or the absolute difference between groups of cells. We also make very clear in Fig 3B what we are actually measuring, and can emphasize again in the text that this is only a proxy. We agree that it is a good idea to compare MitoLoc values to TMRE staining as the reviewer suggests, and we will do these experiments in depleted and control cells at different time points. Please note, though, that dye staining also has its caveats, especially in dynamic live-cell experiments. TMRM, for example, is not compatible with the acidic pH 5 medium that is typically used for yeast, and subjecting cells to washing steps and higher pH may change both the morphology of mitochondria and the MMP, especially in cells that are already “stressed”. We prefer not to carry out elaborate pharmacological titration experiments because, first, this was done extensively in the original MitoLoc paper by the Ralser lab (Vowinckel et al. 2015; cited 120 times) and, second, the value of the MMP is not the most critical claim of the manuscript. See also 3.12. Please note that in Figure S4D we had already plotted MMP vs mitochondrial concentration.

      2.3. To use Atp6-mNeon as a proxy for mtDNA is an assumption. Interpreting Atp6 intensity as "functional mtDNA" could be confounded by translation, turnover, or assembly. Please (i) report mtDNA copy number time courses (you have qPCR), nucleoid counts (DAPI/PicoGreen or TFAM/Abf2 tagging), and (ii) assess translation (e.g., 35S-labeling or puromycin proxies) and turnover (proteasome/AAA protease inhibition, mitophagy mutants -some data are alluded to- plus mRNA levels for mtDNA-encoded genes). This will support the "reduced synthesis" versus "increased degradation" conclusion.

      We agree with all three reviewers that Atp6 is only a proxy for mtDNA (Jakubke et al. 2021; Roussou et al. 2024) and that the correlation should be checked more carefully. We will use the very recently established HI-NESS system to follow nucleoids/mtDNA during depletion experiments. See the detailed reply to 1.2.

      (ii) In Figure 2C we inhibit mitochondrial translation and show that in this case control and depleted cells have the same level of Cox2, at least suggesting that degradation is not the key mechanism controlling the levels of mtDNA-encoded proteins. We cannot do proteasome inhibitor assays, since the AID-TIR system requires an active proteasome. In Figure S5C we show that the Atp6 depletion is similar in an atg32 deletion. This does not completely exclude a contribution of mitophagy to the observed phenotype, but does confirm that mitophagy is not the primary reason for cells becoming petite.

      2.4. The promoter-NeonGreen reporters argue against transcriptional down-regulation of nuclear OXPHOS. Please add mRNA (RT-qPCR/RNA-seq) for representative genes and a pulse-chase or degradation-pathway dependency (e.g., proteasome/mitophagy/autophagy mutants) to firmly assign active degradation. The authors need to normalize proteomics to mitochondrial mass (e.g., citrate synthase/porin) to separate organelle abundance from protein turnover.

      While we are happy to perform qPCR experiments for selected genes, a full RNA-seq experiment seems outside the scope of this study. As explained above, a proteasome inhibitor experiment is not possible in this set-up. Bulk mitophagy/autophagy seems unlikely to be the cause of the decrease of the nuclear-encoded OXPHOS proteins, since most other mitochondrial proteins do not decrease on average at the population level in the first hours. These data are now plotted as an additional figure (see below) and will be included in the supplementary material of the revised manuscript (Fig R1E).

      2.5. Using preSu9-mCardinal intensity as "mitochondrial concentration" is sensitive to expression, import competence, and morphology/segmentation. The authors should provide validation that this metric tracks 3D volume across fragmentation states (e.g., correlation with mito-GFP volumetrics; detergent-free CS activity; TOMM20/Por1 immunoblot per cell).

      We agree that this is an important point, and the co-authors discussed it quite intensively. In Figure S3A and B we show (using confocal data) that there is a very strong correlation between the total fluorescence signal and the 3D volume reconstruction. However, the slope of the correlation differs between tubular and fragmented mitochondria (compare panels A and B and see the figure legend). Since we are dealing with diffraction-limited objects, it is likely that the 3D reconstruction is sensitive to morphology, especially if mitochondria are “clumping”. We therefore think that the total fluorescence signal is actually a better estimate of mitochondrial mass per cell than the 3D volume reconstruction (especially for our data obtained with a conventional epifluorescence microscope). The mean of the total mitochondrial fluorescence also better matches the population-average mitochondrial proteome (Fig R1E). To consolidate this assumption, we will additionally compare our data to a strain with Tom70-Neongreen and to MMP-independent dyes.

      Notably, since the morphology is similarly altered in mothers and buds this is of minor impact for our main point – the unequal distribution between mother and buds.

      2.6. The unequal mother-daughter distribution is compelling, but causality remains inferred. Test whether modulating inheritance machinery (actin cables/Myo2, Num1, Mmr1) or altering fission (Dnm1 inhibition) modifies segregation defects and rescues mtDNA/Atp6 decline. Complementation with Fzo1 re-expression at defined times would help order the phenotype cascade.

      We agree that rescue experiments would be very useful. We have some preliminary data for tether experiments, for example with Num1. The general problem is that the fragmented mitochondria clump together. We have not found a method to restore an equal distribution between mother and daughter cells. We will try to optimize the assay, but are not overly confident it will work. Mmr1 deletion aggravates the Fzo1 phenotype, likely also because the distribution becomes even more heterogeneous, but we have not rigorously analyzed this.

      We like the idea of the Fzo1 re-expression and will run such experiments. This will be especially powerful in combination with the new HI-NESS mtDNA reporter. We may be able to track exactly when cells reach the point of no return and become petite. This will also help connect our mathematical model more directly to the data.

      2.7. The model is useful but should include parameter sensitivity (segregation variance, synthesis slopes, initial nucleoid number) and prospective validation (e.g., predict rescue upon partial restoration of synthesis or inheritance, then test experimentally).

      We will refine our model to include the to-be-measured nucleoids/mtDNA values. We will include a parameter sensitivity analysis with the updated model.

      Reviewer 3:

      3.1. About the use of Atp6 as a good proxy for mtDNA content. This is assumed from l285 onwards, based on a previous publication. As the link is fairly central to part of the paper's arguments, and the system in this study is being perturbed in several different ways, a stronger argument or demonstration that this link remains intact (and unchanged, as it is used in comparisons) would seem important.

      We agree, see 1.2.

      3.2. About confounding variables and processes. The study does an admirable job of being transparent and attempting to control for the many different influences involved in the physical-genetic link. But some remain less clearly unpacked, including some I think could be quite important. For example, there is a lot of focus on mito concentration -- but given the phenotypes are changing the sizes of cells, do concentration changes come from volume changes, mito changes, or both? In "ruling out" mitophagy -- a potentially important (and intuitive) influence, the argument is not presented as directly as it could be and it's not completely clear that it can in fact be ruled out in this way. There are a couple of other instances which I've put in the smaller points below.

      Thank you for acknowledging our efforts to show transparent and well-controlled experiments! We address each of the specific points below.

      3.3. full genus name when it first appears

      We will add the full name.

      3.4. I may be wrong here, but I thought the petite phenotype more classically arises from mtDNA deletion mutations, not loss? The way this is phrased implies that mtDNA loss is [always] the cause. Whether I'm wrong on that point or not, the petite phenotype should be described and referenced.

      We can expand the text and cite additional relevant papers. The term “petite” refers to any strain that is respiratory incompetent and leads to small colonies (not necessarily small cells!) (Seel et al. 2023). This can be mutations or gene loss (fragments) on the mtDNA (these are called cytoplasmic petite), or chemically induced loss of mtDNA (e.g. EtBr), or mutations of nuclear genes required for respiration (these are termed nuclear petite; some nuclear petites show loss of mtDNA in addition to the mutation in the nuclear genome) (Contamine and Picard 2000).

      3.5. para starting l59 -- should mention for context that mitochondria in (healthy, wildtype) yeast are generally much more fused than in other organisms

      ok.

      3.6. Fig 1C -- very odd choice of y-axis range! either start at zero or ensure that the data fill as much vertical space of the plot as possible

      True, this was probably some formatting relic. We will adapt the axis to fill the full space. Most of our axes start at 0, but that doesn’t make so much sense here, since we consider the solidity in the control as “baseline”.

      3.7. "wild-type like more tubular mitochondria" reads rather awkwardly. "more tubular mitochondria (as in the wild-type)"?

      Thank you, sounds better.

      3.8. l106 -- imaging artefacts? are mitos fragmenting because of photo stress? -- this is mentioned in l577-8 in the Methods, but the data from the growth rate and MMP comparison isn't given -- an SI figure would be helpful here. It would be reassuring to know that mito morphology wasn't changing in response to phototoxicity too.

      In the methods we just briefly point out that we have done all our “due diligence” controls to check that we do not generate phototoxicity, something that we highlight in the cited review. We do not explicitly have a figure for this, but figure S1A shows that the solidity of the mitochondrial network in control cells stays the same over 9 hours, even though these cells are exposed to the same cultivation and imaging regime as the depleted cells. We will also add a picture of control cells after 9 h. In S1B we show that control cells containing TIR but no AID tag treated with both chemicals imaged over 9 hours also show the same solidity (~mitochondrial morphology) as untreated control. Also, the doubling times of cells grown in our imaging system (Fig R1B) are very similar to the shake flask (Fig R1A). All in all, we are very confident that our imaging settings did not impact our reported phenotypes.

      3.9. para l146 -- so this suggests mtDNA-encoded proteins have a very rapid turnover, O(hours) -- is this known/reasonable?

      Reference (Christiano et al. 2014) suggests that respiratory chain proteins are shorter lived than the average yeast protein. However, based on Figure 2C we think the dynamics mostly speak for a dilution by growth.

      3.10. section l189 -- it's hard to reason fully about these statistics of mitochondrial concentration given that the petite phenotype is fundamentally affecting overall cell volume. can we have details on the cell size distribution in parallel with these results? to put it another way -- how does mitochondrial *amount* per cell change?

      This is a good point. We report mostly on mitochondrial “concentrations” because we think this is what the cell actually cares about (mitochondrial activity in relationship to cytosolic activity). But we will include additional graphs on mitochondrial amount as well as size distributions (Fig R1C, related to Fig 4F). We can already point out that the size distribution of the population does not change much in the first hours. The “petite” phenotype refers to small colonies on growth medium with limited supply of a fermentable carbon source, not to smaller size of single cells.

      3.11. l199 the mean in Fig S3C certainly does change -- it increases, clearly relative both to control and to its initial value. rather than sweeping this under the carpet we should look in more detail to understand it (a consequence of the increased skew of the distribution)?

      This relates somewhat to the previous point. The increase in average concentration is not due to an increased amount in the population, but due to the fact that it is the small buds that get a very high amount of the mitochondria which “exaggerates” the asymmetric/heterogenous distribution. This will be clarified by the figures we mention in the point above.

      3.12. para line 206 -- this doesn't make it clear whether your MMP signal is integrated over all mitochondria in the cell, or normalised by mitochondrial content? this matters quite a lot for the interpretation if the distributions of mitochondrial content are changing. reading on, this is even more important for para line 222. Reading further on, there is an equation on l612 that gives a definition, but it doesn't really clarify (apologies if I'm misunderstanding).

      For each cell, we basically calculate the relative mitochondrial enrichment of the MMP sensitive vs the MMP insensitive pre-sequence.

      So, MMP= (total intensity of mitochondrial pre-Cox4 Neongreen/ total intensity of mitochondrial pre-Su9 Cardinal) / (total cytosolic pre-Cox4 Neongreen/ total cytosolic pre-Su9 Cardinal).

      We calculate this value for each cell, but we do not have the optical resolution to calculate it for individual mitochondrial fragments.

      Both constructs are driven by the same strong promoter, so transcription of the fluorophore should never limit the uptake. Also, in Figure 3D we compare control and depleted cells with similar total mitochondrial concentration, so the difference must be due to a different import of the two fluorophores, see also Fig S4D. The calculated “MMP” value is of course only a crude proxy for the actual membrane potential in millivolts and we do not want to make any claims on absolute values or quantitative differences. But essentially what we are interested in is “mitochondrial health/activity” and we think the system is good at reporting this. See also 2.2.
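      As an illustration of the per-cell ratio defined above, the calculation could be sketched as follows. This is a minimal sketch, not the authors' analysis code: the function name, argument names, and the assumption that the four inputs are background-corrected total intensities for a single cell are all hypothetical.

```python
def mmp_proxy(mito_cox4, mito_su9, cyto_cox4, cyto_su9):
    """Per-cell MMP proxy: mitochondrial enrichment of pre-Cox4 Neongreen
    relative to pre-Su9 Cardinal, normalised by the same ratio in the cytosol.

    All four arguments are assumed (hypothetically) to be background-corrected
    total intensities measured in one cell.
    """
    if min(mito_su9, cyto_cox4, cyto_su9) <= 0:
        raise ValueError("denominator intensities must be positive")
    mito_ratio = mito_cox4 / mito_su9   # reporter ratio inside mitochondria
    cyto_ratio = cyto_cox4 / cyto_su9   # reporter ratio in the cytosol
    return mito_ratio / cyto_ratio


# Illustrative (made-up) intensities for a single cell
print(mmp_proxy(mito_cox4=200.0, mito_su9=100.0, cyto_cox4=10.0, cyto_su9=10.0))
```

      Because numerator and denominator are both ratios of the two reporters, the value is dimensionless and does not scale with total mitochondrial content. That is why it is computed per cell rather than per mitochondrial fragment.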

      3.13. l230 -- a point of personal interest -- low mito concentrations are connected to low "function" (MMP) and give extended division times -- this is interestingly exactly the model needed to reproduce observations in HeLa cells (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002416). That model went on to predict several aspects of downstream cellular behaviour -- it would be very interesting to see how compatible that picture (parameterised using HeLa observations) is with yeast!

      Thank you for pointing out your interesting paper, which we will include in our discussion. Another recent preprint about fission yeast (Chacko et al. 2025) also fits into this picture. Since you were kind enough to disclose your identity, we would be happy to follow up and discuss this further with you in person.

      3.14. l239 "less mitochondria" -- a bit tricky but I'd say "fewer mitochondria" or "less mitochondrial content"

      Thanks, we will think about how best to rephrase this, probably as “less mitochondrial content”.

      3.15. Section l234 So here (and in Fig 4) the focus is on overall distributions of mitochondrial concentration in different cells (mother-to-be, mother, bud; gen 1, gen >1). But we've just seen that one effect of fzo1 is to broaden the distribution of mitochondrial concentration across cells. Can't we look in more depth at the implications of this heterogeneity? For example in Fig 4F (which is cool) we look at the distribution of all fzo1 mothers-to-be, mothers, and buds. But this loses information about the provenance. For example, do mothers-to-be with extremely low mito concentrations just push everything to the bud, while mothers-to-be with high mito concentrations distribute things more evenly? It would seem very easy and very interesting to somehow subset the distribution of mothers-to-be by concentration and see how different subsets behave

      This is a good point. When analyzing the data, we pretty much plotted everything against everything and then chose the graphs that we think will best guide the reader through the story-line. We can make additional supplementary plots where we show the starting concentrations/amounts of the mother in relationship to the resulting split ratio at the end of the cycle (Fig R1D).

      3.16. l285 -- experimental design -- do we know that Atp6 will continue to be a good proxy for functional mtDNA in the face of the perturbations provided by Fzo1 depletion? Especially if there is impact on the expression of mitoribosomes, the relationship between mtDNA and Atp6 may look rather different in the mutant?

      This is actually our top-priority experiment now. We will use the HI-NESS system and possibly DAPI staining to make a more direct link to mtDNA/ nucleoid numbers, see 1.2.

      3.17. l290 -- ruled out mitophagy. This message could be much clearer. Comparing Fig S5C and Fig 3A side-by-side is a needlessly difficult task -- put Fig 3A into Fig S5. Then we see that when mitophagy is compromised, the distribution of mitochondrial concentration has a lower median and much lower upper quartile than in the mitophagy-equipped Fzo1 mutant? What is going on here? For a paper motivated by disentangling coupled mechanisms, this should be made clearer!

      Thanks for pointing this out. We can of course easily include the control in the corresponding figure. Compromising mitophagy is likely to generally affect mitochondrial health and turnover a little bit, independent of what is going on with Fzo1. The second line of evidence that speaks against large-scale mitophagy is the proteomics data: at the population level, the dynamics of the respiratory chain proteins are very different from those of other (nuclear-encoded) mitochondrial proteins. We will add additional supplementary figures to make this clearer, see Fig R1E. Most mitochondrial proteins in the proteomics experiment stay constant in the first few hours, consistent with the imaging data showing that the mean mitochondrial content of the population does not change initially. This again highlights that it is the unequal distribution which is the problem and not massive degradation of mitochondria.

      3.18. With the Atp6 signal, how do we know that fluorescence from different cells is comparable? Buds will be smaller than mother cells for example, potentially leading to less occlusion of the fluorescent signal by other content in the cytoplasm

      This is of course a general problem that anyone faces doing quantitative fluorescence microscopy. From the technical side, we have done the best we could by taking a reasonable amount of z-slices and by choosing fluorophores that are in a range with little cellular background fluorescence (e.g. Neongreen is much better than GFP). From a practical standpoint, we are always comparing to the control, which is subject to the same technical limitations as the depleted cells and the cell sizes are very similar. So, even if we are systematically overestimating the Atp6 concentration in the bud by a few %, the difference to the control would still be qualitatively true. We therefore do not think that any of our conclusions are affected by this.

      3.19. l343 -- maintenance of mtDNA -- here the point about l285 (is the Atp6-mtDNA relationship the same in the Fzo1 mutant) is particularly important, as we're directly tying findings about the protein product to implications about the mtDNA

      We will carefully address this, see above.

      3.20. l367 -- on a first read this description of the model feels like lots of choices have been made without being fully justified. Why a log-normal distribution (when the fit to the data looks rather flawed); why the choice of 5 groups for nucleoid number (why not 3? or 8?); the process used for parameter fitting is very unclear (after reading the methods I think some of these values are read directly from the data, but the shapes of the distributions remain unexplained). l705 -- presumably the ratio was drawn from a log-normal distribution and then the corresponding nucleoid numbers were rounded to integers? the ratio itself wasn't rounded? (also l367) How were the log-normal distributions fitted to experiments (Figs. S7A,B)? Just by eye?

      We will update our model based on measured nucleoid counts and then explain more stringently the choices we make/ parameters we select.

      3.21. l711 by random selection -- just at random? ("selection" could be confusing) Overall, it feels like the model may be too complicated for what it needs to show. Either (a) the model should show qualitatively that unequal inheritance and reduced production leads to rapid loss -- which a much simpler model, probably just involving a couple of lines of algebra, could show. Or (b) the model should quantitatively reproduce the particular numerical observations from the experiments -- it's not totally clear that it does this (do the cell-cycle-based decay timescales in Fig 7 correspond to the hour-based decay timescales in other plots, for example). At the moment the model is at a (b) level of detail but it's only clear that it's reporting the (a) level of results.

      If the HI-NESS and Fzo1 re-addition experiments work as explained above, all parameters will have direct experimental data, and we should get much closer to (a).
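      To make the level-(a) argument concrete, a toy simulation along the lines discussed above could look like the sketch below. It is not the authors' model: the function names, the starting copy number, the synthesis fraction, and the width of the log-normal split ratio are all hypothetical placeholders chosen only for illustration. The sketch assumes synthesis proportional to the existing nucleoid number, a bud-to-mother split ratio drawn from a log-normal distribution, and rounding of the resulting nucleoid counts (not the ratio) to integers.

```python
import random


def split_nucleoids(total, rng, sigma):
    """Partition `total` nucleoids between mother and bud.

    The bud:mother ratio is drawn from a log-normal distribution with
    median 1; the bud count (not the ratio) is rounded to an integer.
    """
    ratio = rng.lognormvariate(0.0, sigma)
    bud = round(total * ratio / (1.0 + ratio))
    return total - bud, bud


def simulate(n_cycles, start=20, synthesis=0.5, sigma=1.0,
             n_lineages=1000, seed=0):
    """Return the fraction of lineages without nucleoids after each cycle.

    All default parameter values are hypothetical, not fitted to data.
    """
    rng = random.Random(seed)
    counts = [start] * n_lineages
    empty_fraction = []
    for _ in range(n_cycles):
        next_counts = []
        for n in counts:
            total = n + round(n * synthesis)      # synthesis scales with copies
            mother, bud = split_nucleoids(total, rng, sigma)
            next_counts.append(rng.choice((mother, bud)))  # follow one daughter
        counts = next_counts
        empty_fraction.append(sum(c == 0 for c in counts) / n_lineages)
    return empty_fraction


if __name__ == "__main__":
    # Compare reduced synthesis with unequal splits against a doubling,
    # evenly splitting reference (both parameter sets are illustrative).
    print(simulate(10, synthesis=0.5, sigma=1.0))
    print(simulate(10, synthesis=1.0, sigma=0.1))
```

      In this sketch a lineage that reaches zero nucleoids can never recover, so the empty fraction can only stay constant or grow from one cycle to the next. Setting `sigma` to 0 gives perfectly even splits and makes a convenient reference case.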

      3.22. A lot of the discussion repeats the results; depending on editorial preferences some of this text could probably be pared back to focus on the literature connections and context.

      We will think about streamlining the discussion once some of the additional material alluded to above has been added.

      3.23. Data availability -- it looks like much of the data required to reproduce the results is not going to be made available. Images and proteomic data are promised, but the data associated with mitochondrial concentration and other features are not mentioned. For FAIR purposes all the data (including statistics from analysis of the images) should be published.

      We maybe didn’t phrase this clearly. All data will be made available. Where technically feasible, this will be directly accessible in a repository, otherwise by request to the corresponding author.

      On our OMERO server, we have deposited many TB of raw images as well as all the intermediate steps such as segmentation masks, and the CSV files with all the extracted data for each cell (including background corrections etc.). Additionally, we can include CSV files with the data grouped in the way that we used to generate all the box plots etc. As of now, the OMERO data is unfortunately only available by requesting a personal guest login from our bioinformatics facility, but we were promised that with the next technical update there will be a public link available. The proteomics data and the model are already fully accessible. The raw western blot images with corresponding Ponceau staining will be included with the final publication either as additional supplementary material or in whatever format matches the journal requirements.

      3.24 l660 -- can an overview of the EM protocol be given, to avoid having to buy the Mayer 2024 article?

      The cited paper is open access. But we can also include more details in our method section.

      References:

      Chacko, L. A., H. Nakaoka, R. Morris, W. Marshall, and V. Ananthanarayanan. 2025. 'Mitochondrial function regulates cell growth kinetics to actively maintain mitochondrial homeostasis', bioRxiv.

      Christiano, R., N. Nagaraj, F. Frohlich, and T. C. Walther. 2014. 'Global proteome turnover analyses of the Yeasts S. cerevisiae and S. pombe', Cell Rep, 9: 1959-65.

      Contamine, V., and M. Picard. 2000. 'Maintenance and integrity of the mitochondrial genome: a plethora of nuclear genes in the budding yeast', Microbiol Mol Biol Rev, 64: 281-315.

      Deng, Jingti, Lucy Swift, Mashiat Zaman, Fatemeh Shahhosseini, Abhishek Sharma, Daniela Bureik, Francesco Padovani, Alissa Benedikt, Amit Jaiswal, Craig Brideau, Savraj Grewal, Kurt M. Schmoller, Pina Colarusso, and Timothy E. Shutt. 2025. 'A novel genetic fluorescent reporter to visualize mitochondrial nucleoids', bioRxiv: 2023.10.23.563667.

      Di Bartolomeo, F., C. Malina, K. Campbell, M. Mormino, J. Fuchs, E. Vorontsov, C. M. Gustafsson, and J. Nielsen. 2020. 'Absolute yeast mitochondrial proteome quantification reveals trade-off between biosynthesis and energy generation during diauxic shift', Proc Natl Acad Sci U S A, 117: 7524-35.

      Ebert, A. C., N. L. Hepowit, T. A. Martinez, H. Vollmer, H. L. Singkhek, K. D. Frazier, S. A. Kantejeva, M. R. Patel, and J. A. MacGurn. 2025. 'Sphingolipid metabolism drives mitochondria remodeling during aging and oxidative stress', bioRxiv.

      Jakubke, C., R. Roussou, A. Maiser, C. Schug, F. Thoma, R. Bunk, D. Horl, H. Leonhardt, P. Walter, T. Klecker, and C. Osman. 2021. 'Cristae-dependent quality control of the mitochondrial genome', Sci Adv, 7: eabi8886.

      Khan, Abdul Haseeb, Xuefang Gu, Rutvik J. Patel, Prabha Chuphal, Matheus P. Viana, Aidan I. Brown, Brian M. Zid, and Tatsuhisa Tsuboi. 2024. 'Mitochondrial protein heterogeneity stems from the stochastic nature of co-translational protein targeting in cell senescence', Nature Communications, 15: 8274.

      Martin, J., K. Mahlke, and N. Pfanner. 1991. 'Role of an energized inner membrane in mitochondrial protein import. Delta psi drives the movement of presequences', J Biol Chem, 266: 18051-7.

      Osman, C., T. R. Noriega, V. Okreglak, J. C. Fung, and P. Walter. 2015. 'Integrity of the yeast mitochondrial genome, but not its distribution and inheritance, relies on mitochondrial fission and fusion', Proc Natl Acad Sci U S A, 112: E947-56.

      Perić, Matea, Peter Bou Dib, Sven Dennerlein, Marina Musa, Marina Rudan, Anita Lovrić, Andrea Nikolić, Ana Šarić, Sandra Sobočanec, Željka Mačak, Nuno Raimundo, and Anita Kriško. 2016. 'Crosstalk between cellular compartments protects against proteotoxicity and extends lifespan', Scientific Reports, 6: 28751.

      Roussou, Rodaria, Dirk Metzler, Francesco Padovani, Felix Thoma, Rebecca Schwarz, Boris Shraiman, Kurt M. Schmoller, and Christof Osman. 2024. 'Real-time assessment of mitochondrial DNA heteroplasmy dynamics at the single-cell level', The EMBO Journal, 43: 5340-59.

      Seel, A., F. Padovani, M. Mayer, A. Finster, D. Bureik, F. Thoma, C. Osman, T. Klecker, and K. M. Schmoller. 2023. 'Regulation with cell size ensures mitochondrial DNA homeostasis during cell growth', Nat Struct Mol Biol, 30: 1549-60.

      Vowinckel, J., J. Hartl, R. Butler, and M. Ralser. 2015. 'MitoLoc: A method for the simultaneous quantification of mitochondrial network morphology and membrane potential in single cells', Mitochondrion, 24: 77-86.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #3

      Evidence, reproducibility and clarity

      This article addresses the connection between perturbed mitochondrial structure and genetics in yeast. When mitochondrial fusion is compromised, what is the chain of causality -- the mechanism -- that leads to mtDNA populations becoming depleted? This is a fascinating question, linking physical cell biology to population genetics. I admire the philosophy of the research, acknowledging and attempting to control for the many possible confounding influences. The manuscript describes the context and the research tightly and digestibly; the figures illustrate the results in a clear and natural way.

      For transparency, I am Iain Johnston and I am happy for this review to be treated as public domain. To my eyes my most important shortcoming as a reviewer is my relative lack of familiarity with the yeast fzo1 mutant; while I am familiar with analysis of yeast mito morphology and mtDNA segregation, a reviewer familiar with the nuances of this strain and its culture would be a useful complement.

      I have a few more general points and a collection of smaller points below that I believe might help make the story more robust.

      General points

      1. About the use of Atp6 as a good proxy for mtDNA content. This is assumed from l285 onwards, based on a previous publication. As the link is fairly central to part of the paper's arguments, and the system in this study is being perturbed in several different ways, a stronger argument or demonstration that this link remains intact (and unchanged, as it is used in comparisons) would seem important.
      2. About confounding variables and processes. The study does an admirable job of being transparent and attempting to control for the many different influences involved in the physical-genetic link. But some remain less clearly unpacked, including some I think could be quite important. For example, there is a lot of focus on mito concentration -- but given the phenotypes are changing the sizes of cells, do concentration changes come from volume changes, mito changes, or both? In "ruling out" mitophagy -- a potentially important (and intuitive) influence, the argument is not presented as directly as it could be and it's not completely clear that it can in fact be ruled out in this way. There are a couple of other instances which I've put in the smaller points below.

      Smaller points

      l47 full genus name when it first appears

      l58 I may be wrong here, but I thought the petite phenotype more classically arises from mtDNA deletion mutations, not loss? The way this is phrased implies that mtDNA loss is [always] the cause. Whether I'm wrong on that point or not, the petite phenotype should be described and referenced.

      para starting l59 -- should mention for context that mitochondria in (healthy, wildtype) yeast are generally much more fused than in other organisms

      Fig 1C -- very odd choice of y-axis range! either start at zero or ensure that the data fill as much vertical space of the plot as possible

      l105 "wild-type like more tubular mitochondria" reads rather awkwardly. "more tubular mitochondria (as in the wild-type)"?

      l106 -- imaging artefacts? are mitos fragmenting because of photo stress? -- this is mentioned in l577-8 in the Methods, but the data from the growth rate and MMP comparison isn't given -- an SI figure would be helpful here. It would be reassuring to know that mito morphology wasn't changing in response to phototoxicity too.

      para l146 -- so this suggests mtDNA-encoded proteins have a very rapid turnover, O(hours) -- is this known/reasonable?

      section l189 -- it's hard to reason fully about these statistics of mitochondrial concentration given that the petite phenotype is fundamentally affecting overall cell volume. can we have details on the cell size distribution in parallel with these results? to put it another way -- how does mitochondrial amount per cell change?

      l199 the mean in Fig S3C certainly does change -- it increases, clearly relative both to control and to its initial value. rather than sweeping this under the carpet we should look in more detail to understand it (a consequence of the increased skew of the distribution)?

      para line 206 -- this doesn't make it clear whether your MMP signal is integrated over all mitochondria in the cell, or normalised by mitochondrial content? this matters quite a lot for the interpretation if the distributions of mitochondrial content are changing. reading on, this is even more important for para line 222. Reading further on, there is an equation on l612 that gives a definition, but it doesn't really clarify (apologies if I'm misunderstanding).

      l230 -- a point of personal interest -- low mito concentrations are connected to low "function" (MMP) and give extended division times -- this is interestingly exactly the model needed to reproduce observations in HeLa cells (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002416). That model went on to predict several aspects of downstream cellular behaviour -- it would be very interesting to see how compatible that picture (parameterised using HeLa observations) is with yeast!

      l239 "less mitochondria" -- a bit tricky but I'd say "fewer mitochondria" or "less mitochondrial content"

      Section l234 So here (and in Fig 4) the focus is on overall distributions of mitochondrial concentration in different cells (mother-to-be, mother, bud; gen 1, gen >1). But we've just seen that one effect of fzo1 is to broaden the distribution of mitochondrial concentration across cells. Can't we look in more depth at the implications of this heterogeneity? For example in Fig 4F (which is cool) we look at the distribution of all fzo1 mothers-to-be, mothers, and buds. But this loses information about the provenance. For example, do mothers-to-be with extremely low mito concentrations just push everything to the bud, while mothers-to-be with high mito concentrations distribute things more evenly? It would seem very easy and very interesting to somehow subset the distribution of mothers-to-be by concentration and see how different subsets behave

      l285 -- experimental design -- do we know that Atp6 will continue to be a good proxy for functional mtDNA in the face of the perturbations provided by Fzo1 depletion? Especially if there is impact on the expression of mitoribosomes, the relationship between mtDNA and Atp6 may look rather different in the mutant?

      l290 -- ruled out mitophagy. This message could be much clearer. Comparing Fig S5C and Fig 3A side-by-side is a needlessly difficult task -- put Fig 3A into Fig S5. Then we see that when mitophagy is compromised, the distribution of mitochondrial concentration has a lower median and much lower upper quartile than in the mitophagy-equipped Fzo1 mutant? What is going on here? For a paper motivated by disentangling coupled mechanisms, this should be made clearer!

      With the Atp6 signal, how do we know that fluorescence from different cells is comparable? Buds will be smaller than mother cells for example, potentially leading to less occlusion of the fluorescent signal by other content in the cytoplasm

      l336 -- similar to the Jajoo et al. mechanism in fission yeast -- but are you talking about feedback control of the mtDNA or the protein (or mRNA) product?

      l343 -- maintenance of mtDNA -- here the point about l285 (is the Atp6-mtDNA relationship the same in the Fzo1 mutant) is particularly important, as we're directly tying findings about the protein product to implications about the mtDNA

      l367 -- on a first read this description of the model feels like lots of choices have been made without being fully justified. Why a log-normal distribution (when the fit to the data looks rather flawed); why the choice of 5 groups for nucleoid number (why not 3? or 8?); the process used for parameter fitting is very unclear (after reading the methods I think some of these values are read directly from the data, but the shapes of the distributions remain unexplained). l705 -- presumably the ratio was drawn from a log-normal distribution and then the corresponding nucleoid numbers were rounded to integers? the ratio itself wasn't rounded? (also l367) How were the log-normal distributions fitted to experiments (Figs. S7A,B)? Just by eye? l711 by random selection -- just at random? ("selection" could be confusing) Overall, it feels like the model may be too complicated for what it needs to show. Either (a) the model should show qualitatively that unequal inheritance and reduced production leads to rapid loss -- which a much simpler model, probably just involving a couple of lines of algebra, could show. Or (b) the model should quantitatively reproduce the particular numerical observations from the experiments -- it's not totally clear that it does this (do the cell-cycle-based decay timescales in Fig 7 correspond to the hour-based decay timescales in other plots, for example). At the moment the model is at a (b) level of detail but it's only clear that it's reporting the (a) level of results.

      A lot of the discussion repeats the results; depending on editorial preferences some of this text could probably be pared back to focus on the literature connections and context.

      Data availability -- it looks like much of the data required to reproduce the results is not going to be made available. Images and proteomic data are promised, but the data associated with mitochondrial concentration and other features are not mentioned. For FAIR purposes all the data (including statistics from analysis of the images) should be published.

      l660 -- can an overview of the EM protocol be given, to avoid having to buy the Mayer 2024 article?

      Significance

      This is a powerful and thoughtful study that provides a collection of new mechanistic insights into the link between physical and genetic properties of mitochondria in yeast. Cell biologists, geneticists, and the mitochondrial field will find this of potentially deep interest. Because of the mode and dynamics of inheritance in budding yeast, findings here may not be directly transferrable to other eukaryotes, but these insights are still of interest for researchers outside of yeast for their insight into how this well-studied system manages its mitochondrial populations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This is a manuscript describing outbreaks of Pseudomonas aeruginosa ST 621 in a facility in the US using genomic data. The authors identified and analysed 254 P. aeruginosa ST 621 isolates collected from a facility from 2011 to 2020. The authors described the relatedness of the isolates across different locations, specimen types (sources), and sampling years. Two concurrently emerged subclones were identified from the 254 isolates. The authors predicted that the most recent common ancestor for the isolates can be dated back to approximately 1999 after the opening of the main building of the facility in 1996. Then the authors grouped the 254 isolates into two categories: 1) patient-to-patient; or 2) environment-to-patient using SNP thresholds and known epidemiological links. Finally, the authors described the changes in resistance gene profiles, virulence genes, cell wall biogenesis, and signaling pathway genes of the isolates over the sampling years.

      Strengths:

      The major strength of this study is the utilisation of genomic data to comprehensively describe the characteristics of a long-term Pseudomonas aeruginosa ST 621 outbreak in a facility. This fills the data gap of a clone that could be clinically important but easily missed from microbiology data alone.

      Weaknesses:

      The work would further benefit from a more detailed discussion on the limitations due to the lack of data on patient clinical information, ward movement, and swabs collected from healthcare workers to verify the transmission of Pseudomonas aeruginosa ST 621, including potential healthcare worker to patient transmission, patient-to-patient transmission, patient-to-environment transmission, and environment-to-patient transmission. For instance, the definition given in the manuscript for patient-to-patient transmission could not rule out the possibility of the existence of a shared contaminated environment. Equally, as patients were not routinely swabbed, unobserved carriers of Pseudomonas aeruginosa ST 621 could not be identified and the possibility of misclassifying the environment-to-patient transmissions could not be ruled out. Moreover, reporting of changes in rates of resistance to imipenem and cefepime could be improved by showing the exact p-values (perhaps with three decimal places) rather than dichotomising the value at 0.05. By doing so, readers could interpret the strength of the evidence of changes.

      Impact of the work:

      First, the work adds to the growing evidence implicating sinks as long-term reservoirs for important MDR pathogens, with direct infection control implications. Moreover, the work could potentially motivate investments in generating and integrating genomic data into routine surveillance. The comprehensive descriptions of the Pseudomonas aeruginosa ST 621 clones outbreak is a great example to demonstrate how genomic data can provide additional information about long-term outbreaks that otherwise could not be detected using microbiology data alone. Moreover, identifying the changes in resistance genes and virulence genes over time would not be possible without genomic data. Finally, this work provided additional evidence for the existence of long-term persistence of Pseudomonas aeruginosa ST 621 clones, which likely occur in other similar settings.

      We thank the reviewer for their thorough evaluation of our work, and for the suggested improvements. A main goal of this study was to show that integrating routine WGS in the clinic was a game changer for infection control efforts. We appreciate that this aspect was highlighted as a strength by this reviewer. While some of the weaknesses identified are inherent to the data (or lack thereof) available for this study, we have revised the manuscript to include a detailed discussion on limitations (sampling, thresholds of genetic relatedness, definition and categories etc.) that could influence the genomic inferences. We also provided exact p-values for the changes in rates of resistance, as requested. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly.

      Reviewer #2 (Public Review):

      Summary:

      The authors present a report of a large Pseudomonas aeruginosa hospital outbreak affecting more than 80 patients with first sampling dates in 2011 that stretched over more than 10 years and was only identified through genomic surveillance in 2020. The outbreak strain was assigned to the sequence type 621, an ST that has been associated with carbapenem resistance across the globe. Ongoing transmission coincided with both increasing resistance without acquisition of carbapenemase genes as well as the convergence of mutations towards a host-adapted lifestyle.

      Strengths:

      The convincing genomic analyses indicate spread throughout the hospital since the beginning of the century and provide important benchmark findings for future comparison.

      The sampling was based on all organisms sent to the Multidrug-resistant Organism Repository and Surveillance Network across the U.S. Military Health System.

      Using sequencing data from patient and environmental samples for phylogenetic and transmission analyses as well as determining recurring mutations in outbreak isolates allows for insights into the evolution of potentially harmful pathogens with the ultimate aim of reducing their spread in hospitals.

      Weaknesses:

      The epidemiological information was limited and the sampling methodology was inconsistent, thus complicating the inference of exact transmission routes. Epidemiological data relevant to this analysis include information on the reason for sampling, patient admission and discharge data, and underlying frequency of sampling and sampling results in relation to patient turnover.

      We thank the reviewer for their thoughtful feedback on our manuscript and for highlighting the quality of the genomic analyses. We agree that the lack of patient epi data (e.g. date of admission and discharge) and the inconsistent sampling through the years are limitations of this study. We have revised the manuscript to acknowledge these limitations and discuss how not having this data complicates the inference of exact transmission routes. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly.

      Reviewer #3 (Public Review):

      Summary:

      This paper by Stribling and colleagues sheds light on a decade-long P. aeruginosa outbreak of the high-risk lineage ST-621 in a US Military hospital. The origins of the outbreak date back to the late 90s and it was mainly caused by two distinct subclones SC1 and SC2. The data of this outbreak showed the emergence of antibiotic resistance to cephalosporin, carbapenems, and colistin over time highlighting the emerging risk of extensively resistant infections due to P. aeruginosa and the need for ongoing surveillance.

      Strengths:

      This study overall is well constructed and clearly written. Since detailed information on floor plans of the building and transfers between facilities was available, the authors were able to show that these two subclones emerged in two separate buildings of the hospital. The authors support their conclusions with prospective environmental sampling in 2021 and 2022 and link the role of persistent environmental contamination to sustaining nosocomial transmission. Information on resistance genes in repeat isolates for the same patients allowed the authors to detect the emergence of resistance within patients. The conclusions have broader implications for infection control at other facilities. In particular, the paper highlights the value of real-time surveillance and environmental sampling in slowing nosocomial transmission of P. aeruginosa.

      Weaknesses:

      My major concern is that the authors used fixed thresholds and definitions to classify the origin of an infection. As such, they were not able to give uncertainty measures around transmission routes nor quantify the relative contribution of persistent environmental contamination vs patient-to-patient transmission. The latter would allow the authors to quantify the impact of certain interventions. In addition, these results represent a specific US military facility and the transmission patterns might be specific to that facility. The study also lacked any data on antibiotic use that could have been used to relate to and discuss the temporal trends of antimicrobial resistance.

      We thank the reviewer for their evaluation of our work and for highlighting the broad implications of our findings regarding the application of real-time surveillance to suppress nosocomial transmission. We agree with the reviewer that fixed thresholds and definitions are imperfect to classify the origin of an infection. The design of this study (e.g. inconsistent sampling through time) was not conducive to providing a comprehensive/quantitative measurement of transmission routes. Thus, we decided to apply conservative thresholds of genetic relatedness and strict conditions (e.g. time between isolate collection, shared hospital location etc.) to favor specificity as our goal was simply to establish that cases of environment-to-patient transmission did happen. In the absence of a truth set, we have not performed sensitivity analysis, but we are conducting a follow-up study to compare inferences from MCMC models to our original fixed-thresholds predictions. This limitation is now discussed in the revised manuscript. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly including the addition of Figure S3.

      Reviewer #1 (Recommendations For The Authors):

      The definitions used on lines 391-396 are necessarily somewhat arbitrary, but it would be helpful to have a little bit more justification for the choices made, particularly for the definition of environmental involving the "3x the number of years they were separated". It seems a little hard to square this with the more relaxed 10 SNP cutoff for a patient-to-patient designation. Are there reasons for thinking SNP differences associated with environmental transmission should be smaller than for patient-to-patient, or is the aim here just to set the bar higher for assuming an environmental source? Because these definitions are quite arbitrary, there could also be some value in exploring the sensitivity of the results to these assumptions.

      Thank you. We agree with the reviewers that SNP thresholds, albeit necessary, are arbitrary and that more discussion/justification was needed to put the genomic inferences in context. We have revised the manuscript to indicate that: 1/ the 10 SNP cutoff for a patient-to-patient designation was set to account for the known evolution rate of P. aeruginosa (inferred by BEAST at 2.987E-7 subs/site/year in this study and similar to previous estimates PMID: 24039595) and the observed within host variability (now displayed in revised Fig. 1E). We note that this SNP distance alone was not sufficient and that an epi link (patients on the same ward at the same time) needed to be established. 2/ the environment-to-patient definition was indeed set to be most conservative (nearly identical isolates in two patients from the same ward with no known temporal overlap for > 365 days). This was done to favor high specificity as this inference relied solely on clinical isolates (i.e. the identical environmental strain in the patient-environment-patient chain was not sampled). For these clinical isolates to have acquired no or very few mutations in that much time, no/low replication is expected and, although unsampled, we propose this most likely happened on hospital surfaces.
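
      The two definitions described in this response can be read as a simple decision rule. The following Python sketch is illustrative only and is not the authors' pipeline: the `Isolate` record, the `classify_pair` function, and every parameter name are hypothetical. The 31-day window is an assumed stand-in for "same ward at the same time", and the bound of 3 SNPs per year of separation is taken from the reviewer's paraphrase of the manuscript and is assumed here to act as an upper limit.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Isolate:
    """Minimal, hypothetical record for a primary clinical isolate."""
    patient: str
    ward: str
    collected: date


def classify_pair(a, b, snp_distance,
                  p2p_max_snps=10,        # cutoff quoted in the response
                  p2p_window_days=31,     # assumption: "same time" = within a month
                  env_min_gap_days=365,   # no temporal overlap for > 365 days
                  env_snps_per_year=3.0): # assumption: upper bound per year apart
    """Label a pair of isolates from two patients using fixed thresholds."""
    # Both definitions require different patients sampled on the same ward.
    if a.patient == b.patient or a.ward != b.ward:
        return "undetermined"
    gap_days = abs((a.collected - b.collected).days)
    # Patient-to-patient: close in SNPs AND an epidemiological link.
    if snp_distance <= p2p_max_snps and gap_days <= p2p_window_days:
        return "patient-to-patient"
    # Environment-to-patient: nearly identical despite a long gap.
    years_apart = gap_days / 365.25
    if gap_days > env_min_gap_days and snp_distance <= env_snps_per_year * years_apart:
        return "environment-to-patient"
    return "undetermined"
```

      Pairs that satisfy neither rule stay "undetermined", which mirrors the response's point that the thresholds favor specificity over completeness.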

      While the term "core genome" should be familiar to most readers, "shell genome" and "cloud genome" are less widely known, and an explanation of what these terms mean here would be helpful.

      Thank you. We have revised the manuscript to define the core, shell, and cloud genomes as gene sets found in ≥ 99%, ≥ 95% and ≥ 15% of isolates, respectively.
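
      As a minimal sketch of how these cut-offs could be applied to a gene presence/absence count, the Python function below assigns a gene to the first category whose threshold it meets. The function name `genome_category` and the label used for genes below the lowest cut-off are hypothetical, and the three thresholds are taken exactly as stated in the response rather than from any particular pan-genome tool.

```python
def genome_category(n_with_gene, n_isolates,
                    core=0.99, shell=0.95, cloud=0.15):
    """Return the pan-genome category for a gene seen in n_with_gene isolates."""
    if n_isolates <= 0:
        raise ValueError("n_isolates must be positive")
    fraction = n_with_gene / n_isolates
    # Thresholds are checked from most to least prevalent.
    if fraction >= core:
        return "core"
    if fraction >= shell:
        return "shell"
    if fraction >= cloud:
        return "cloud"
    return "below cloud threshold"  # hypothetical label, not from the source
```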

      In the first paragraph of the discussion, it could be added that in many cases for clinically important Gram negatives short read sequencing alone will fail to detect transmission events as outbreaks can be driven by plasmid spread with only very limited clonal spread (see, for example, https://www.nature.com/articles/s41564-021-00879-y )

      Thank you. We agree this is an important/emerging aspect of surveillance. However, the goal of this discussion point was to explain why such a large outbreak was missed prior to implementing WGS (short read) surveillance. We feel that discussing “plasmid outbreaks” (which are not at play here, and are relatively rare in P. aeruginosa compared to the Enterobacteriaceae) and the need for long-read sequencing would distract from the narrative.

      line 599 What does "Mock" mean here? Would it be more accurate to say it is a simplified floor plan?

      Thank you. “Mock” was changed to “simplified”

      IPAC abbreviation is only used once - spelling it out in full would increase readability.

      Revised manuscript was edited as suggested.

      MHS is only used twice.

      Revised manuscript was edited to spell out Military Health System

      Line 364: full stop missing.

      Revised manuscript was edited as suggested.

      Line 401: Bayesian rather than bayesian.

      Revised manuscript was edited as suggested.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for giving me the opportunity to review this interesting manuscript.

      The conclusions of this paper are mostly well supported by the data presented, but epidemiological information was limited and the sampling methodology was inconsistent, thus complicating inference of exact transmission routes.

      Major issues:

      What was the baseline frequency of clinical and/or screening samples of Pseudomonas aeruginosa at the hospital? Neither Figure 1D nor Table S1 allows for differentiating between clinical and screening samples. Most isolates were cultured from clinical materials, and there is no information about the patients' length of stay and their respective sampling dates. Is there any possibility of finding out whether the samples were collected for clinical or screening purposes? Would it be possible to include the patients' admission data to determine whether the strains were imported into the hospital or related to a previous stay, e.g. among known carriers? Also, the issue of sampling dates vs. patient stay on the ward should be addressed, as there may be an overlap in patients' stay on the ward but no overlap in terms of sampling dates or even missing samples (missing links).

      We have revised the manuscript to address this important point: i) 16 isolates were from surveillance swabs and are labelled “Surveillance” in Table S1. The remaining 237 were clinical isolates; ii) unfortunately, because the sampling was done under a public health surveillance framework, we do not have access to historical patient data (admission/discharge date, wards, rooms, etc.) and we cannot calculate length of stay or better identify patient overlap. These limitations are now acknowledged in the discussion of the revised manuscript.

      In order to evaluate the extent of the outbreak, more epidemiological data would be useful. What is the size of the hospital, what is the average patient turnover, and what is the average length of stay in ICU and non-ICU? Is there any specialization besides the military label?

      We have revised the manuscript to indicate that Facility A is a 425-bed medical center and is the only Level 1 trauma center in the Military Health System. Unfortunately, the data to calculate length of stay, throughout the years, in ICU and non-ICU, was not available to us. This limitation is now also acknowledged in the discussion.

      Perhaps the authors could attempt to discuss the extent to which large outbreaks like these may be considered as part of unavoidable evolutionary processes within the hospital microbiome as opposed to accumulation and transmission of potentially harmful genes/clones, and differentiate between the putative community spread without any epidemiological links on the one hand, and hospital outbreaks that could be targeted by local infection prevention activities on the other hand.

      We respectfully disagree with the suggestion that this large outbreak “may be considered as part of unavoidable evolutionary processes within the hospital microbiome” and should be opposed to “transmission of potentially harmful genes/clones”. As a matter of fact, our data showed that infection control staff at Facility A responded with multiple interventions, including closing sinks, replacing tubing, and using foaming detergents. This resulted in slowing the spread of the ST621 outbreak with just 3 cases identified in 2022, 0 cases in 2023 and 1 case in 2024. This is now discussed in the revised manuscript.

      Page 5, lines 88-92 and lines 101-104. It seems as if the outbreak was identified only by the means of genomic surveillance. This raises questions as to the rationale for sampling and sequencing, especially prior to 2020. Considering 11 cases per year between 2011 and 2016, one could assume such an outbreak would have been noticed without sequencing data.

      The MRSN was created in 2010, in response to the outbreak of MDR Acinetobacter baumannii in US military personnel returning from Iraq and Afghanistan. Between 2011 and 2017, the MRSN collected MDR isolates (mandate for all MDR ESKAPE but compliance varied between years and facilities) from across the Military Health System and, for select isolates (e.g. high-risk isolates carrying ESBLs or carbapenemases) performed molecular typing by PFGE. In 2017 the MRSN started to perform whole genome sequencing of its entire repository. In 2020, a routine prospective sequencing service was started and first detected the ST621 outbreak. A retrospective analysis of historical isolate genomes (2011-2019) identified additional cases. The first paragraph of the discussion lists possible factors to explain why the ST621 escaped detection by traditional approaches. We believe 11 cases per year is not a strong signal when stratified by month, wards, or both, especially for a clone lacking a carbapenemase and without a remarkable antibiotic susceptibility profile. 

      Did the infection control personnel suspect transmission? If yes, was the sampling and submission of samples to the MRSN adapted based on the epidemiologic findings?

      The ST621 outbreak was unsuspected before the initial genomic detection in 2020. Until that point, MDR isolates only (Magiorakos et al PMID: 21793988) were collected but compliance was variable through time. Quickly thereafter (starting in 2021), complete sampling of all clinical P. aeruginosa (MDR or not) from Facility A was started. The manuscript was revised to clarify those details of the sampling strategy.

      Is there any information about how many environmental sites were sampled without evidence of ST621 / screening samples were cultured without evidence of Pseudomonas aeruginosa?

      For patient isolates, only 16 isolates were from surveillance swabs. The remaining 237 were clinical isolates. No denominator data was available to calculate P. aeruginosa and ST-621 positivity rate in surveillance swabs throughout the time period. For environmental isolates, a total of 159 swabs were taken from 55 distinct locations in 8 wards/units including the ER. This data is now included in the revised manuscript. However, a complete analysis of these swabs (positivity rate for ESKAPE pathogens, P. aeruginosa, per ward/floor/room, per swab type (sink drain, bed rail etc.) etc.) is beyond the scope of this study and is being performed as a follow up investigation.

      Page 5 lines 89 and 39 Figure S1B. Please describe how the allelic distance for the cluster threshold was selected.

      As indicated in the legend of Figure S1B, no thresholds were applied. All ST621 isolates ever sequenced by the MRSN were included. All except 3 isolates shared between 0-23 cgMLST allelic differences. The remaining 3 were distant by 88-89 allelic differences. The text was revised to clarify this point.

      Page 5 lines 99-100. Could the authors please provide some distribution measures (e.g. IQR).

      Done as requested. The revised manuscript now reads “…of just 38 single nucleotide polymorphisms (SNPs), and an IQR of 19 (Fig. 1A, Table S1).”

      Page 5 line 102. Could the authors please provide some distribution measures (e.g. IQR).

      Please see above. A chart was created and is now included as Fig. S2.

      Page 6 line 107 and page 34 figure 1c. In the text it is stated that isolates were collected in 27 wards, the figure 1C depicts 26 wards and n/a.

      Thank you for spotting this inconsistency. This has been fixed in the revised manuscript.

      Page 6 lines 117-118. Samples collected in the emergency room would imply samples collected on admission, already addressed previously. Did the authors investigate a potential import into the hospital from community reservoirs or were all these isolates collected among patients who had been previously admitted to the hospital and/or tested positive for the outbreak strain?

      We agree that samples collected in the ER imply samples collected on admission. Of the 29 ER isolates only 9 (31%) were primary isolates (first detection in a new patient) which suggests a majority were from returning patients at Facility A. Because the sampling was done under a public health surveillance framework, we do not have access to historical patient data (admission/discharge date, wards, rooms, etc.) to investigate/confirm that these 9 patients had previous visits at Facility A. This point is now discussed in the revised manuscript.

      Page 6 line 128. This could also represent increased selective pressure. However, according to Table S1, the 28 isolates collected in 2011 (the number does not match with Figure 1D) were from many different wards, thus indicating earlier spread throughout the hospital.

      Yes, we agree. Please note that Table S1 lists all isolates for 2011 whereas Figure 1D focuses on primary isolates (first isolate from each patient) only.

      Page 7 line 133. Both Figure 2 and the discussion section, page 13 line 296 suggest the year 2005 instead of 2004?

      Thank you for catching this typographical error. This was corrected to 2004 in the revised manuscript.

      Figure 1E. The figure should also depict intra-patient diversity for comparison.

      Thank you for this great suggestion. We have revised Figure 1E accordingly.

      Page 7, lines 146-147 Could the authors attempt explaining the upper part of the bimodal peaks?

      This is an all-vs-all SNP analysis for all inter-patient isolates. For each isolate, all distances to other isolates are reported, not only the smallest. The upper peaks represent comparisons to isolates from a different outbreak subclone (SC1 vs SC2).

      Page 7, line 150 This is a very small number considering the extent of the outbreak and suggests a large number of missing links. Or does this rather imply continuous import and evolution over time that does not necessarily represent transmission within the hospital?

      We believe all cases were due to transmission happening within the hospital. Based on conservative thresholds (genetic relatedness and epi link, or lack thereof) the precise origin from another patient (n=10) or a contaminated surface (n=12) can be inferred. For the remaining 60 patients, with the available sampling, the conditions we chose are not met and we simply do not conclude whether a direct patient-to-patient or an environmental origin was more likely.

      Page 8 line 155. What does the temporal overlap refer to - sampling date versus patient's stay on the ward? Please specify.

      The temporal overlap was investigated from sampling dates, as dates of patient admission/discharge were not available.

      Page 8, line 157: What does primary/serial isolate mean - first and follow-up samples of ST621 per patient?

      Yes. Primary isolate is used to designate the first isolate from a patient. Serial isolates designate follow-up samples of ST621.

      Page 8 line 165: Table S3 and Figure 3 only refer to environmental samples from three wards. Ward 20 rooms 2 and 18 as well as ward 1 rooms 1 and 6 were hotspots - is there any information on the specific infection control/disinfection measures? Addressed in discussion page 12, lines 273-275, but no information on what was actually done.

      The manuscript was revised to indicate the precise disinfection measures that were taken. A follow-up study is ongoing to assess long-term efficacy and monitor possible retrograde growth from previously contaminated sinks.

      Page 8 line 175: Evaluation of change in resistance fraction over time - There may have been a selection bias with an inconsistent number of strains sequenced per year.

      Yes, incomplete sampling and possible selection bias are now listed with other limitations of this study in the discussion of the revised manuscript.

      Page 9 line 183: The referral to Table S1 is unclear, I could not find the number and the specific isolates selected for long-read sequencing.

      Thank you. This has been added to the revised Table S1.

      Page 10 lines 217-225 and Figure 4C: Perhaps it is possible to better align what is written in the text and the caption of the figure. The caption does not clarify that only one patient develops colistin resistance (what was the reason to include the other patients?).

      Thank you. We have revised the text and the caption of the figure to clarify that only isolates from one patient developed colistin resistance. The isolates from the other patients on Fig. 4C are shown to provide context and accurately map the emergence of the PhoQE77fs mutation.  

      Page 10, lines 228-229 and Table S5: How is it possible to identify those 64 genes in Table S5?

      We have revised Table S5 to facilitate the identification of the 64 genes with ≥ 2 independently acquired mutations (excluding SYN). Specifically, we have added column E labeled “Counts independent mutations per locus (excluding SYN)”. A total of 205 rows (in this table each row is a variant) have a value ≥ 2 and these represent 64 genes (upon deduplication of locus tags).  
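
      The deduplication step described here (variant rows filtered by count, then collapsed to unique locus tags) can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys `locus_tag` and `independent_nonsyn_count` are hypothetical stand-ins for the locus tag column and column E of Table S5, and the function name is invented for this example.

```python
def recurrently_mutated_loci(variant_rows, min_count=2):
    """Collapse per-variant rows to the set of loci meeting the count cutoff.

    Each row is one variant; several rows can share a locus tag, so the
    result is a set of unique locus tags rather than a list of rows.
    """
    return {
        row["locus_tag"]
        for row in variant_rows
        if row["independent_nonsyn_count"] >= min_count
    }


# Toy rows with made-up locus tags, for illustration only.
rows = [
    {"locus_tag": "locus_A", "independent_nonsyn_count": 3},
    {"locus_tag": "locus_A", "independent_nonsyn_count": 3},
    {"locus_tag": "locus_B", "independent_nonsyn_count": 1},
    {"locus_tag": "locus_C", "independent_nonsyn_count": 2},
]
recurrent = recurrently_mutated_loci(rows)
```

      In this toy input three rows pass the filter but they represent only two loci, which is the same relationship the response describes between 205 rows and 64 genes.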

      Page 13, lines 280-281: Where is the information on chronic infection presented? Serial cultures would not necessarily mean chronic infection.

      Yes, we agree this was not the appropriate characterization and this was revised to ‘long-term’ infections.

      Page 14 line 306: Emergence of colistin resistance in a single patient, correct?

      Yes. This was further clarified in the text.

      Page 14 lines 315-320: This should go to the results section. In particular disinfection, closing, and replacing of tubing should be mentioned in the results section in reference to the results presented in Table S3.

      Thank you. We have considered this suggestion and have decided to leave this discussion as the closing paragraph of this publication. A follow-up study is ongoing to assess long-term efficacy of these interventions on the ST-621 but also other outbreak clones at Facility A.

      Methods

      Page 15 lines 330-333: Perhaps it is possible to avoid redundancy.

      Thank you. We have revised the text accordingly.

      Page 15 lines 341: Information on which isolates were subjected to long-read sequencing is missing.

      Thank you. This has been added to the revised Table S1.

      Page 16 line 345: Was there a particular reason why Newbler was chosen?

      No. At the time Newbler was the default assembler built in the MRSN bacterial genome analysis pipeline and QC processes.

      Page 16, line 357-358: What was the rationale for selecting this isolate as reference genome?

      This isolate was chosen because it was collected early in the outbreak and phylogenetic analysis revealed it had low root to tip divergence.

      Page 16 line 361: Why 310 isolates, if only 253 were assigned to the outbreak clone and only a subset of those were collected in facility A?

      This was a typographical error that has been corrected (it now reads “…set of 253 isolates.”) in the revised manuscript.

      Page 17 lines 387-395: What is the reason that intra-patient diversity was not included in the set of criteria for SNP distances?

      The observed within host variability (now displayed in revised Fig. 1E) was taken into consideration when setting SNP thresholds for categorizing patient-to-patient transmission or environment-to-patient event. This is now clarified in the revised manuscript.

      Page 17 line 392: How was the threshold of <=10 SNPs determined?

      The 10 SNP cutoff to infer a patient-to-patient transmission event was set to account for the known evolution rate of P. aeruginosa (inferred by BEAST at 2.987E-7 subs/site/year in this study, and similar to previous estimates PMID: 24039595) and the observed within host variability (now displayed in revised Fig. 1E). We note that this SNP distance was not sufficient and that an epi link (patients on the same ward within the same month) needed to be established.

      Page 17 line 395 and Figure 2: What was the assumed average mutation rate per genome per year?

      Thank you. The mean substitution rate inferred by BEAST was 2.987E-7 subs/site/year, similar to estimates from previous studies on P. aeruginosa outbreaks (e.g. PMID: 24039595).
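
      Since the reviewer asked for a rate per genome per year, the per-site rate can be converted with simple arithmetic. The Python sketch below shows the conversion; the function name is hypothetical, and the number of sites (6.5 million, a round figure in the typical size range of a P. aeruginosa genome) is an assumption because the length of the alignment used by the authors is not given in this response.

```python
def expected_substitutions(rate_per_site_year, n_sites, years, lineages=1):
    """Expected substitutions accumulated over `years` across `n_sites`.

    Use lineages=2 for the expected distance between two isolates that
    have each diverged from a common ancestor for `years`.
    """
    return rate_per_site_year * n_sites * years * lineages


RATE = 2.987e-7             # subs/site/year, as reported in the response
ASSUMED_SITES = 6_500_000   # assumption: not stated in the source

per_genome_per_year = expected_substitutions(RATE, ASSUMED_SITES, years=1)
```

      Under that assumed genome size the rate corresponds to roughly two substitutions per genome per year, which gives a sense of scale for the 10 SNP cutoff discussed above.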

      Reviewer #3 (Recommendations For The Authors):

      Please find (line-by-line comments) on each section of the manuscript below:

      Introduction

      Line 86: I am wondering why the authors state ">28 facilities" instead of the exact number of facilities from which these lineages were recovered.

      Thank you. Manuscript was revised to provide the exact number of facilities. It now reads “…recovered from 37 and 28 facilities, respectively.”

      Methods

      It's not clear to me which criteria were used for collecting these isolates (both prospective and retrospective). I understand that some of the data are described in more detail in Lebreton et al but I did not find the specific criteria for the collection of the isolates and I imagine that these might differ between facilities. Would it be possible to comment on that and add a short paragraph in the Methods section?

      Thank you. This lack of clarity was also raised by other reviewers, and we have revised the manuscript to indicate that: 1/ MDR isolates only (Magiorakos et al PMID: 21793988) were collected from 2011-2020 with the same criteria for all facilities although compliance was variable through time and between facilities; and 2/ starting in 2021 all P. aeruginosa isolates, irrespective of their susceptibility profile, were collected from Facility A.

      The data comes from a US Military hospital. Is this related to the US Veterans Affairs Healthcare system? Is there more detailed information about the demographics of the patient population?

      Facility A is part of the Military Health System (MHS) which provides care for active service members and their families. This is distinct from the US Veterans Affairs Healthcare system. Only limited patient data was accessible to us as this study was done as part of our public health surveillance activities. Patient age (avg. 57.2 +/- 21.0) and gender (ratio male/female 1.7) are provided in the revised manuscript. 

      Line 384ff: The origin of infection was inferred based on the SNP threshold and epidemiological links. However, recombination events can complicate the interpretation of SNP data. Have the authors attempted to account for this?

      Thank you. We agree that recombination events can complicate the interpretation of SNP data. We used Gubbins v2.3.1 to filter out recombination from the core SNP alignment, as indicated in the revised manuscript.

      The authors' definition of environment-to-patient transmission seems conservative (nearly identical strain and no known temporal overlap for > 365 days). Have the authors changed the threshold, performed sensitivity analyses, and tested how this would affect their results?

      Indeed, acknowledging that fixed thresholds have limitations in their ability to accurately predict the origin of infections, we took a conservative approach to favor specificity as our goal was simply to establish that cases of environment-to-patient transmission did happen. In the absence of a truth set, we have not performed sensitivity analysis, but we are conducting a follow-up study to compare inferences from MCMC models to our original predictions. This limitation is now discussed in the revised manuscript.

      The authors don't seem to incorporate the role of healthcare workers in the transmission process. Could they comment on this? I am assuming that environment-to-patient transmission could either be directly from the environment to the patient or via a healthcare worker. I think it's fine to make simplifying assumptions here but it would be great if this was explicitly described.

      Thank you for this suggestion. We have not sampled the hands of healthcare workers in this study. As a result, the reviewer is correct to say that we made the simplifying assumption that healthcare workers would be possible intermediates in either environment-to-patient or patient-to-patient transmissions, as previously described by others (PMID: 8452949). This limitation is now discussed in the revised manuscript.

      Page 5, line 100: What does "all vs all" mean? Based on the supplement, I assume it's the pairwise distance and then averaged across all of those. It would improve the readability of the manuscript if the authors could briefly define this term and then maybe refer to Table S1.

      Thank you. We have created Fig.S2 and revised the manuscript to state that ST-621 isolates from facility A belonged to the same outbreak clone with a distance (averaged all vs all pairwise comparison) of just 38 single nucleotide polymorphisms (SNPs), and an IQR of 19 (Fig. S2, Table S1).
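
      For readers unfamiliar with the term, an "all vs all" summary takes every unordered pair of isolates once and then summarizes the resulting distances. The Python sketch below illustrates this on a toy symmetric SNP matrix with made-up values; the function name `pairwise_summary` is hypothetical, and the quartile method (`inclusive`) is an assumption since the response does not say how the IQR was computed.

```python
from itertools import combinations
from statistics import mean, quantiles


def pairwise_summary(snp_matrix):
    """Mean and IQR over all unordered isolate pairs of a symmetric matrix."""
    n = len(snp_matrix)
    # combinations() visits each pair (i < j) once and skips the diagonal.
    distances = [snp_matrix[i][j] for i, j in combinations(range(n), 2)]
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    return mean(distances), q3 - q1


# Toy matrix for four isolates forming two tight groups (made-up values).
toy = [
    [0, 2, 40, 42],
    [2, 0, 38, 44],
    [40, 38, 0, 4],
    [42, 44, 4, 0],
]
avg_distance, iqr = pairwise_summary(toy)
```

      With two well-separated groups, as in the toy matrix, the distances split into small within-group and large between-group values, the same pattern the authors invoke to explain the bimodal peaks in Fig. 1E.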

      Figure 1D: It would be interesting to see additional figures in the supplement on the percentage of sequenced isolates per year and whether it varies across the different sources/sites. Is there any information on which isolates were chosen for sequencing?

      Lack of clarity in the sampling/sequencing scheme was raised by multiple reviewers and we have provided a thorough response to earlier comments. We also have revised the material and methods section accordingly. Finally, we have created Fig. S3 to show the percentage of sequenced isolates per year across different sources/sites, as suggested by the reviewer. No noticeable patterns were observed. 

      It seems like only a subset of all clinical isolates were sequenced. Would it be possible that SC2 was present already earlier but not picked up until a certain date?

      Although all isolates received by the MRSN were sequenced, compliance varied through time so it is true that not all clinical isolates were sequenced between 2011-2019. As such, we fully agree with this hypothesis and discuss this possibility as BEAST analysis placed the origin of SC2 in 2004 while the first detection of an SC2 isolate was in December 2012. This limitation is now discussed in the revised manuscript.

      Could the authors elaborate on whether the isolates resulted from single-colony picks? Is it possible that the apparent absence of a subclone is due to the fact that only a single colony was picked?

      Yes, the isolates resulted from single-colony picks except when the presence of different colony morphologies was noted. In the latter case, representative isolates for each colony morphology were processed. We have revised the methods to make that clear.

      Figure 2: It is difficult to see which nodes belong to which patient due to the small font size. I wonder if it was possible to color the nodes for each patient, to make it more readable.

      We tried coloring the nodes but with > 60 distinct patients/colors we decided it did not improve clarity. We have revised figure 2 to increase the font size.  

      Page 7-8, lines 154-155: Did the authors check whether there were isolates of the same strain (that were found in the environment) present in other patients elsewhere in the ward?

      Yes. In rare cases, we observed virtually genetically identical isolates from two patients collected in different wards. Because we only have access to clinical isolate data (collected from patient X in ward Y) and do not have access to patient data (admission/discharge date, wards, rooms, etc.), we cannot determine, and therefore cannot exclude, that the patients overlapped in a room prior to the sampling of their P. aeruginosa isolates. We designed our fixed thresholds to be conservative. As a result, in this analysis, these cases are labelled as “undetermined”.

      Page 8: Do the authors have any information on antibiotic use during this timeframe? From the discussion, it seems like there is no patient-level prescription data. Is there any data on overall trends? How were trends in antibiotic use correlated with trends in antibiotic resistance?

      Unfortunately, patient-level prescription data (or any other data not linked to the bacterial specimens) was not accessible to us as this study was done as part of our public health surveillance activities.

      To infer the origin of infection, the authors used a static method with fixed thresholds and definitions. This study does not provide any measure of uncertainty for its estimates. Maybe the authors could add a sentence in the discussion section noting that MCMC methods to infer transmission trees incorporating WGS could provide such estimates. These methods have not often been applied to PA, but below are two examples where MCMC methods have been used without WGS (though the definition of environmental contamination may differ between these studies and this study).

      https://doi.org/10.1186/s13756-022-01095-x

      https://doi.org/10.1371/journal.pcbi.1006697

      Thank you for this great suggestion. We have revised the manuscript to include a discussion on the limitations of fixed thresholds to infer transmission chains/origins, and to discuss existing alternatives including MCMC methods. 

      Line 322-323: This sentence is a bit vague since not all of these HAI are due to P. aeruginosa. I would suggest citing a number that is specific to PA.

      Thank you. While our paper shows a particular example of protracted P. aeruginosa outbreak, the roll-out of routine WGS surveillance in the clinic will help prevent hospital-associated drug-resistant infections for more than this species. We believe that broadening the scope in the last sentence of the manuscript is important and we decline to revise as suggested.

    1. Good night, ladies, good night, sweet ladies, good night, good night.

      It's interesting to me how Eliot ends this section of The Waste Land with Ophelia's last words before she commits suicide. Lines before, we get references to "Bill," "Lou," and "May," indicating that the speaker is bidding farewell from the pub setting. Ophelia's line, on the other hand, bids farewell on behalf of not just Lil and the woman in the pub, but all the "sweet ladies" of the waste land. This idea of death as a fate is super interesting. The women have their emotional and spiritual deaths connected to Ophelia's physical death. This is yet another instance where we see suicide in a female in The Waste Land. If I think about what Eliot is trying to get at with women x waste land, especially with this Ophelia connection, I'd say the waste land is a world where the modes of expressing experiences like song, symbol, and even madness have been stripped of their meaning and beauty, leaving only bad nerves, dirty gossip, and the last call of the pub. This is obviously not the ideal place for women; hence, modern society is not fit for women to flourish.

    1. One critique of all of these approaches, however, is that no design, no matter how universal, will equally serve everyone. This is the premise of design justice (Costanza-Chock, S. (2020). Design justice: Community-led practices to build the worlds we need. MIT Press.), which observes that design is fundamentally about power, in that designs may not only serve some people less well, but systematically exclude them in surprising, often unintentional ways.

      I agree with this. I am privileged to often forget about the exclusion of certain groups in "universal" designs. An example of this that I thought of was pens. I found out recently that a lot of left-handed people have a hard time with ink pens as their palms tend to smear the wet ink immediately after writing. Another example I could think of was the original Band-Aid colors, and how they did a poor job of representing people of all skin tones. Any design that leaves out a certain group of people should always have a substitute version for those people or should not be designed altogether.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this manuscript entitled "Molecular dynamics of the matrisome across sea anemone life history", Bergheim and colleagues report the prediction, using an established sequence analysis pipeline, of the "matrisome" - that is, the compendium of genes encoding constituents of the extracellular matrix - of the starlet sea anemone Nematostella vectensis. Re-analysis of an existing scRNA-Seq dataset allowed the authors to identify the cell types expressing matrisome components and different developmental stages. Last, the authors apply time-resolved proteomics to provide experimental evidence of the presence of the extracellular matrix proteins at three different stages of the life cycle of the sea anemone (larva, primary polyp, adult) and show that different subsets of matrisome components are present in the ECM at different life stages with, for example, basement membrane components accompanying the transition from larva to primary polyp and elastic fiber components and matricellular proteins accompanying the transition from primary polyp to the adult stage. 

      Strengths: 

      The ECM is a structure that has evolved to support the emergence of multicellularity and different transitions that have accompanied the complexification of multicellular organisms. Understanding the molecular makeup of structures that are conserved throughout evolution is thus of paramount importance. 

      The in-silico predicted matrisome of the sea anemone has the potential to become an essential resource for the scientific community to support big data annotation efforts and understand better the evolution of the matrisome and of ECM proteins, an important endeavor to better understand structure/function relationships. This study is also an excellent example of how integrating datasets generated using different -omic modalities can shed light on various aspects of ECM metabolism, from identifying the cell types of origins of matrisome components using scRNA-Seq to studying ECM dynamics using proteomics. 

      We greatly appreciate the positive feedback regarding the design of our study and the evolutionary significance of our findings.

      Weaknesses: 

      My concerns pertain to the three following areas of the manuscript: 

      (1) In-silico definition of the anemone matrisome using sequence analysis: 

      a) While a similar computational pipeline has been applied to predict the matrisome of several model organisms, the authors fail to provide a comprehensive definition of the anemone matrisome: In the text, the authors state the anemone matrisome is composed of "551 proteins, constituting approximately 3% of its proteome" (see page 6, line 14), but Figure 1 lists 829 entries as part of the "curated" matrisome, Supplementary Table S1 lists the same 829 entries and the authors state that "Here, we identified 829 ECM proteins that comprise the matrisome of the sea anemone Nematostella vectensis" (see page 17, line 10). Is the sea anemone matrisome composed of 551 or 829 genes? If we refer to the text, the additional 278 entries should not be considered as part of the matrisome, but what is confusing is that some are listed as glycoproteins and the "new_manual_annotation" proposed by the authors and that refer to the protein domains found in these additional proteins suggest that in fact, some could or should be classified as matrisome proteins. For example, shouldn't the two lectins encoded by NV2.3951 and NV2.3157 be classified as matrisome-affiliated proteins? Based on what has been done for other model organisms, receptors have typically been excluded from the "matrisome" but included as part of the "adhesome" for consistency with previously published matrisome; the reviewer is left wondering whether the components classified as "Other" / "Receptor" should not be excluded from the matrisome and moved to a separate "adhesome" list. 

      In addition to receptors, the authors identify nearly 70 glycoproteins classified as "Other". Here, does other mean "non-matrisome" or "another matrisome division" that is not core or associated? If the latter, could the authors try to propose a unifying term for these proteins? Unfortunately, since the authors do not provide the reasons for excluding these entries from the bona fide matrisome (list of excluding domains present, localization data), the reader is left wondering how to treat these entries. 

      Overall, the study would gain in strength if the authors could be more definitive and, if needed, even propose novel additional matrisome annotations to include the components for now listed as "Other" (as was done, for example, for the Drosophila or C. elegans matrisomes). 

      The reviewer is correct to point out the confusing terminology used throughout our manuscript, where both the total of 829 proteins constituting the curated list of ECM domain proteins and the actual matrisome (excluding "others") were referred to as "matrisomes". In general, we followed the example set by Naba & Hynes in their 2012 paper (Mol Cell Proteomics. 2012 Apr;11(4):M111.014647. doi: 10.1074/mcp.M111.014647), where they define the "matrisome" as encompassing all components of the extracellular matrix ("core matrisome") and those associated with it ("matrisome-associated" proteins). This corresponds to our group of 551 proteins, comprising both core matrisome and matrisome-associated proteins. The Naba & Hynes paper also contains the inclusive and exclusive domain lists for the matrisome that we applied for our dataset. In the revised manuscript, we have now labelled the group of 829 proteins as "curated ECM domain proteins/genes", which includes all proteins positively selected for containing a bona fide ECM domain. After excluding non-matrisomal proteins such as receptors, we arrive at the 551 proteins that constitute the "Nematostella matrisome". We have maintained this terminology throughout the revised manuscript and have revised Figures 1B and 4B accordingly.

      Regarding the category of "other" proteins, which by definition are not part of the matrisome although containing ECM domains, we have taken the reviewer's advice and classified these in more detail. We categorized all receptors as "adhesome" (202 proteins).  The remaining group of “other” secreted ECM domain proteins were then further subcategorized. Those exhibiting significant matches in the ToxProt database were subclassified as "putative venoms" (15 proteins). This group also includes the two lectins (NV2.3951 and NV2.3157), which had been originally shifted to the “other” category due to their classification as venoms. We categorized as “adhesive proteins” (28 proteins) factors such as coadhesins that due to their domain architecture resemble bioadhesive proteins described in proteomic studies of other invertebrate species, such as corals or sponges (see also https://doi.org/10.1016/j.jprot.2022.104506). Further sub-categories are stress/injury response proteins (9 proteins) and ion channels (6 proteins). The remaining 17 proteins were categorized as “uncharacterized ECM domain proteins”. These include highly diverse proteins possessing either single ECM domains or novel domain combinations. We decided to retain those in our dataset as candidates for future functional characterization.
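
      The category sizes quoted in this response can be tallied in a few lines. The sketch below only restates the numbers given above (829 curated ECM domain proteins, 551 matrisome proteins, and the six sub-categories); the variable names are illustrative. As quoted, the sub-categories sum to 277, one fewer than 829 − 551 = 278, and the response does not say where the remaining entry falls.

```python
# Counts as quoted in the response; names are illustrative only.
CURATED_ECM_DOMAIN_PROTEINS = 829
MATRISOME_PROTEINS = 551  # core matrisome + matrisome-associated

non_matrisome_subcategories = {
    "adhesome (receptors)": 202,
    "putative venoms": 15,
    "adhesive proteins": 28,
    "stress/injury response proteins": 9,
    "ion channels": 6,
    "uncharacterized ECM domain proteins": 17,
}

# Entries in the curated list that are not part of the matrisome.
expected_non_matrisome = CURATED_ECM_DOMAIN_PROTEINS - MATRISOME_PROTEINS
# Entries accounted for by the six named sub-categories.
listed_non_matrisome = sum(non_matrisome_subcategories.values())

if __name__ == "__main__":
    print("curated minus matrisome:", expected_non_matrisome)
    print("sum of sub-categories:  ", listed_non_matrisome)
    print("unaccounted entries:    ", expected_non_matrisome - listed_non_matrisome)
```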

      b) It is surprising that the authors are not providing the full currently accepted protein names to the entries listed in Supplementary Table S1 and have used instead "new_manual_annotation" that resembles formal protein names. This liberty is misleading. In fact, the "new_manual_annotation" seems biased toward describing the reason the proteins were positively screened for through sequence analysis, but many are misleading because there is, in fact, more known about them, including evidence that they are not ECM proteins. The authors should at least provide the current protein names in addition to their "new_manual_annotations". 

      c) To truly serve as a resource, the Table should provide links to each gene entry in the Stowers Institute for Medical Research genome database used and some sort of versioning (this could be added to columns A, B, or D). Such enhancements would facilitate the assessment of the rigor of the list beyond the manual QC of just a few entries. 

      d) Since UniProt is the reference protein knowledge database, providing the UniProt IDs associated with the predicted matrisome entries would also be helpful, giving easy access to information on protein domains, protein structures, orthology information, etc. 

      e) In conclusion, at present, the study only provides a preliminary draft that should be more rigorously curated and enriched with more comprehensive and authoritative annotations if the authors aspire the list to become the reference anemone matrisome and serve the community. 

      Table S1 has been updated to include links to the respective Stowers Institute IDs (first two columns), as well as SwissProt IDs and current descriptions from both the Stowers Institute (SI) and Swissprot.

      In our manual annotations, we prioritized these over automated ones due to the considerable effort invested in examining each sequence individually. The cnidaria-specific minicollagens and NOWA proteins might serve as an example. According to the SI descriptions, the minicollagens are annotated as “keratin-associated protein, predicted or hypothetical protein, collagen-like protein and pericardin”. We classified these as minicollagens on the basis of overall domain architecture and of signature domains and sequence motifs, such as minicollagen cysteine-rich domains (CRDs) and polyproline stretches (doi: 10.1016/j.tig.2008.07.001). NOWA is a CTLD/CRD-containing protein that is part of nematocyst tubules (doi:10.1016/j.isci.2023.106291). The first two NOWA isoforms, according to SI descriptions, were annotated as aggrecan and brevican core proteins, which is very misleading. We therefore feel that our manual annotations better serve the cnidarian research community in classifying these proteins.

      Automated annotations of ECM proteins often rely on similarities between individual domains, neglecting overall domain composition. For example, Swissprot descriptions annotate 31 TSP1 domain-containing proteins in our list as "Hemicentin-1", but closer inspection reveals that only one sequence (NV2.24790) qualifies as Hemicentin-1 due to its characteristic vWFA, Ig-like, TSP1, G2 nidogen, and EGF-like domain architecture. Regarding novel protein annotations, NV2.650 might serve as an example. While SI descriptions annotate this protein as "epidermal growth factor" based on the presence of several EGF-like domains, our analysis reveals two integrin alpha N-terminal domains that classify this sequence as integrin-related. We have therefore assigned a description (Secreted integrin-N-related protein) that references this defining domain and avoids misclassification within the EGF family.

      In cases where the automated annotation (including those in Genbank) matched our own findings, we adopted the existing description, as seen with netrin-1 (NV2.7734). We acknowledge that our manual annotations are not flawless and will be refined by future research. Nonetheless, we offer them as an approximation to a more accurate definition of the identified protein list.

      (2) Proteomic analysis of the composition of the mesoglea during the sea anemone life cycle: 

      a) The product of 287 of the 829 genes proposed to encode matrisome components was detected by proteomics. What about the other ~550 matrisome genes? When and where are they expressed? The wording employed by the authors (see line 11, page 13) implies that only these 287 components are "validated" matrisome components. Is that to say that the other ~550 predicted genes do not encode components of the ECM? This should be discussed. 

      Obviously, our wording was not sufficiently accurate here. In the revised Fig. 1B we indicated that 210 of the 551 matrisome (core and associated) proteins were confirmed by mass spectrometry. In total, 287 proteins were identified by mass spectrometry, meaning that 77 of those are non-matrisomal proteins belonging to the “adhesome” (47) and “other” (30) groups. The fact that the remaining 542 proteins of the matrisome predicted by our in silico analysis could not be identified has two major reasons: (1) Our study was focussed on the molecular dynamics of the mesoglea. Therefore, only mesogleas were isolated for the mass spectrometry analysis and nematocysts were mostly excluded by extensive washing steps. As nematocysts contribute significantly to the predicted matrisome, this group of proteins is underrepresented in the mass spectrometry analysis. (2) A significant fraction of the predicted ECM proteins constitutes soluble factors and transmembrane receptors. These might not be necessarily part of the mesoglea isolates. In addition, the isolation and solubilization method we applied might have technical limitations. Although we used harsh conditions for solubilizing the mesoglea samples (90°C and high DTT concentrations), we cannot exclude that we missed proteins which resisted solubilization and thus trypsinization. We confirmed that all genes predicted by the in silico analysis have transcriptomic profiles as demonstrated in supplementary table S4. We have clarified these points in the revised results part (p.6) and also revised the statement in line 16, page 13.
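
      The detection figures in this response can be cross-checked with simple arithmetic. The sketch below uses only the counts stated above (829 curated proteins, 551 matrisome proteins, and 210 + 47 + 30 detected by mass spectrometry); the variable names and the derived coverage fraction are illustrative additions.

```python
# Counts as stated in the response; names are illustrative only.
curated_total = 829     # curated ECM domain proteins
matrisome_total = 551   # core + matrisome-associated

detected_by_ms = {
    "matrisome": 210,
    "adhesome": 47,
    "other": 30,
}

# All proteins identified by mass spectrometry.
detected_total = sum(detected_by_ms.values())
# Non-matrisomal proteins among the detected ones.
detected_non_matrisome = detected_by_ms["adhesome"] + detected_by_ms["other"]
# Curated proteins that were not identified.
undetected_total = curated_total - detected_total
# Fraction of the matrisome confirmed at the protein level.
matrisome_coverage = detected_by_ms["matrisome"] / matrisome_total

if __name__ == "__main__":
    print("detected in total:     ", detected_total)
    print("detected non-matrisome:", detected_non_matrisome)
    print("not detected:          ", undetected_total)
    print(f"matrisome coverage:     {matrisome_coverage:.1%}")
```

      On these numbers, the 542 undetected proteins are counted against the full curated list of 829 rather than against the 551-protein matrisome.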

      b) Can the authors comment on how they have treated zero TMT values or proteins for which a TMT ratio could not be calculated because unique to one life stage, for example? 

      We did not include these proteins in the analysis of the respective statistical comparison. This involved only very few proteins (about 10).  

      c) Could the authors provide a plot showing the distribution of protein abundances for each matrisome category in the main figure 4? In mammals, the bulk of the ECM is composed of collagens, followed by fibrillar ECM glycoproteins, the other matrisome components being more minor. Is a similar distribution observed in the sea anemone mesoglea? 

      We have included such a plot showing protein abundances across life stages and protein categories (Fig. 4A). Collagens and basement membrane proteoglycans (perlecan) are the most abundant protein categories in the core matrisome while secreted factors dominate in the matrisome-associated group.

      d) Prior proteomic studies on the ECM of vertebrate organisms have shown the importance of allowing certain post-translational modifications during database search to ensure maximizing peptide-to-spectrum matching. Such PTMs include the hydroxylation of lysines and prolines that are collagen-specific PTMs. Multiple reports have shown that omitting these PTMs while analyzing LC-MS/MS data would lead to underestimating the abundance of collagens and the misidentification of certain collagens. The authors may want to reanalyze their dataset and include these PTMs as part of their search criteria to ensure capturing all collagen-derived peptides. 

      Thank you for this suggestion. We have re-analyzed our dataset including lysine and proline hydroxylation as PTM. While we obtained in total 70 more proteins using this approach, this additional group did not contain any large collagen or minicollagen we had not detected before. We only obtained two additional collagen-like proteins with very short triple helical domains (V2t013973001.1, NV2t024002001.1), one being a fragment. We don’t feel this justifies implementing a re-analysis of the proteome in our study.

      e) The authors should ensure that reviewers are provided with access to the private PRIDE repository so the data deposited can also be evaluated. They should also ensure that sufficient meta-data is provided using the SRDF format to allow the re-use of their LC-MS/MS datasets. 

      We apologize for not providing the reviewer access in our initial submission and have asked the editorial office to forward the PRIDE repository link to all reviewers immediately after receiving the reviews. We did upload a metadata.csv file with the proteomics dataset. This file contains an annotation of all TMT labels to the samples, conditions, and replicates used in the manuscript. It contains similar information to an SRDF format file. In addition, the search output files at the protein and PSM levels have been provided. From our point of view, we have therefore provided all the information necessary to reproduce the analysis.

      (3) Supplementary tables: 

      The supplementary tables are very difficult to navigate. They would become more accessible to readers and non-specialists if they were accompanied by brief legends or "README" tabs and if the headers were more detailed (see, for example, Table S2, what does "ctrl.ratio_Larvae_rep2" exactly refer to? Or Table S6 whose column headers using extensive abbreviations are quite obscure). Similarly, what do columns K to BX in Supplementary Table S1 correspond to? Without more substantial explanations, readers have no way of assessing these data points. 

      We have revised the tables and removed any redundant data columns. We also included detailed explanations of the used abbreviations, both in the headers and in a separate README file. Some of the information was apparently lost during the conversion to pdf files. We will therefore upload the original .xls files when submitting the revised manuscript.

      Reviewer #2 (Public review): 

      This work set out to identify all extracellular matrix proteins and associated factors present within the starlet sea anemone Nematostella vectensis at different life stages. Combining existing genomic and transcriptomic datasets, alongside new mass spectrometry data, the authors provide a comprehensive description of the Nematostella matrisome. In addition, immunohistochemistry and electron microscopy were used to image whole mount and decellularized mesoglea from all life stages. This served to validate the de-cellularization methods used for proteomic analyses, but also resulted in a very nice description of mesoglea structure at different life stages. A previously published developmental cell type atlas was used to identify the cell type specificity of the matrisome, indicating that the core matrisome is predominantly expressed in the gastrodermis, as well as cnidocytes. The analyses performed were rigorous and the results were clear, supporting the conclusions made by the authors. 

      Thank you. We greatly appreciate the positive assessment of our study.

      Reviewer #3 (Public review): 

      Summary: 

      This manuscript by Bergheim et al investigates the molecular and developmental dynamics of the matrisome, a set of gene products that comprise the extracellular matrix, in the sea anemone Nematostella vectensis using transcriptomic and proteomic approaches. Previous work has examined the matrisome of the hydra, a medusozoan, but this is the first study to characterize the matrisome in an anthozoan. The major finding of this work is a description of the components of the matrisome in Nematostella, which turns out to be more complex than that previously observed in hydra. The authors also describe the remodeling of the extracellular matrix that occurs in the transition from larva to primary polyp, and from primary polyp to adult. The authors interpret these data to support previously proposed (Steinmetz et al. 2017) homology between the cnidarian endoderm with the bilaterian mesoderm. 

      Strengths: 

      The data described in this work are robust, combining both transcriptome and proteomic interrogation of key stages in the life history of Nematostella, and are of value to the community. 

      Thank you for your positive assessment of our dataset. 

      Weaknesses: 

      The authors offer numerous evolutionary interpretations of their results that I believe are unfounded. The main problem with extending these results, together with previous results from hydra, into an evolutionary synthesis that aims to reconstruct the matrisome of the ancestral cnidarian is that we are considering data from only two species. I agree with the authors' depiction of hydra as "derived" relative to other medusozoans and see it as potentially misleading to consider the hydra matrisome as an exemplar for the medusozoan matrisome. Given the organismal and morphological diversity of the phylum, a more thorough comparative study that compares matrisome components across a selection of anthozoan and medusozoan species using formal comparative methods to examine hypotheses is required. 

      Specifically, I question the authors' interpretation of the evolutionary events depicted in this statement: 

      "The observation that in Hydra both germ layers contribute to the synthesis of core matrisome proteins (Epp et al. 1986; Zhang et al. 2007) might be related to a secondary loss of the anthozoan-specific mesenteries, which represent extensions of the mesoglea into the body cavity sandwiched by two endodermal layers." 

      Anthozoans and medusozoans are evolutionary sisters. Therefore, the secondary loss of "anthozoan-like mesenteries" in hydrozoans is at least as likely as the gain of this character state in anthozoans. By extension, there is no reason to prefer the hypothesis that the state observed in Nematostella, where gastroderm is responsible for the synthesis of the core matrisome components, is the ancestral state of the phylum. Moreover, the fossil evidence provided in support of this hypothesis (Ou et al. 2022) is not relevant here because the material described in that work is of a crown group anthozoan, which diversified well after the origin of Anthozoa. The phylogenetic structure of Cnidaria has been extensively studied using phylogenomic approaches and is generally well supported (Kayal et al. 2018; DeBiasse et al. 2024). Based on these analyses, anthozoans are not on a "basal" branch, as the authors suggest. The structure of cnidarian phylogeny bifurcates with Anthozoa forming one clade and Medusozoa forming the other. From the data reported by Bergheim and coworkers, it is not possible to infer the evolutionary events that gave rise to the different matrisome states observed in Nematostella (an anthozoan) and hydra (a medusozoan). Furthermore, I take the observation in Fig 5 that anthozoan matrisomes generally exhibit a higher complexity than other cnidarian species to be more supportive of a lineage-specific expansion of matrisome components in the Anthozoa, rather than those components being representative of an ancestral state for Cnidaria. Whatever the implication, I take strong issue with the statement that "the acquisition of complex life cycles in medusozoa, that are distinguished by the pelagic medusa stage, led to a secondary reduction in the matrisome repertoire." 
      There is no causal link in any of the data or analyses reported by Bergheim and co-workers to support this statement and, as stated above, while we are dealing with limited data, insufficient to address this question, it seems more likely to me that the matrisome expanded in anthozoans, contrasting with the authors' conclusions. While the discussion raises many interesting evolutionary hypotheses related to the origin of the cnidarian matrisome, which is of vital interest if we are to understand the origin of the bilaterian matrisome, a more thorough comparative analysis, inclusive of a much greater cnidarian species diversity, is required if we are to evaluate these hypotheses. 

      DeBiasse MB, Buckenmeyer A, Macrander J, Babonis LS, Bentlage B, Cartwright P, Prada C, Reitzel AM, Stampar SN, Collins A, et al. 2024. A Cnidarian Phylogenomic Tree Fitted With Hundreds of 18S Leaves. Bulletin of the Society of Systematic Biologists [Internet] 3. Available from: https://ssbbulletin.org/index.php/bssb/article/view/9267

      Epp L, Smid I, Tardent P. 1986. Synthesis of the mesoglea by ectoderm and endoderm in reassembled hydra. J Morphol [Internet] 189:271-279. Available from: https://pubmed.ncbi.nlm.nih.gov/29954165/ 

      Kayal E, Bentlage B, Sabrina Pankey M, Ohdera AH, Medina M, Plachetzki DC, Collins AG, Ryan JF. 2018. Phylogenomics provides a robust topology of the major cnidarian lineages and insights on the origins of key organismal traits. BMC Evol Biol [Internet] 18:1-18. Available from: https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-018-1142-0

      Ou Q, Shu D, Zhang Z, Han J, Van Iten H, Cheng M, Sun J, Yao X, Wang R, Mayer G. 2022. Dawn of complex animal food webs: A new predatory anthozoan (Cnidaria) from Cambrian. The Innovation 3:100195 

      Steinmetz PRH, Aman A, Kraus JEM, Technau U. 2017. Gut-like ectodermal tissue in a sea anemone challenges germ layer homology. Nature Ecology & Evolution 2017 1:10 [Internet] 1:1535-1542. Available from: https://www.nature.com/articles/s41559-017-0285-5

      Zhang X, Boot-Handford RP, Huxley-Jones J, Forse LN, Mould AP, Robertson DL, Li L, Athiyal M, Sarras MP. 2007. The collagens of hydra provide insight into the evolution of metazoan extracellular matrices. J Biol Chem [Internet] 282:6792-6802. Available from: https://pubmed.ncbi.nlm.nih.gov/17204477/ 

      We agree with the reviewer that only the analysis of several additional anthozoan and medusozoan representatives will yield a valid basis for a reconstruction of the ancestral cnidarian matrisome and allow statements about ancestral or novel features within the phylum. We have therefore revised our statements in the discussion by incorporating the cited literature as well as findings from medusozoan genome analyses (e.g. Gold et al., 2018) demonstrating that changes in gene content are as common in anthozoans as in medusozoans, which calls into question the previously stated “basal” state of Nematostella or of anthozoans in general.

      Reviewer #1 (Recommendations for the authors): 

      (1) In Figure 2A, an "o" is missing in the labeling of the "developing cnidcytes" population. 

      Thank you, we have corrected the typo.

      (2) It would be helpful to have the different life stages indicated as headers of the heat maps presented in Figure 4. 

      We have included symbolic representations for the different life stages on top of the heat maps in addition to the respective labels at the bottom.

      Reviewer #2 (Recommendations for the authors): 

      Important changes: 

      (1) Figure 2B The x-axis tissue names should be changed to something more easily readable/understandable - some are clear, but others are not. Perhaps abbreviations could be expanded in the legend. 

      We have expanded the legend in Fig. 2B to render it more easily readable. We have also rotated the maps in A to have them aligned with the ones in Fig. 3B.

      (2) Figure 3B This figure would be improved by the inclusion of cluster names, to understand better the mapping. 

      We have added relevant cluster names to Fig. 3B and as stated above aligned the orientation of the maps in Fig. 2B and Fig. 3B.

      (3) Figure 3C As with 2B, I find the y-axis cnidocyte cell state names to be unclear at times. Perhaps abbreviations could be expanded in the legend. 

      All abbreviations were expanded in the Fig. 3C axis labels.

      (4) Many of the supplementary tables are not well exported or easily readable as is (gene names are truncated, headers truncated, etc), which means that they may not be easily usable by researchers in the field interested in following up on this work in other contexts. Indeed, to be more usable, please consider sharing these supplementary data as .csv files, for example, instead of as .pdfs. 

      We are sorry for this inconvenience, which was obviously caused by the conversion to pdf files. We will upload the original csv files when submitting the revised manuscript.

      Smaller nitpicky comments: 

      (5) Page 2 line 4 & page 3 line 7: Please consider a term other than "pre-bilaterian". The drawing/ordering of a phylogeny of extant species is not meaningful in terms of more or less ancestral. e.g. if the tips are flipped in the drawing of the tree, can we say that bilaterians are pre-cnidarians? What does that mean? 

      We have used that term on the basis that cnidarians existed before the appearance of bilaterians according to the fossil record and molecular phylogenies (McFadden et al., 2021; Adoutte et al., 2000; Cavalier-Smith et al., 1996; Collins, 1998; Kim et al., 1999; Medina et al., 2001; Wainright et al., 1993). To acknowledge remaining uncertainties in the timing of origin of animals, we will use the term “early-diverging metazoans” instead, which is widely accepted in the cnidarian community.

      (6) Page 3 line 9 I was confused by the use of "gastrula-shaped body" to describe cnidarians, which are on the whole very morphologically diverse and don't all resemble gastrulae (that can also be quite diverse). 

      This term is sometimes used to refer to the diploblastic cnidarian body plan (outer ectoderm, inner endoderm) with a mouth that corresponds to the blastopore. To avoid misunderstandings, we changed it in the revised manuscript to “Cnidarians, the sister group to bilaterians, are characterized by a simple body plan with a central body cavity and a mouth opening surrounded by tentacles.”

      Reviewer #3 (Recommendations for the authors): 

      (1) In general, I felt there was a lot of discussion about protein structure and diversity that is difficult to follow without a figure. I think some of the information in Supplementary Figures S5, S9, and S11 should be in the main figures. 

      Following the reviewer’s suggestion, we have integrated Fig. S5 (collagens) into the main Fig. 2 and Fig. S9 (polydoms) into Fig. 4. As metalloproteases are not extensively discussed in the manuscript (and also due to the large size of the figure) we have kept Fig. S11 as a supplementary figure.

      (2) Page 3, Line 7: The use of the term "pre-bilaterian" is inappropriate. Cnidarians and bilaterians are evolutionary sisters. Therefore, each lineage derives from the same split and is the same age. The cnidarian lineage is not older than the bilaterian lineage. 

      Following a similar request by reviewer 2, we have replaced this term with “early-diverging metazoans”.

      (3) Page 5, Line 10. How were in silico matrisomes from early-branching metazoan species predicted? 

      We applied the same bioinformatic pipeline as for the Nematostella matrisome. We clarified this in the respective methods part.

      (4) Page 16, Line 8: This should be Thus. 

      Obviously, the wording of this sentence was ambiguous. We changed it to “In contrast, the adult mesoglea is significantly enriched in elastic fiber components, such as fibrillins and fibulin. This compositional shift likely adds to the visco-elastic properties (Gosline 1971a, b) of the growing body column (Fig. 4B,D, supplementary table S7).”

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer 1:

      While BAP1 mutant UM cell lines were included for some of the experiments, it seems the in vivo data mentioned in the response to the reviewer's comment is missing? The authors stated that "MP46 (Supplementary Fig. 3a) is BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor." But the CDX model data shown in Figure 4 is from 92.1 cells. If this data is available, then the manuscript would benefit from its addition.

      We thank the reviewer for bringing this to our attention. As the reviewer mentioned, we show the 92-1 CDX model in our manuscript. Additionally, strong tumor growth inhibition in the MP46 CDX model treated with our BAF ATPase inhibitor can be found in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/).

      Reviewer 3:

      Supplementary Figure 2C

      Is the T910M mutation in the parental MP41 cells heterozygous? If so, the authors should indicate this in the figure legend. If this is a homozygous mutation, the authors should explain how the inhibitors suppress SMARCA4 activity in cells that have a LOF mutation.

      We thank the reviewer for bringing this to our attention. We updated the figure legend accordingly to reflect the genotype of the mutations highlighted in the table.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The presented study by Centore and colleagues investigates the inhibition of BAF chromatin remodeling complexes. The study is well-written, and includes comprehensive datasets, including compound screens, gene expression analysis, epigenetics, as well as animal studies. This is an important piece of work for the uveal melanoma research field, and sheds light on a new inhibitor class, as well as a mechanism that might be exploited to target this deadly cancer for which no good treatment options exist.

      Strengths:

      This is a comprehensive and well-written study.

      Weaknesses:

      There are minimal weaknesses.

      We thank the reviewer for the positive comments.

      Reviewer #2 (Public Review):

      Summary:

      The authors generate an optimized small molecule inhibitor of SMARCA2/4 and test it in a panel of cell lines. All uveal melanoma (UM) cell lines in the panel are growth-inhibited by the inhibitor, making them the focus of the paper. This inhibition is correlated with the loss of promoter occupancy of key melanocyte transcription factors, e.g. SOX10. SOX10 overexpression and a point mutation in SMARCA4 can rescue growth inhibition exerted by the SMARCA2/4 inhibitor. Treatment of a UM xenograft model results in growth inhibition and regression, which correlates with reduced expression of SOX10 but no discernible toxicity in the mice. Collectively the data suggest a novel treatment of uveal melanoma.

      Strengths:

      There are many strengths of the study including the strong challenge of the on-target effect, the assays used, and the mechanistic data. The results are compelling as are the effects of the inhibitor. The in vivo data is dose-dependent and doses are low enough to be meaningful and associated with evidence of target engagement.

      Weaknesses:

      The authors introduce the field stating that SMARCA4 inhibitors are more effective in SMARCA2 deficient cancers and the converse. Since the desirable outcome of cancer therapy would be synthetic lethality it is not clear why a dual inhibitor is desirable. Wouldn't this be associated with more side effects? It is not known how the inhibitor developed here impacts normal cells, in particular T cells which are essential for any durable response to cancer therapies in patients. Another weakness is that the UM cell lines used do not molecularly resemble metastatic UM. These UM most frequently have mutations in the BAP1 tumor suppressor gene. It is not clear if the described SMARCA2/4 inhibitor is efficacious in BAP1 mutant UM cell lines in vitro or BAP1 mutant patient-derived xenografts in vivo.

      We thank the reviewer for their insightful and constructive comments. As we demonstrate in Fig. 1d, uveal melanoma cells are selectively and deeply sensitive to BAF ATPase inhibition, which provides a therapeutic window. This is confirmed in Fig. 4a-c, where we demonstrate robust tumor growth inhibition, achieved at a dose that was well tolerated in the xenograft study. FHD-286, a dual BRM/BRG1 inhibitor similar to FHT-1015 with optimized physical properties, has been evaluated in a Phase I trial in patients with metastatic uveal melanoma (NCT04879017), and a manuscript describing the results of this clinical trial is currently in preparation.

      As the reviewer mentioned, BAP1 loss is a signature of metastatic uveal melanoma. MP38 is a BAP1 mutant uveal melanoma cell line, and we demonstrated growth inhibition and robust caspase 3/7 activity in response to FHT-1015 (Supplementary Fig. 3a and 3f). MP46 (Supplementary Fig. 3a) is a BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor.

      Reviewer #3 (Public Review):

      Summary:

      This manuscript reports the discovery of new compounds that selectively inhibit SMARCA4/SMARCA2 ATPase activity and that work through a different mode than previously developed SMARCA4/SMARCA2 inhibitors. They also demonstrate the anti-tumor effects of the compounds on uveal melanoma cell proliferation and tumor growth. The findings indicate that the drugs exert their effects by altering chromatin accessibility at binding sites for lineage-specific transcription factors within gene enhancer regions. In uveal melanoma, altered expression of the transcription factor SOX10 and of SOX10 target genes underlies the anti-proliferative effects of the compounds. This study is significant because the discovery of new SMARCA4/SMARCA2 inhibitory compounds that can abrogate uveal melanoma tumorigenicity has therapeutic value. In addition, the findings provide evidence for the therapeutic use of these compounds in other transcription factor-dependent cancers.

      Strengths:

      The strengths of this manuscript include biochemical evidence that the new compounds are selective for SMARCA4/SMARCA2 over other ATPases and that the mode of action is distinct from a previously developed compound, BRM014, which binds the RecA lobe of SMARCA2. There is also strong evidence that FHT1015 suppresses uveal melanoma proliferation by inducing apoptosis. The in vivo suppression of tumor growth without toxicity validates the potential therapeutic utility of one of the new drugs. The conclusion that FHT1015 primarily inhibits SMARCA4 activity and thereby suppresses chromatin accessibility at lineage-specific enhancers is substantiated by ATAC-seq and ChIP-seq studies.

      Weaknesses:

      The weaknesses include a lack of more precise information on which SMARCA4/SMARCA2 residues the drugs bind. Although the I1173M/I1143M mutations are evidence that the critical residues for binding reside outside the RecA lobe, this site is conserved in CHD4, which is not affected by the compounds. Hence, this site may be necessary but not sufficient for drug binding or specifying selectivity. A more precise evaluation of the region specifying the effect of the new compounds would strengthen the evidence that they work through a novel mode and that they are selective. Another concern is that the mechanisms by which FHT1015 promotes apoptosis rather than simply cell cycle arrest are not clear. Does SOX10 or another lineage-specific transcription factor underlie the apoptotic effects of the compounds?

      We thank the reviewer for the valuable comments.

      We believe that our dual ATPase inhibitor is selective and additional insights into binding specificity and selectivity for earlier stage compounds of this series were recently published in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/).

      The reviewer also poses a great question regarding the mechanism of apoptosis. The mechanism of apoptosis is extremely complex, but we observed a decrease in pro-survival BCL-2 protein expression in response to FHT-1015, in the experiment corresponding to Supplementary Fig. 5e. In the experiment described in Fig. 3k, we also monitored caspase 3/7 activity over time, and SOX10 overexpression rescued 92-1 cells from FHT-1015-induced apoptosis. This suggests a role for SOX10 as an important mediator of the response to BAF ATPase inhibition, including apoptosis induced by FHT-1015.

      Additional Reviews:

      The referees would like to draw the authors' attention to the following issues that would best benefit from additional revision. 

      The clinical relevance of the study would be strengthened by the use of uveal melanoma cell lines with BAP1 mutations that better represent metastatic uveal melanoma. The use of patient-derived xenografts would also be pertinent and would be a useful addition. Similarly, attention to the effects of the inhibitor on non-cancerous proliferative cells such as blood/T/immune cells would also strengthen the manuscript. As the study reports the administration of one of the inhibitors in mice for the xenograft experiments, it would be important to assess any potential effects on blood cell counts and better discuss the eventual toxicity or lack of toxicity and how it was assessed. 

      The authors should better explain how SOX10 over expression can rescue viability in the presence of the inhibitor. Similarly given the critical roles of BRG1, SOX10, and MITF in cutaneous melanoma some specific discussion on the sensitivity of cutaneous melanoma cells to the inhibitor should be considered, and potential differences with uveal melanoma highlighted. 

      Aside from these issues, the authors are urged to consider the other points mentioned below. 

      Reviewer #1 (Recommendations For The Authors): 

      Figure 1d, as well as the text in the manuscript referring to this figure, would benefit from indicating specific cell lines used for UM. The same for the sentence in line 153. 

      We thank the reviewer for bringing this to our attention. We have added the cell line names and updated the manuscript accordingly.

      For any of the studies conducted, is there any link with the genetics of UM? E.g. BAP1 wildtype/BAP1 mutant? 

      As addressed above in the public review section, MP38 is a BAP1 mutant uveal melanoma cell line, and we demonstrated growth inhibition and robust caspase 3/7 activity in response to FHT-1015 (Supplementary Fig. 3a and 3f). MP46 (Supplementary Fig. 3a) is a BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor.

      Row 191 - How were peaks classified as enhancer-occupied? 

      We used the annotatePeaks function of the HOMER package to annotate genomic locations, and H3K27ac ChIP-seq to annotate peaks as enhancer-occupied. We thank the reviewer for pointing this out and have updated the manuscript accordingly to include this information.

      Row 259, the two cell lines should be named, also in Figure 3i. 

      We have added the cell line names and updated the manuscript accordingly.

      Reviewer #2 (Recommendations For The Authors): 

      As a proof of concept, this study is truly excellent and the authors should be commended. However, it is desirable that new knowledge in cancer is translated to the clinic. To this end there are a few things needed to strengthen the study. 

      I am rephrasing my statements from the public review to say that I would recommend testing the inhibitor in T cells (side effects) and BAP1 mutant cell lines (for clinical relevance). 

      As addressed in the public review section, MP38 is a BAP1 mutant uveal melanoma cell line, and we demonstrated growth inhibition and robust caspase 3/7 activity in response to FHT-1015 (Supplementary Fig. 3a and 3f). MP46 (Supplementary Fig. 3a) is a BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor.

      Regarding concerns about any potential side effect on T cells, we observed an increase in both CD4 and CD8 T-cell populations in the peripheral blood and the spleen when naïve, non-tumor-bearing CD-1 mice were dosed with the SMARCA2/4 dual ATPase inhibitor FHD-286 once daily for 14 days. FHD-286 is a compound similar to FHT-1015 described in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/). In addition, FHD-286 has been tested in tumor-bearing syngeneic models. When B16F10 tumor-bearing C57BL/6 mice were dosed with FHD-286 for 10 days, we observed an increase in CD69+ activated CD8 T-cell infiltration in the tumor microenvironment (doi:10.1136/jitc-2022-SITC2022.0888).

      Reviewer #3 (Recommendations For The Authors): 

      (1) Determine drug binding by crystal structure or generate additional SMARCA4 or SMARCA2 mutations in the region near I1173/I1143 that are not conserved in CHD4 and test them in an ATPase assay for effects on drug inhibition. For example, Q1166 in SMARCA4 and Q1136 in SMARCA2 could be changed to alanine as in CHD4. Would this abrogate drug inhibition?

      We believe that our dual ATPase inhibitor is selective and additional insights into binding specificity and selectivity for earlier stage compounds of this series were recently published in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/).

      (2) The finding that SOX10 can rescue the antiproliferative effects of FHT1015 suggests that SMARCA4 is primarily needed for SOX10 expression. However, the co-occupancy of SMARCA4 and SOX10 at enhancers suggests that they cooperate to promote chromatin accessibility. It is unclear how over-expression of SOX10 can promote chromatin accessibility in drug-inhibited cells since SOX10 does not have chromatin remodeling activity. ATAC-seq in cells over-expressing SOX10 and treated with the drug could identify SOX10-dependent targets that do not require SMARCA4 activity and clarify the mechanism. It would also be informative to determine if SOX10 over-expression abrogates the effects of FHT1015 on both cell cycle and apoptosis, helping to resolve whether it is a partial or complete rescue of proliferation. 

      We agree that running ATAC-seq in cells overexpressing SOX10 would clarify this mechanism. However, shifts in corporate strategy deprioritized any further experiments for this project. One potential mechanism by which SOX10 overexpression can partially rescue the BAF inhibition phenotype is through overexpressed SOX10 localizing to open chromatin regions (mostly promoters) across the genome. We know from our ATAC-seq data (Fig. 2) that BAF inhibition leads to loss of chromatin accessibility at SOX10 enhancer sites, while promoter regions are only partially affected. Therefore, we think that overexpression of SOX10 would allow upregulation of its target genes via binding to the promoter regions. In this model, the enhancer-driven SOX10 target genes are likely to remain silenced.

      (3) Although the in vivo studies indicate that the drugs are well-tolerated, additional in vitro studies to determine the effects of the drug on the proliferation/survival of non-cancerous cells would further validate their therapeutic utility.

      Author Response: The reviewer raises a critical question. FHD-286, a dual BRM/BRG1 inhibitor similar to FHT-1015 with optimized physical properties, has been evaluated in a Phase I trial in patients with metastatic uveal melanoma (NCT04879017), and it was well tolerated at a continuous daily dose of up to 7.5 mg QD and at an intermittent dose of up to 17.5 mg QD. A manuscript describing the results of this clinical trial is currently in preparation.

    1. Author response:

      Reviewer #1 (Public review):

      It appears obvious that with no or a little fitness penalty, it becomes beneficial to have MHC-coding genes specific to each pathogen. A more thorough study that takes into account a realistic (most probably non-linear in gene number) fitness penalty, various numbers of pathogens that could grossly exceed the self-consistent fitness limit on the number of MHC genes, etc, could be more informative.

      The reviewer seems to be referring to the cost of excessively high presentation breadth.  Such a cost is irrelevant to the inferior fitness of a polymorphic population with heterozygote advantage compared to a monomorphic population with merely doubled gene copy number.  It is relevant to the possibility of a fitness valley separating these two states, but this issue is addressed explicitly in the manuscript.

      An addition or removal of one of the pathogens is reported to affect "the maximum condition", a key ecological characteristic of the model, by an enormous factor 10^43, naturally breaking down all the estimates and conclusions made in [RS]. This observation is not substantiated by any formulas, recipes for how to compute this number numerically, or other details, and is presented just as a self-standing number in the text.

      It is encouraging that the reviewer agrees that this observation, if correct, would cast doubt on the conclusions of Siljestam and Rueffler.  I would add that it is not the enormity of this factor per se that invalidates those conclusions, but the fact that the automatic compensatory adjustment of c<sub>max</sub> conceals the true effects of removing a pathogen, which are quite large.

      I am not sure why the reviewer doubts that this observation is correct.  The factor of 2.7∙10<sup>43</sup> was determined in a straightforward manner in the course of simulating the symmetric Gaussian model of Siljestam and Rueffler with the specified parameter values.  A simple way to determine this number is to have the simulation code print the value to which c<sub>max</sub>  is set, or would be set, by the procedure of Siljestam and Rueffler for different parameter values.  In another section of this response I will describe how to do this with the simulation code written and used by Siljestam and Rueffler; doing so confirms the value that I obtained with my own code.  Furthermore, I will now give a theoretical derivation of this factor.

      As specified by Siljestam and Rueffler, the positions of the m pathogens in (m-1)-dimensional antigenic space correspond to the vertices of a regular simplex centered at the origin, with distance between vertices equal to 1.  The squared distance from the origin to each of the m vertices of such a simplex is (m-1)/2m (https://polytope.miraheze.org/wiki/Simplex).  Thus, the sum of the m squared distances is (m-1)/2.  For the (0, 0) homozygote, condition is multiplied by a factor of exp(-(vr)<sup>2</sup>/2) for each pathogen, where r is the distance from the origin.  It follows that, with v=20, all the pathogens together decrease condition by a factor of exp(20<sup>2</sup>∙(m-1)/4) = exp(100∙(m-1)).  Thus, increasing or decreasing m by 1 changes this value by a factor of exp(100) = 2.7∙10<sup>43</sup>.
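
      The derivation above can be checked numerically with a short Python sketch. This is an independent illustration, not the simulation code of Siljestam and Rueffler nor my own; the function names are hypothetical. It assumes only what is stated above: m pathogens at the vertices of a regular simplex with unit edge length centered at the origin, and a per-pathogen factor of exp(-(vr)^2/2) for the (0, 0) homozygote. For convenience the simplex is embedded in m dimensions (scaled basis vectors minus their centroid), which leaves all distances unchanged.

```python
import math
from itertools import combinations


def simplex_vertices(m):
    """Vertices of a regular simplex with unit edge length, centered at the origin.

    The m points are embedded in R^m: scaled standard basis vectors with their
    centroid subtracted. They span an (m-1)-dimensional hyperplane.
    """
    s = 1.0 / math.sqrt(2.0)   # basis vectors scaled by s are at distance 1
    centroid = s / m           # every coordinate of the centroid
    return [[(s if i == j else 0.0) - centroid for j in range(m)]
            for i in range(m)]


def log_condition_penalty(m, v):
    """Natural log of the factor by which all m pathogens together reduce the
    condition of the homozygote located at the origin."""
    total_sq_dist = sum(sum(x * x for x in p) for p in simplex_vertices(m))
    return v ** 2 * total_sq_dist / 2.0


if __name__ == "__main__":
    for m in (7, 8, 9):
        print(m, log_condition_penalty(m, 20))
    print(math.exp(100))
```

      With v=20, the logarithm of the penalty evaluates to 100∙(m-1), so each change of m by 1 shifts it by 100, corresponding to the factor exp(100) discussed above.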

      This begs the conclusion that the branching remains robust to changes in c_max that span 4 decades as well.

      That shows only that the results are not extremely sensitive to c<sub>max</sub> or K.  They are, nonetheless, exquisitely sensitive to m and v.  This difference in sensitivities is the reason that a relatively small change to m leads to such a large compensatory change in c<sub>max</sub>, a change large enough to have a major effect on the results.

      As I wrote above, there is no explanation behind this number, so I can only guess that such a number is created by the removal or addition of a pathogen that is very far away from the other pathogens. Very far in this context means being separated in the x-space by a much greater distance than 1/\nu, the width of the pathogens' gaussians. Once again, I am not totally sure if this was the case, but if it were, some basic notions of how models are set up were broken. It appears very strange that nothing is said in the manuscript about the spatial distribution of the pathogens, which is crucial to their effects on the condition c.

      I did not explicitly describe the distribution of pathogens in antigenic space because it is exactly the same as in Siljestam and Rueffler, Fig. 4: the vertices of a regular simplex, centered at the origin, with unity edge length.

      The number in question (2.7∙10<sup>43</sup>) pertains to the Gaussian model with v=20.  As specified by Siljestam and Rueffler, each pathogen lies at a distance of 1 from every other pathogen, so the distance of any pathogen from the others is indeed much greater than 1/v.  This condition holds, however, for most of the parameter space explored by Siljestam and Rueffler (their Fig. 4), and for all of the parameter space that seemingly supports their conclusions.  Thus, if this condition indicates that “basic notions of how models are set up were broken”, they must have been broken by Siljestam and Rueffler.

      Overall, I strongly suspect that an unfortunately poor setup of the model reported in the manuscript has led to the conclusions that dispute the much better-substantiated claims made in [SD].

      The reviewer seems to be suggesting that my simulations are somehow flawed and my conclusions unreliable.  I will therefore describe how my conclusions about sensitivity to parameter values can be verified using the simulation code provided by Siljestam and Rueffler themselves, with only small, easily understood modifications.  I will consider adding this description as a supplement when I revise the manuscript.

      The starting point is the Matlab file MHC_sim_Dryad.m, available at https://doi.org/10.5061/dryad.69p8cz98j.  First, we can add a line that prints the value of the variable logcmax, which represents the natural logarithm of cmax determined and used by the code.  Below line 116 (‘prework’), add the line ‘logcmax’ (with no semicolon).

      Now, at the Matlab prompt, execute MHC_sim_Dryad(false, 8, 20, 1) to run the simulation for the Gaussian model with m=8, v=20, and K=1.  The output will indicate that logcmax=700, in accord with the theoretical factor exp(100*(m-1)) derived above.  The allelic diversity, n<sub>e</sub>, will rise to a steady-state level of about 140, as in the red curve of my Fig. 2.

      Now lower m to 7, i.e., run MHC_sim_Dryad(false, 7, 20, 1).  The output will indicate that logcmax=600.  This confirms that lowering m by 1 causes the code to lower the value of c<sub>max</sub> by a factor exp(100)=2.7∙10<sup>43</sup>, which must also be the factor by which the condition of the most fit homozygote would increase without this adjustment.

      With the change of m to 7 and the compensatory change in c<sub>max</sub>, steady-state allelic diversity remains high.  But what if m changes but c<sub>max</sub> remains the same, as it would in reality?

      To find out, we can fix the value of c<sub>max</sub> to the value used with m=8 by adding the following line below the line previously added: ‘logcmax = 700’.  With this additional modification in place, executing MHC_sim_Dryad(false, 7, 20, 1) confirms that without a compensatory change to c<sub>max</sub>, lowering m from 8 to 7 mostly eliminates allelic diversity, in accord with the corresponding curve in my Fig. 2.  Similarly, raising m from 8 to 9, or changing v from 20 to 19.5 or 20.5 (executing MHC_sim_Dryad(false, 8, 19.5, 1) or MHC_sim_Dryad(false, 8, 20.5, 1)), largely eliminates diversity, confirming the other results in my Fig. 2.  Results for the bitstring model can also be confirmed, though this requires additional changes to the code.

      Thus, the extreme sensitivity of the results of Siljestam and Rueffler to parameter values can be verified with the code that they used for their simulations, indicating that my conclusions are not consequences of my having done a “poor setup of the model”.

      Response to Reviewer #2 (Public review):

      (1) The statement that the model outcome of Siljestam and Rueffler is very sensitive to parameter values is, in this form, not correct. The sensitivity is only visible once a strong assumption by Siljestam and Rueffler is removed. This assumption is questionable, and it is well explained in the manuscript by J. Cherry why it should not be used. This may be seen as a subtle difference, but I think it is important to pin down the exact nature of the problem (see, for example, the abstract, where this is presented in a misleading way).

      I appreciate the distinction, and the importance of clearly specifying the nature of the problem.  However, Siljestam and Rueffler do not invoke the implausible assumption that changes to the number of pathogens or their virulence will be accompanied by compensatory changes to c<sub>max</sub>.  Rather, they describe the adjustment of c<sub>max</sub> (Appendix 7) as a “helpful” standardization that applies “without loss of generality”.  Indeed, my low-diversity results could be obtained, despite such adjustment, by combining the small change to m or v with a very large change to K (e.g., a factor of 2.7∙10<sup>43</sup>).  In this sense there is no loss of generality, but the automatic adjustment of c<sub>max</sub> obscures the extreme sensitivity of the results to m and v.

      (2) The title of the study is very catchy, but it needs to be explained better in the text.

      I had hoped that the final paragraph of the Discussion would make the basis for the title clear.  I will consider whether this can be clarified in a revision.

    1. Chapter 4: Common Writing Assignments College writing assignments serve a different purpose than the typical writing assignments you completed in high school. The textbook Successful Writing explains that high school teachers generally focus on teaching you to write in a variety of modes and formats, including personal writing, expository writing, research papers, creative writing, and writing short answers and essays for exams. Over time, these assignments help you build a foundation of writing skills. In college, many instructors will expect you to already have that foundation. Your college composition courses will focus on writing for its own sake, helping you make the transition to college-level writing assignments. However, in most other college courses, writing assignments serve a different purpose. In those courses, you may use writing as one tool among many for learning how to think about a particular academic discipline. Additionally, certain assignments teach you how to meet the expectations for professional writing in a given field. Depending on the class, you might be asked to write a lab report, a case study, a literary analysis, a business plan, or an account of a personal interview. You will need to learn and follow the standard conventions for those types of written products. Finally, personal and creative writing assignments are less common in college than in high school. College courses emphasize expository writing, writing that explains or informs. Often expository writing assignments will incorporate outside research, too. Some classes will also require persuasive writing assignments in which you state and support your position on an issue. College instructors will hold you to a higher standard when it comes to supporting your ideas with reasons and evidence. Common Types of College Writing Assignments Below you will find a list of different types of writing assignments you may write as you pursue your academic goals. 
Review each assignment and think about the writing you’ve done in high school and how these assignments might look different in your college composition classes.   Figure 1   After reviewing Figure 1 and the descriptions of various types of writing assignments, watch the following video about the writing process. No matter what type of assignment you are writing, it will be important for you to follow a writing process: a series of steps a writer takes to complete a writing task. Making use of a writing process ensures that you stay organized and focused while allowing you to break up a larger assignment into several distinct tasks. Not every writer follows the same process, and part of the work you will do in your writing classes is to discover the writing process that works best for you. Even though the writing process is often presented as a linear set of steps that writers follow from beginning to end, composition scholars now recognize the recursive nature of writing. In other words, many writers repeat steps in the process and not all writers invest an equal amount of time in each stage. Instead, writers often loop back to individual stages as needed in order to develop and refine their work. As you watch the video below, consider your current writing process (if you have one) and reflect upon how you might develop your process to support your growth as a writer—and to save yourself time and stress when completing college writing assignments. In the previous chapters, we covered college writing at CNM and reading strateg

      The key point is that there are different types of common writing assignments.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This work computationally characterized the threat-reward learning behavior of mice in a  recent study (Akiti et al.), which had prominent individual differences. The authors  constructed a Bayes-adaptive Markov decision process model and fitted the behavioral data  by the model. The model assumed (i) hazard function starting from a prior (with free mean  and SD parameters) and updated in a Bayesian manner through experience (actually no real  threat or reward was given in the experiment), (ii) risk-sensitive evaluation of future  outcomes (calculating lower 𝛼 quantile of outcomes with free 𝛼 parameter), and (iii) heuristic  exploration bonus. The authors found that (i) brave animals had more widespread hazard  priors than timid animals and thereby quickly learned that there was in fact little real threat,  (ii) brave animals may also be less risk-aversive than timid animals in future outcome  evaluation, and (iii) the exploration bonus could explain the observed behavioral features,  including the transition of behavior from the peak to steady-state frequency of bout. Overall,  this work is a novel interesting analysis of threat-reward learning, and provides useful  insights for future experimental and theoretical work. However, there are several issues that I  think need to be addressed.

      Strengths:

      (1) This work provides a normative Bayesian account for individual differences in  braveness/timidity in reward-threat learning behavior, which complements the analysis by  Akiti et al. based on model-free threat reinforcement learning.

      (2) Specifically, the individual differences were characterized by (i) the difference in the  variance of hazard prior and potentially also (ii) the difference in the risk-sensitivity in the  evaluation of future returns.

      Weakness:

      (1) Theoretically the effect of prior is diluted over experience whereas the effect of biased  (risk-aversive) evaluation persists, but these two effects could not be teased apart in the  fitting analysis of the current data.

      (2) It is currently unclear how (or whether) the proposed model corresponds to neurobiological (rather than behavioral) findings, unlike the analysis by Akiti et al.

      We thank reviewer #1 for their useful feedback which we’ve used to improve the discussion,  formatting and clarity of the paper, and for highlighting important questions for future  extensions of our work.

      Major points:

      (1) Line 219

      It was assumed that the exploration bonus was replenished at a steady rate when the animal  was at the nest. An alternative way would be assuming that the exploration bonus slowly  degraded over time or experience, and if doing so, there appears to be a possibility that the  transition of the bout rate from peak to steady-state could be at least partially explained by  such a decrease in the exploration bonus.

      Section 2.2.3 explains the mechanism of the exploration bonus which motivates approach.  We think that the mechanism suggested by the reviewer is, in essence, what is happening in  the model. The exploration pool is indeed depleted over time or bouts of experience at the  object. In the peak confident phase for brave animals and the peak cautious phase for timid  animals, the rate of depletion exceeds the rate of regeneration, since the agent spends only  a single turn at the nest between bouts. In the steady-state phase, the exploration pool has  depleted so much previously that the agent must wait multiple turns at the nest for the pool  to regenerate to a sufficiently high value to justify approaching the object again.

      We have updated section 2.2.3 to explain that agents spend one turn at the nest during peak  phase but multiple turns during steady-state phase. Hopefully, this makes our mechanism  clear:

      “In simulations, when 𝐺(𝑡) is high, the agent has a high motivation to explore the object,  spending only a single turn in the nest state between bouts. In other words, the depletion  from 𝐺0 substantially influences the time point at which approach makes a transition from  peak to steady-state; the steady-state time then depends on the dynamics of depletion  (when at the object) and replenishment (when at the nest). In particular, in the steady-state  phases, the agent must wait multiple turns at the nest for 𝐺(𝑡)  to regenerate so that  informational reward once again exceeds the potential cost of hazard.“
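
      The depletion and replenishment dynamics described above can be illustrated with a toy pool. This is a minimal sketch and not the model in the paper: the additive update rule, the fixed approach threshold, and every name and value (`g0`, `depletion`, `replenish`, `threshold`) are assumptions chosen only to show why nest waits lengthen once the pool has run down.

```python
def nest_turns_per_bout(g0, depletion, replenish, threshold, n_bouts):
    """Toy exploration pool: return the nest turns taken before each new bout.

    The additive dynamics and all parameter names are illustrative assumptions.
    """
    if replenish <= 0 or threshold > g0:
        raise ValueError("pool could never recover to the threshold")
    g = g0
    waits = []
    for _ in range(n_bouts):
        # One bout at the object consumes part of the pool (never below zero).
        g = max(0.0, g - depletion)
        # The agent then stays at the nest, for at least one turn, until the
        # pool is large enough to justify another approach.
        turns = 0
        while True:
            g = min(g0, g + replenish)
            turns += 1
            if g >= threshold:
                break
        waits.append(turns)
    return waits


waits = nest_turns_per_bout(g0=10.0, depletion=3.0, replenish=1.0,
                            threshold=3.0, n_bouts=6)
print(waits)  # [1, 1, 1, 2, 3, 3]
```

      With these assumed values the first bouts are separated by a single nest turn (the peak phase), while later bouts require several turns of regeneration (the steady-state phase).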

      (2) Line 237- (Section 2.2.6, 2.2.7, Figures 7, 9)

      I was confused by the descriptions about nCVaR. I looked at the cited original literature  Gagne & Dayan 2022, and understood that nCVaR is a risk-sensitive version of expected  future returns (equation 4) with parameter α (α-bar) (ranging from 0 to 1) representing risk  preference. Line 269-271 and Section 4.2 of the present manuscript described (in my  understanding) that α was a parameter of the model. Then, isn't it more natural to report  estimated values of α, rather than nCVaR, for individual animals in Section 2.2.6, 2.2.7,  Figures 7, 9 (even though nCVaR monotonically depends on α)? In Figures 7 and 9, nCVaR  appears to be upper-bounded to 1. The upper limit of α is 1 by definition, but I have no idea why nCVaR was also bounded by 1. So I would like to ask the authors to add more detailed  explanations on nCVaR. Currently, CVaR is explained in Lines 237-243, but actually, there is  no explanation about nCVaR rather than its formal name 'nested conditional value at risk' in  Line 237.

      Thank you for pointing out this error. We have corrected the paper to use nCVaR to refer to  the objective and nCVaR's α, or sometimes just α, to refer to the risk sensitivity parameter  and thus the degree of risk sensitivity.
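
      To make the distinction between the objective and its parameter concrete, the sketch below computes a single-step, empirical CVaR: the mean of the worst α-fraction of equally likely outcomes. It is only an illustration under stated assumptions. The nested objective (nCVaR) of Gagne & Dayan applies a risk-sensitive evaluation at every step of a sequential problem, which this static example does not attempt; the function name and the sample outcomes are hypothetical.

```python
import math


def lower_tail_mean(outcomes, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of equally likely outcomes.

    alpha = 1 recovers the ordinary (risk-neutral) mean; smaller alpha focuses
    on the worst outcomes. Static, single-step illustration only.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(outcomes)
    # Number of worst outcomes kept; at least one so the tail is never empty.
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


outcomes = [-4.0, 0.0, 2.0, 6.0]  # hypothetical returns
print(lower_tail_mean(outcomes, 1.0))   # 1.0  (risk neutral)
print(lower_tail_mean(outcomes, 0.5))   # -2.0 (risk averse)
```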

      (3) Line 333 (and Abstract)

      Given that animals' behaviors could be equally well fitted by the model having both nCVaR (free α) and hazard prior and the alternative model having only hazard prior (with α = 1), might it be difficult to confidently claim that brave (/timid) animals had risk-neutral (/risk-aversive) preference in addition to widespread (/low-variance) hazard prior? Then, it might be good to somewhat weaken the corresponding expression in the Abstract (e.g., add 'potentially also' to the result for risk sensitivity) or mention the inseparability of risk sensitivity and prior belief pessimism (e.g., "... although risk sensitivity and prior belief pessimism could not be teased apart").

      Thank you for this suggestion, we have duly weakened the wording in the Abstract to say  “potentially more risk neutral”:

      “Some animals begin with cautious exploration, and quickly transition to confident approach  to maximize exploration for reward; we classify them as potentially more risk neutral, and  enjoying a flexible hazard prior. By contrast, other animals only ever approach in a cautious  manner and display a form of  self-censoring; they are characterized by potential risk  aversion and high and inflexible hazard priors.”

      Reviewer #2 (Public Review):

      Shen and Dayan build a Bayes adaptive Markov decision process model with three key components: an adaptive hazard function capturing potential predation, an intrinsic reward function providing the urge to explore, and a conditional value at risk (CVaR, closely related to probability distortion explanations of risk traits). The model itself is very interesting and has many strengths including considering different sources of risk preference in generating behavior under uncertainty. I think this model will be useful to consider for those studying approach/avoid behaviors in dynamic contexts.

      The authors argue that the model explains behavior in a very simple and unconstrained behavioral task in which animals are shown novel objects and retreat from them in various manners (different body postures and patterns of motor chunks/syllables). The model itself does capture lots of the key mouse behavioral variability (at least on average on a mouse-by-mouse basis) which is interesting and potentially useful. However, the variables in the model - and the internal states it implies the mice have during the behavior - are relatively unconstrained given the wide range of explanations one can offer for the mouse behavior in the original study (Akiti et al). This reviewer commends the authors on an original and innovative expansion of existing models of animal behaviour, but recommends that the authors revise their study to reflect the obvious challenges. I would also recommend toning down the claim that this exercise gives a normative-like or at least quantitative account of mental disorders.

      We thank reviewer #2 for highlighting some of the strengths of our paper as well as pointing  out important limitations of Akiti et al’s original study which we’ve inherited as well as some  limitations of our own method. We address their concerns below.

      We have added a paragraph to the discussion on the limitations of the state representation we adopted from Akiti et al.’s study.

      (Reviewer #1 had the same concern, see above) “Motivated by tail-behind versus  tail-exposed in Akiti et al. (2022), we model approach using a dichotomy between cautious  and confident approach states [...]”

      We have reduced the suggestion that our model provides an account of mental disorders in  the abstract.

      Before:

      “On the other hand, “timid” animals, characterized by risk aversion and high and inflexible  hazard priors, display self-censoring that leads to the sort of asymptotic maladaptive  behavior that is often associated with psychiatric illnesses such as anxiety and depression.”

      After:

      “By contrast, other animals only ever approach in a cautious manner and display a form of  self-censoring; they are characterized by potential risk aversion and high and inflexible  hazard priors. “

      My main comment is that this paper is a very nice model creation that can characterize the heterogeneity of rodent behavior in a very simple approach/avoid context (Akiti et al; when a novel object is placed in an arena) that itself can be interpreted in a multitude of ways. The use of terms like "exploration", "brave", etc in this context is tricky because the task does not allow the original authors (Akiti et al) to quantify these "internal states" or "traits" with the appropriate level of quantitative detail to say whether this model is correct or not in capturing the internal states that result in the rodent behavior. That said, the original behavioral setup is so simple that one could imagine capturing the behavioral variability in multiple ways (potentially without evoking complex computations that the original authors never showed the mouse brain performs). I would recommend reframing the paper as a new model that proposes a set of internal states that could give rise to the behavioral heterogeneity observed in Akiti et al, but nonetheless is at this time only a hypothesis. Furthermore, an explanation of what would be really required to test this would be appreciated to make the point clearer.

      We thought very hard about using terms that might be considered to be anthropomorphic such as ‘timid’ and ‘brave’. We are, of course, aware of the concerns articulated by investigators such as LeDoux about this. However, we think that, provided that we are clear on the first appearance (using ‘scare’ quotes) that we are using them as labels for latent characteristics that capture correlations in various aspects of behaviour, they are more helpful than harmful in making our descriptions understandable.

      Reviewer #3 (Public Review):

      Summary:

      The manuscript presents computational modelling of the behaviour of mice during encounters with novel and familiar objects, originally reported by Akiti et al. (Neuron 110, 2022). Mice typically perform short bouts of approach followed by a retreat to a safe distance, presumably to balance exploration to discover possible rewards with the potential risk of predation. However, there is considerable heterogeneity in this exploratory behaviour, both across time as an individual subject becomes more confident in approaching the object, and across subjects; with some mice rapidly becoming confident to closely explore the object, while other timid mice never become fully confident that the object is safe. The current work aims to explain both the dynamics of adaptation of individual animals over time, and the quantitative and qualitative differences in behaviour between subjects, by modelling their behaviour as arising from model-based planning in a Bayes adaptive Markov Decision Process (BAMDP) framework, in which the subjects maintain and update probabilistic estimates of the uncertain hazard presented by the object, and rationally balance the potential reward from exploring the object with the potential risk of predation it presents.

      In order to fit these complex models to the behaviour the authors necessarily make  substantial simplifying assumptions, including coarse-graining the exploratory behaviour into  phases quantified by a set of summary statistics related to the approach bouts of the animal.  Inter-individual variation between subjects is modelled both by differences in their prior  beliefs about the possible hazard presented by the object and by differences in their risk  preference, modelled using a conditional value at risk (CVaR) objective, which focuses the  subject's evaluation on different quantiles of the expected distribution of outcomes.  Interestingly these two conceptually different possible sources of inter-subject variation in  brave vs timid exploratory behaviour turn out not to be dissociable in the current dataset as  they can largely compensate for each other in their effects on the measured behaviour.  Nonetheless, the modelling captures a wide range of quantitative and qualitative differences  between subjects in the dynamics of how they explore the object, essentially through  differences in how subject's beliefs about the potential risk and reward presented by the  object evolve over the course of exploration, and are combined to drive behaviour.

      Exploration in the face of risk is a ubiquitous feature of the decision-making problem faced  by organisms, with strong clinical relevance, yet remains poorly understood and  under-studied, making this work a timely and welcome addition to the literature.

      Strengths:

      (1) Individual differences in exploratory behaviour are an interesting, important, and  under-studied topic.

      (2) Application of cutting-edge modelling methods to a rich behavioural dataset, successfully accounting for diverse quantitative and qualitative features of the data in a normative framework.

      (3) Thoughtful discussion of the results in the context of prior literature.

      Limitations:

      (1) The model-fitting approach used of coarse-graining the behaviour into phases and fitting  to their summary statistics may not be applicable to exploratory behaviours in more complex  environments where coarse-graining is less straightforward.

      (2) Some aspects of the work could be more usefully clarified within the manuscript.

      We thank reviewer #3 for their positive feedback and helping us to improve the clarity of our  paper. We have added discussion they thought was missing.

      Reviewer #1 (Recommendations for the authors):

      (1) Line 25-28

      This part of the Abstract might give an impression that timidity (but not braveness) is  potentially associated with psychiatric illness and even that timidity is thus inferior to  braveness. However, even though extreme timidity might indeed be associated with anxiety  or depression, extreme braveness could also be associated with other psychiatric or  behavioral problems. Moreover, as a population, the existence of both timid and brave  individuals could be advantageous, and it could be a reason why both types of individuals  evolutionarily survived in the case of wild animals (although Akiti et al. used mice, which may  have no or very limited genetic varieties, and so things may be different). So I would like to  encourage the authors to elaborate on the expression of this part of the Abstract and/or  enrich the related discussion in the Discussion.

      This is an important point. We note on line 38 that excessive novelty seeking (potentially  caused by excessive braveness) could also be maladaptive.

      Additionally, we have added a paragraph to the discussion discussing heterogeneity in risk  sensitivity within a population.

      “Our data show that there is substantial variation in the degrees of risk sensitivity across the mice. Previous works have reported substantial interpopulation and intrapopulation differences in risk-sensitivity in humans which depend on gender, age, socioeconomic status, personality characteristics, wealth and culture (Rieger et al., 2015; Frey et al., 2017). Despite the normative appeal of 𝛼 = 1, it is possible that a population may benefit from including individuals with 𝛼 different from 1.0 or highly negative priors. For example, more cautious individuals could learn from merely observing the risky behavior of less cautious individuals. Furthermore, we have only considered risk-sensitivity under epistemic uncertainty in our work. Risk averse individuals, for instance with 𝛼 < 1, may be more successful than risk-neutral agents in environments where there are unexpected dangers (unknown unknowns). Risk-aversion is thus a temperament of ecological and evolutionary significance (Réale et al., 2007).”

      (2) Line 149

      Section 2.2 consists of eight subsections. I think this organization may not be very  appealing, because there are a bit too many subsections, and their relations are not  immediately clear to readers. So I would like to encourage the authors to make an  elaboration. For example, since 2.2.1 - 2.2.5 describes a summary of model construction  and model fitting whereas 2.2.6-2.2.8 shows the results, it could be good to divide these into  separate sections (2.2.1 - 2.2.5 and 2.3.1 - 2.3.3).

      Thank you for pointing this out. We’ve renumbered the sections as you’ve suggested.

      (3) Line 347-8

      Theoretically, the effect of prior is diluted over experience whereas the effect of biased  (risk-aversive) evaluation persists, as the authors mentioned in Lines 393-394. Then isn't it  possible to consider environments/conditions in which the two effects can be separated?

      We appreciate this suggestion. Indeed, our original thought in modeling this experiment was  that this would be exactly the case here - with epistemic uncertainty reducing as the object  became more familiar. However, proving to an animal that a single environment is  completely stationary/fixed is hard - reflected in our conclusion here that the exploration  bonus pool replenishes. Thus, we argued in the discussion that a series of environments  would be necessary to separate risk sensitivity from priors.

      (4) Line 407

      It would be nice to add a brief phrase explaining how (in what sense) this model's  assumption was consistent with the reported behavior. Also, should the assumption of  having two discrete approach states (cautious and confident) itself be regarded as a  limitation of the model? If the tail-behind and tail-exposure approaches were not merely  operationally categorized but were indicated to be two qualitatively distinct behaviors in the  experiment by Akiti et al., it is reasonable to model them as two discrete states, but  otherwise, the assumption of two discrete states would need to be mentioned as a  simplification/limitation.

      We have now removed line 407 and added a paragraph to the discussion on the limitations of the tail-behind and tail-exposed state representation: “Motivated by tail-behind versus tail-exposed in Akiti et al. (2022), we model approach using a dichotomy between cautious and confident approach states. This is likely a crude approximation to the continuous and multifaceted nature of animal approach behavior. For example, during approach animals likely adjust their levels of vigilance continuously (or discretely; Lloyd and Dayan (2018)) to monitor threat, and choose different velocities for movement, and different attentional strategies for inspecting the novel object. We hope future works will model these additional behavioral complexities, perhaps with additional internal states, and corroborate these states with neurobiological data.”

      (5) Line 418

      The authors contrasted their model-based analyses with the model-free analyses of Akiti et  al. Another aspect of differences between the authors' model and the model of Akiti et al. is  whether it is normative or mechanistic: while how the model of Akiti et al. can be biologically  implemented appears to be clear (TS dopamine represents threat TD error, and TS  dopamine-dependent cortico-striatal plasticity implements TD error-based update of  model-free threat prediction), biological implementation of the authors' model seems more  elusive. Given this, it might be a fruitful direction to explore how these two models can be  integrated in the future.

      We enthusiastically agree that it would be most interesting in the future to explore the integration of the two models - and, in the discussion (Lines 537-548, 454-461), point to some first steps that might be fruitful along these lines. There are two separate considerations here: one is that our account is mostly computational and algorithmic, whereas Akiti’s model is mostly algorithmic and implementational; the second is, as noted by the reviewer, that our account is model-based, whereas Akiti’s model is model-free (in the sense of reinforcement learning; RL). These are related - thanks in no small part to the work from the group including Akiti, we know a lot more about the implementation of model-free than model-based RL. However, our model-based account does reach additional features of behavior not captured in Akiti et al.’s model such as bout duration, frequency, and approach type. Thus, the temptation of unification.

      (6) Line 426

      Related to the previous point, it would be nice to more specifically describe what variable TS  dopamine can represent in the authors' model if possible.

      In the discussion (Lines 454-461), we speculate that TS dopamine could still respond to the physical salience of the novel object and affect choices by determining the potential cost of the encountered threat or the prior on the hazard function. For example, perhaps ablating TS dopamine reduces the hazard priors which leads to faster transition from cautious to confident approach and longer bout durations, consistent with the optogenetics behavioral data reported in Akiti et al.

      Reviewer #2 (Recommendations for the authors):

      My guess is simpler versions of the model would not fit the data well. But this does not mean for example that the mice have probability distortions (CVaR) or that even probabilistic reasoning and the internal models necessary to support them are acting in the behavioral context studied by Akiti. So related to the above, I would ask what other models would fit and would not fit the data? And what does this mean?

      These are good points. Our model provides an approximately normative account of the  animals’ behavior  in terms of what it achieves relative to a utility function. In practice, the  animals could deploy a precompiled model-free policy (which does not rely on probabilistic  computations) that is exactly equivalent to our model-based policy. With the current  experiment, we cannot conclude whether or not the animals are performing the prospective  calculations in an online manner. Of course, the extent to which animals or humans are  performing probabilistic computations online and have internal models are on-going  questions of study.

      Model comparison is difficult because currently we do not know of any other risk-sensitive exploration models. We cannot directly compare to the model in Akiti et al. since our model explains additional features of behavior: bout duration, frequency, and approach type. Indeed, our model is as simple as it can be in the sense that, with the exception of nCVaR, removing any of the other parameters makes it difficult to fit some animals in our dataset. In the future, our model could be used to fit other datasets of risk-sensitive exploration and, ideally, be compared to other models.

      Explaining why animals avoid the novel object in what the authors call a benign environment is a very tricky issue. In Akiti et al, the readers are not yet convinced that the mice know that this environment is benign. Being placed in an arena with a novel object presents mice with a great uncertainty and we do not know whether they treat this as benign. Therefore, the alternative explanations in this study need to be carefully discussed in light of the limitations of the initial study.

      It is certainly true that it is unclear if the arena is completely benign to the animals. However, the amount of time the animal spends in the center of the arena decreases significantly from habituation to novelty days. This suggests that the animals avoid the novel object largely because of the object itself, rather than the potential danger associated with the arena. Furthermore, the animals are not reported as exhibiting more extreme behaviours such as freezing. In any case, our account is relative in the sense that we are comparing the time the animal spends at the object versus elsewhere in the environment, driven by the relative novelty and relative risk of the environment versus the object. Trying to get more absolute measures of these quantities would require a richer experimental set-up, for instance with different degrees of habituation or experience of the occurrence of (other) novel objects, in general.

      We added a short note to the discussion to explain this:

      “Fourth, we modeled the relative amount of time the animal spends at the object versus  elsewhere in the environment which depends on the differential risk in the two states.  However, it is likely the animals avoid the novel object largely because of the object itself,  rather than the potential danger associated with the arena since they spend much less time  at the center of the arena during novelty than habituation days.”

      Figure 2 - how confident are the authors that each mouse differs from y=1? Related to this,  the behavior in Akiti is very noisy and changes across time. I am not sure if the authors fully  describe at what levels their model captures the behavior vs not in a detailed enough  fashion.

      We have performed a random permutation test on the minute-to-minute data. We have updated Figure 2 so that brave animals for which y>1 is significant under the Benjamini–Hochberg procedure at level q=0.05 are shown as solid green dots, and brave animals that do not pass are shown as hollow dots. 8 out of 11 brave animals passed the Benjamini–Hochberg procedure.
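
      For readers unfamiliar with the Benjamini–Hochberg step-up procedure mentioned here, the following is a minimal sketch of the standard rule: find the largest rank k whose sorted p-value is at most k·q/m, then reject the k smallest p-values. The p-values below are hypothetical and are not those of the animals in Figure 2; the permutation test that would produce per-animal p-values is not reproduced because its details are not given in this response.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) with p_(k) <= k * q / m.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_rank = rank
    # Step-up rule: reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four animals.
print(benjamini_hochberg([0.02, 0.03, 0.035, 0.9], q=0.05))
# [True, True, True, False]
```

      Note the step-up behaviour in this example: the two smallest p-values exceed their own rank thresholds, yet they are rejected because the third-ranked p-value passes.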

      Reviewer #3 (Recommendations for the authors):

      (1) I could not find information in the preprint about code availability. Please consider making  the code public to help others apply these modelling methods.

      We have released code and included the url in the paper in the Methods section.

      (2) Though the manuscript was generally clearly written, there were a number of places  where some additional information or clarification would be useful:

      a) Please define and explain the terms 'tail-behind' and 'tail-exposed' (used to describe  approach bout types) when first used.

      We have added definitions when we first mention these terms:

      “[...] 'tail-behind' (bouts where the animal's nose was closer to the object than the tail for the  entire bout) and 'tail-exposed' (bouts where the animal's tail is closer to the object than the  nose at some point during the bout), associated respectively with cautious risk-assessment  and engagement”

      b) At lines 57-58 when contrasting the 'model-free' account of Akiti et al with the 'model-based' account of the current work, it would be worth clarifying that these terms are  being used in the RL sense rather than e.g. a model-based analysis of the data.  

      We have updated the relevant lines to say “model-free/based reinforcement learning”.

      c) Line 61, the phrase 'the significant long-run approach of timid animals despite having  reached the "avoid" state' is unclear as the 'avoid' state has not been defined.

      We updated the terminology to “avoidance behavior” to be consistent with Akiti et al.  Avoidance refers to the animal routinely avoiding the object and therefore being unable to  learn whether it is safe.

      d) It was not completely clear to me how the coarse-graining of the behaviour was  implemented. Specifically, how were animals assigned to the brave, intermediate, or timid  group, and how were the parameters of the resulting behavioural phases fit?

      Sorry that this was not clear. Section 2.1 explains how the minute-to-minute behavioral data  was coarse-grained and how animal groups were assigned. We have added further  explanation of Figure 2 to the main text:

      “Fig 2 summarizes our categorization of the animals into the three groups: brave, intermediate, and timid based on the phases identified in the animal's exploratory trajectories. Timid animals spend no time in confident approach and are plotted in orange at the origin of Fig 2. Brave animals differ from intermediate animals in that their approach time during the first ten minutes of the confident phase is greater than the last ten minutes (steady-state phase). Brave animals are plotted in green above and intermediate animals are plotted in black below the y=1 line in Fig 2.”

      We also added extra information to outline the goal, and methodology of coarse-graining and  animal grouping:

      “We sought to capture these qualitative differences (cautious versus confident) as well as aspects of the quantitative changes in bout durations and frequencies as the animal learns about their environment. To make this readily possible, we abstracted the data in two ways: averaging bout statistics over time, and clustering the animals into three groups with operationally distinct behaviors.”
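The grouping rule quoted above can be illustrated with a short sketch. It assumes that each animal is summarised by a list of per-minute approach times restricted to its confident-approach phase (empty if the animal never enters that phase), and that ties count as intermediate. The function name `classify_animal` and the `window` parameter are hypothetical, and this is not the authors' implementation.

```python
def classify_animal(confident_minutes, window=10):
    """Assign 'timid', 'brave', or 'intermediate' from per-minute approach times.

    Assumption: `confident_minutes` holds approach time per minute during the
    confident-approach phase only; an empty or all-zero list means the animal
    spent no time in confident approach.
    """
    if sum(confident_minutes) == 0:
        return "timid"
    # Approach time in the first `window` minutes of the confident phase.
    early = sum(confident_minutes[:window])
    # Approach time in the last `window` minutes (steady-state phase).
    # Note: the two windows overlap if the phase is shorter than 2 * window.
    late = sum(confident_minutes[-window:])
    # Brave animals approach more early than late; ties fall to intermediate.
    return "brave" if early > late else "intermediate"
```

For example, `classify_animal([5.0] * 10 + [2.0] * 10)` returns `"brave"` under these assumptions, because the early window sums to more than the late window.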

      e) What purpose does the 'retreat' state serve in the BAMDP model (as opposed to  transitioning directly from 'object' to 'nest' states), and why do subjects not pass through it  following 'detect' states?

      Thank you for pointing this out. We have updated Figure 3 to note that the two “detected  states” also point to the “retreat” state. The reviewer is correct that there could be alternative  versions of the state diagram, and the ‘retreat’ state could indeed have been eliminated.  However, we thought that it was helpful to structure the animal’s progress through state  space.

      f) Why was the hazard function parameterised via the mean and SD at each time step rather  than with a parametric form of the mean and SD as a function of time?

      Since the agent can only spend 2, 3, or 4 turns at the object states, we didn’t see a need to  parameterize the mean and SD as a function of time. Doing so is a good solution to scaling  up the hazard function to more time-steps.

      (3) There were also a couple of points that could potentially be usefully touched on in the  discussion:

      a) What, if any, is the relationship between the CVaR objective and distributional RL? They  seem potentially related due to both focussing on quantiles of the outcome distribution.

      We have added a paragraph to the discussion discussing the connection between  distributional RL and CVaR:

      “CVaR is known to come in different flavors in the case of temporally-extended behavior. Gagne and Dayan (2021) introduces two alternative time-consistent formulations of CVaR: nested CVaR (nCVaR) and precommitted CVaR (pCVaR). nCVaR and pCVaR both enjoy Bellman equations which make it possible to compute approximately optimal policies without directly computing whole distributions of the outcomes. We use nCVaR in this study for its computational efficiency. There is, of course, great current interest in distributional reinforcement learning (Bellemare et al., 2023b) which does acquire such whole distributions, not the least because of prominent observations linking non-linearities in the response functions of dopamine neurons to methods for learning distributions of outcomes (Dabney et al., 2020; Masset et al., 2023; Sousa et al., 2023). One functional motivation for considering entire outcome distributions is the possibility of using them to determine risk-sensitive policies (Gagne and Dayan, 2021).

      While it is possible to compute CVaR directly from return distributions, Gagne and Dayan  (2021) showed that this can lead to temporally inconsistent policies where the agent  deviates from its original plans (the authors called this the fixed CVaR or fCVaR measure).

      Rather further removed from our model-based methods is work from Antonov and Dayan  (2023), who consider a model-free exploration strategy which exploits full return distributions  to compute the value of perfect information which is used as a heuristic for trying actions  with uncertain consequences. Future works can examine risk-sensitive versions of Antonov  and Dayan (2023)'s computationally efficient model-free algorithm as one solution to the  burdensome computations in our model-based method.”
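As a concrete illustration of computing CVaR directly from a return distribution, the sketch below estimates it from a finite sample of returns. It assumes the common lower-tail convention in which alpha = 1 recovers the ordinary mean (risk-neutral) and small alpha approaches the worst sampled outcome. It shows only the single-step quantity; the nested and precommitted variants discussed in the quoted paragraph are not implemented here, and the function name `cvar` is hypothetical.

```python
import math


def cvar(returns, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of sampled returns.

    Assumption: lower-tail convention, alpha in (0, 1]; alpha = 1 gives the mean.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)
    # Keep at least one sample so that very small alpha yields the minimum.
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)
```

With the four sampled returns 1, 2, 3 and 4, this gives 2.5 at alpha = 1 (the plain average), 1.5 at alpha = 0.5, and 1.0 at alpha = 0.25, so lowering alpha places more weight on bad outcomes.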

      b) Why normatively might subjects have non-neutral risk preference as captured by the CVaR?

      We also added a paragraph to the discussion discussing the advantage of heterogeneity in  risk sensitivity within a population:

      (Reviewer #1 had the same question, see above) “Our data show that there is substantial  variation in the degrees of risk sensitivity across the mice.  Previous works have reported  substantial interpopulation and intrapopulation differences in risk-sensitivity in humans which  depend on gender, age, socioeconomic status, personality characteristics, wealth and culture [...]”

      c) Relevance of the current modelling work to clinical conditions characterised by dysregulation of risk assessment (e.g. anxiety or PTSD).

      We’ve added a paragraph to the discussion:

      “Inter-individual differences in risk sensitivity are also of critical importance in psychiatry, reflected in a panoply of anxiety disorders (Butler and Mathews, 1983; Giorgetta et al., 2012; Maner et al., 2007; Charpentier et al., 2017), along with worry and rumination (Gagne and Dayan, 2022). Understanding the spectrum of extreme priors and extreme values of 𝛼 could have therapeutic implications, adding significance to the search for tasks that can more cleanly separate them.”

      d) Is it surprising to see differences in risk preference (nCVaR) between the familiar object  and novel object condition, given that risk preference might be conceptualised as a trait  rather than a state variable?

      Thank you for raising this point. You are right that we expected risk sensitivity (nCVaR alpha)  to be the same between FONC and UONC animals on average. It is difficult to know if alpha  is higher for FONC than UONC animals due to the non-identifiability between alpha and  hazard priors. We have added this discussion to the paper:

      “This is surprising if we interpret 𝛼 as a trait that is stable through time. Unfortunately, due to  the non-identifiability between 𝛼 and hazard priors, we cannot verify whether 𝛼 is actually  higher for FONC animals than UONC animals.”

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The study is methodologically solid and introduces a compelling regulatory model. However, several mechanistic aspects and interpretations require clarification or additional experimental support to strengthen the conclusions.

      Strengths:

      (1) The manuscript presents a compelling structural and biochemical analysis of human glutamine synthetase, offering novel insights into product-induced filamentation.

      (2) The combination of cryo-EM, mutational analysis, and molecular dynamics provides a multifaceted view of filament assembly and enzyme regulation.

      (3) The contrast between human and E. coli GS filamentation mechanisms highlights a potentially unique mode of metabolic feedback in higher organisms.

      Weaknesses:

      (1) The mechanism underlying spontaneous di-decamer formation in the absence of glutamine is insufficiently explored and lacks quantitative biophysical validation.

      (2) Claims of decamer-only behavior in mutants rely solely on negative-stain EM and are not supported by orthogonal solution-based methods.

      We thank the reviewer for the summary and noting of the strengths. We agree that the evolutionary divergence of metabolic feedback in GS homologs is a fruitful avenue for future studies. With regard to the weaknesses, the di-decamer in the absence of glutamine only forms under high (higher than physiological) concentrations of enzyme. Our primary evidence for the mutant behavior was the lack of crosslinking (Figure 1E), with supplementary support from the negative stain. In the revised version we will soften the language to say “reduced” rather than “did not support” filament formation.

      Reviewer #2 (Public review):

      The authors set out to resolve the high-resolution structure of a glutamine synthetase (GS) decamer using cryo-EM, investigate glutamine binding at the decamer interface, and validate structural observations through biochemical assays of ATP hydrolysis linked to enzyme activity. Their work sits at the intersection of structural and functional biology, aiming to bridge atomic-level details with biological mechanisms - a goal with clear relevance to researchers studying enzyme catalysis and metabolic regulation.

      Strengths and weaknesses of methods and results:

      A key strength of the study lies in its use of cryo-EM, a technique well-suited for resolving large, dynamic macromolecular complexes like the GS decamer. The reported resolutions (down to 2.15 Å) initially suggest the potential for detailed structural insights, such as side-chain interactions and ligand density. However, several methodological limitations significantly undermine the reliability of the results:

      (1) Cryo-EM data processing: The absence of critical details about B-factor sharpening - a standard step to enhance map interpretability - is a major concern. For high-resolution maps (<3 Å), sharpening is typically applied to resolve side-chain features, yet the submitted maps (e.g., those in Figures 1D, 2D, and supplementary figures) appear unprocessed, with density quality inconsistent with the claimed resolutions. This makes it difficult to evaluate whether observed features (e.g., glutamine binding) are genuine or artifacts of unsharpened data.

      (2) Modeling and density consistency: The structural models, particularly for glutamine binding at the decamer interface, do not align with the reported resolution. The maps shown in Figure 2D and Supplementary Figure S7 lack sufficient density to confidently place glutamine or even surrounding residues, conflicting with claims of 2.15 Å resolution. Additionally, fitting a non-symmetric ligand (glutamine) into a symmetry-refined map requires justification, as symmetry constraints may distort ligand placement.

      (3) Biochemical assay controls: While the enzyme activity assays aim to link structure to function, they lack essential controls (e.g., blank reactions without GS or substrates, substrate omission tests) to confirm that ATP hydrolysis is GS-dependent. The use of TCEP, a reducing agent, is also not paired with experiments to rule out unintended effects on the PK/LDH system, further limiting confidence in activity measurements.

      Achievement of aims and support for conclusions:

      The study falls short of convincingly achieving its goals. The claimed high-resolution structural details (e.g., side-chain densities, ligand binding) are not supported by the provided maps, which lack sharpening and show inconsistencies in density quality. Similarly, the biochemical data do not robustly validate the structural claims due to missing controls. As a result, the evidence is insufficient to confirm glutamine binding at the decamer interface or the functional relevance of the observed structural features.

      Likely impact and utility:

      If these methodological gaps are addressed, the work could make a meaningful contribution to the field. A well-resolved GS decamer structure would advance understanding of enzyme assembly and ligand recognition, while validated biochemical assays would strengthen the link between structure and function. Improved data processing and clearer reporting of validation steps would also make the structural data more reliable for the community, providing a resource for future studies on GS or related enzymes.

      We disagree with the reviewer’s overall assessment.

      With regard to sharpening and resolution: we examined sharpened maps and in a revised version will present additional supplementary figures showing these maps side by side. We note that the resolutions reported are global and that the most interesting features are, of course, in the periphery and subject to conformational and compositional heterogeneity. We will include supplementary figures of core side chain densities that are more like what are expected by the reviewer in the revision. 

      With regard to modeling: the apo filament and turnover filament datasets were handled nearly identically. The additional density is therefore likely not artefactual to the symmetry operator - however, the lower resolution in this region noted by the reviewer is worthy of further exploration. The maps are public and we think this is the most plausible interpretation of the density, which we based primarily on the biochemical data and will include more speculation in the revised version.

      With regard to the biochemical controls: we point the reviewer to Figure S1, which shows that omission of ammonia or glutamate in the wild-type (tagless) system removes any coupling of the reactions. We will perform the additional controls to publication quality in the revised version along with the TCEP control. We note that the reducing agent is present across all experiments, ruling out an effect on any specific result. The inclusion of TCEP is also very standard in other published uses of the coupled ATPase assay (e.g., PMID: 31778111 and PMID: 32483380 by our first author).

      Additional context:

      Cryo-EM has transformed structural biology by enabling high-resolution analysis of large complexes, but its success hinges on rigorous data processing and validation steps that are critical to ensuring reproducibility. The challenges highlighted here are not unique to this study; they reflect broader issues in the field where incomplete reporting of methods can obscure the reliability of results. By addressing these points, the authors would not only strengthen their current work but also set a positive example for transparent and rigorous structural biology research.

      All the data is public and the reviewer or anyone is free to reinterpret the maps and models - and we encourage that rather than just an interpretation of our static figures. In addition, we will upload the raw micrograph data for the apo filament and turnover filament datasets to EMPIAR prior to submitting the revision.

      Reviewer #3 (Public review):

      In this manuscript, the authors propose a product-dependent negative-feedback mechanism of human glutamine synthetase, whereby the product glutamine facilitates filament formation, leading to reduced catalytic specificity for ammonia. Using time-resolved cryo-EM, the authors demonstrate filament formation under product-rich conditions. Multiple high-quality structures, including decameric and di-decameric assemblies, were resolved under different biochemical states and combined with MD simulations, revealing that the conformational space of the active site loop is critical for the GS catalysis. The study also includes extensive steady-state kinetic assays, supporting the view that glutamine regulates GS assembly and its catalytic activity. Overall, this is a detailed and comprehensive study. However, I would advise that a few points be addressed and clarified.

      (1) In Figure 2D and Supplementary Figure 7, the extra density observed between the two decamers does not appear to have the defining features of a glutamine. A less defined density may be expected given the nature of the complex, but even though mutagenesis assays were performed to support this assignment, none of these results constitutes direct and conclusive evidence for glutamine binding at this site. I would thus suggest showing the density maps at multiple contour thresholds to allow readers to also better evaluate the various small molecules under turnover conditions that cannot be well fitted based on this density map, helping to provide a more balanced interpretation of the results.

      (2) On the same point regarding the density for the enzyme under turnover conditions, more details should be provided about the symmetry expansion and classification performed, and also show the approximate ratio of reconstructions that include this density. Did you try symmetry expansion followed by focused classification, especially on the interface region?

      (3) The interface between the two decamers of the model needs to be double-checked and reassigned, especially for the residues surrounding the fitted glutamine. For example, the side chain of the Lys residue shown in the attached figure is most likely modeled incorrectly.

      We thank the reviewer for the feedback. As noted above, we will include supplemental figures that show maps at multiple thresholds and sharpening schemes. We noted in the manuscript and above that our interpretation here is based on integrating biochemical evidence alongside the density and will make that even more clear in the revised manuscript. The filaments +/- the putative glutamine density were processed nearly identically, but we will attempt various schemes of focused classification/symmetry expansion in the revision as well. However, we point out that there is extensive averaging there that makes modeling a bit trickier than expected given the global resolution.

    1. Praising students for merely meeting expectations may reduce student behavior over time as it “cheapens” your praise.

      This is something I agree with wholeheartedly. And I think it is because I see this in my job: we have an "Employee of the Quarter" program and it sounds wonderful on the surface level but the unfortunate reality is that every single person will eventually get this award even if they don't deserve it. This will cause employees to think "Oh I can get this extra special recognition and this award just for being here/doing below the bare minimum/doing the bare minimum."

    1. Group G Ben Braniff, Kim Maynard, Nick Devic, Maria Echeverri Solis, Sam Yalda

      1. Design has a major impact on the world and society. Even the little things can add up to a lot. Sustainability is a revolutionary idea that should be at the core of every design now.

      2. Society is another bottom line meaning all design inherently affects humans and/or is designed for humans. It's important to design for the extremes and the edge cases like people with disabilities.

      3. Corporations output a lot of waste. When they make small changes to be more sustainable, it results in big changes and saving a lot of material. Small changes can include anything from using 2% less plastic per water bottle to using wood buttons instead of plastic ones.

      4. A lot of people don't consider themselves disabled, but it's very common at some point in people's lives to have a certain level of impairment. It's important to keep this in mind when designing as you're designing for the general population--not just a specific individual.

      5. Addressing issues like world hunger may require rethinking the way we design food production. For example, as they stated, kangaroo meat could be chosen over beef as a more environmentally sustainable option.

      6. Thoughtful design choices, such as the video's example of adding white circles inside letters to reduce ink use, can improve efficiency and conserve resources.

      7. It is interesting how he opens up his discussion to slowly introduce that design isn't just about doing it for marketing or 'profit' as he pointed out. When watching this it helps a person realize that design is so much more powerful than that if you put it towards another cause. Design could end up being the solution to some of the biggest problems in society.

      8. A very important point he made was that improving accessibility is beneficial to many more people than just the people that initially needed it, such as people with disabilities. From this I think a good takeaway is that design should always be considerate of any disabilities/needs that the audience might have because sometimes that design is just better for everyone in general.

      9. My first take is design should go beyond money and aesthetics. By thinking about sustainability and accessibility the designers can create solutions that are socially responsible and environmentally friendly.

      10. My second take is when you design with people with disabilities you end up with solutions that are more usable and inclusive.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work aims to elucidate the molecular mechanisms affected in hypoxic conditions, causing reduced cortical interneuron migration. They use human assembloids as a migratory assay of subpallial interneurons into cortical organoids and show substantially reduced migration upon 24 hours of hypoxia. Bulk and scRNA-seq show adrenomedullin (ADM) up-regulation, as well as its receptor RAMP2, confirmed at the protein level. Adding ADM to the culture medium after hypoxic conditions rescues the migration deficits, even though the subtype of interneurons affected is not examined. However, the authors demonstrate very clearly that ineffective ADM does not rescue the phenotype, and blocking RAMP2 also interferes with the rescue. The authors are also applauded for using 4 different cell lines and using human fetal cortex slices as an independent method to explore the DLXi1/2GFP-labelled iPSC-derived interneuron migration in this substrate with and without ADM addition (after confirming that also in this system ADM is up-regulated). Finally, the authors demonstrate PKA-CREB signalling mediating the effect of ADM addition, which also leads to up-regulation of GABA receptors. Taken together, this is a very carefully done study on an important subject - how hypoxia affects cortical interneuron migration. In my view, the study is of great interest.

      Strengths:

      The strengths of the study are the novelty and the thorough work using several culture methods and 4 independent lines.

      Weaknesses:

      The main weakness is that other genes regulated upon hypoxia are not confirmed, such that readers will not know until which fold change/stats cut-off data are reliable.

      Reviewer #2 (Public review):

      Summary

      The manuscript by Puno and colleagues investigates the impact of hypoxia on cortical interneuron migration and downstream signaling pathways. They establish two models to test hypoxia, cortical forebrain assembloids, and primary human fetal brain tissue. Both of these models provide a robust assay for interneuron migration. In addition, they find that ADM signaling mediates the migration deficits and rescue using exogenous ADM.

      Strengths:

      The findings are novel and very interesting to the neurodevelopmental field, revealing new insights into how cortical interneurons migrate and as well, establishing exciting models for future studies. The authors use sufficient iPSC lines including both XX and XY, so the analysis is robust. In addition, the RNAseq data with re-oxygenation is a nice control to see what genes are changed specifically due to hypoxia. Further, the overall level of validation of the sequencing data and involvement of ADM signaling is convincing, including the validation of ADM at the protein level. Overall, this is a very nice manuscript.

      Weaknesses:

      I have a few comments and suggestions for the authors. See below.

      Reviewer #3 (Public review):

      Summary:

      The authors aimed to test whether hypoxia disrupts the migration of human cortical interneurons, a process long suspected to underlie brain injury in preterm infants but previously inaccessible for direct study. Using human forebrain assembloids and ex vivo developing brain tissue, they visualized and quantified interneuron migration under hypoxic conditions, identified molecular components of the response, and explored the effect of pharmacological intervention (specifically ADM) on restoring the migration deficits.

      Strengths:

      The major strength of this study lies in its use of human forebrain assembloids and ex vivo prenatal brain tissue, which provide a direct system to study interneuron migration under hypoxic conditions. The authors combine multiple approaches: long-term live imaging to directly visualize interneuron migration, bulk and single-cell transcriptomics to identify hypoxia-induced molecular responses, pharmacological rescue experiments with ADM to establish therapeutic potential, and mechanistic assays implicating the cAMP/PKA/pCREB pathway and GABA receptor expression in mediating the effect. Together, this rigorous and multifaceted strategy convincingly demonstrates that hypoxia disrupts interneuron migration and that ADM can restore this defect through defined molecular mechanisms.

      Overall, the authors achieve their stated aims, and the results strongly support their conclusions. The work has a significant impact by providing the first direct evidence of hypoxia-induced interneuron migration deficits in the human context, while also nominating a candidate therapeutic avenue. Beyond the specific findings, the methodological platform - particularly the combination of assembloids and live imaging - will be broadly useful to the community for probing neurodevelopmental processes in health and disease.

      Weaknesses:

      The main weakness of the study lies in the extent to which forebrain assembloids recapitulate in vivo conditions, as the migration of interneurons from hSO to hCO does not fully reflect the native environment or migratory context of these cells. Nevertheless, this limitation is tempered by the fact that the work provides the first direct observation of human interneuron migration under hypoxia, representing a major advance for the field. In addition, while the transcriptomic analyses are valuable and highlight promising candidates, more in-depth exploration will be needed to fully elucidate the molecular mechanisms governing neuronal migration and maturation under hypoxic conditions.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The authors should examine if all cortical interneurons are affected by ADM or only subtypes (Parvalbumin/Somatostatin).

      We thank the reviewer for raising this important question. In our study, we utilized the Dlx1/2b::eGFP reporter to broadly label cortical interneurons; however, this system does not distinguish specific interneuron subtypes. To address this, in the revised version of the manuscript we will use the single-cell RNA sequencing data and immunostainings to provide this information. Based on previous analyses from Birey et al (Cell Stem Cell, 2022), we expect interneurons within assembloids to express mostly calbindin (CALB2) and somatostatin (SST) at this in vitro stage of development; parvalbumin subtype appears later based on data from Birey et al (Nature, 2017) and more recently from Varela et al, (bioRxiv, 2025).

      In parallel, we will analyze available scRNA-seq data from developing human primary brain tissue of a similar age to that used in the manuscript, and check whether these subtypes of interneurons are similar to the ones within assembloids.

      (2) The authors should test more candidates from their bulk RNA-seq data with different fold changes for regulation after hypoxia, to allow the reader to judge at which cut-off the DEGs may be reproducible. This would make this database much more valuable for the field of hypoxia research.

      We appreciate the reviewer’s thoughtful suggestion. In addition to the bulk RNA-seq analysis, we did validate several upregulated hypoxia-responsive genes with varying fold changes by qPCR; these include PDK1, PFKP, and VEGFA (Figure S1).

      We do agree that an in-depth investigation of specific cut-offs would be interesting; however, this could be the focus of a different manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) Can the authors comment on the possibility of inflammatory response pathways being activated by hypoxia? Has this been shown before? While not the focus of the manuscript, it could be discussed in the Discussion as an interesting finding and potential involvement of other cells in the Hypoxic response.

      We thank the reviewer for this important comment about inflammation. Indeed, hypoxia has been shown to activate the inflammatory response pathways. In various studies, it was found that HIF-1α can interact with NF-κB signaling, leading to the upregulation of pro-inflammatory cytokines such as IL-1β, IL-6, and TNF-α (Rius et al., Cell, 2008; Hagberg et al., Nat Rev Neurol, 2015).

      In our transcriptomics data (Figure 2D), and to the reviewer’s point, we identified enrichment of inflammatory signaling responses following hypoxic exposure. Since hSOs at the time of analysis do contain astrocytes, we think these glia contribute to the observed pro-inflammatory changes. Based on these results and because ADM is known to have strong anti-inflammatory properties, the effects of ADM on hypoxic astrocytes should be investigated in future studies focused on hypoxia-induced inflammation. In the revision, we will address this comment in the discussion section and cite the appropriate papers.

      (2) Could the authors comment on the mechanism at play here with respect to ADM and binding to RAMP2 receptors - is this a potential autocrine loop, or is the source of ADM from other cell types besides inhibitory neurons? Given the scRNA-seq data, what cell-to-cell mechanisms can be at play? Since different cells express ADM, there could be different mechanisms in place in ventral vs dorsal areas.

      Based on our scRNA-seq data in hSOs showing significant upregulation of ADM expression in astrocytes and progenitors, we speculate that the primary mechanism is likely to involve paracrine interactions. However, we cannot exclude autocrine mechanisms with the included experiments. Dissecting these interactions in a cell-type specific manner could be an important focus for future ADM-related studies.

      To address the question about the possible different mechanisms in ventral versus dorsal areas, in the revision we will plot and include in the figures the data about the cell-type expression of ADM and its receptors in hCOs.

      (3) For data from Figure 6 - while the ELISA assays are informative to determine which pathways (PKA, AKT, ERK) are active, there is no positive control to indicate these assays are "working" - therefore, if possible, western blot analysis from assembloid tissue could be used (perhaps using the same lysates from Figure 3) as an alternative to validate changes at the protein level (however, this might prove difficult); further to this, is P-CREB activated at the protein level using WB?

      We thank the reviewer for this comment and the observation. Although we did not include a traditional positive control in these ELISA assays, several lines of evidence indicate that the measurements are reliable. First, the standard curves behaved as expected, and all sample values fell within the assay’s dynamic range. Second, technical replicates showed low variability, and the observed changes across experimental conditions (e.g., hypoxia vs. control) were consistent with the expected biological responses based on previous literature. We agree that including western blot validation would strengthen the findings, and we will note this for our future studies focused on CREB and ADM.

      (4) Could the authors comment further on the mechanism and what biological pathways and potential events are downstream of ADM binding to RAMP2 in inhibitory neurons? What functional impact would this have linked to the CREB pathway proposed? While the link to GABA receptors is proposed, CREB has many targets beyond this.

      We appreciate the reviewer’s insightful question. Currently, not much is known about the molecular pathways and downstream cellular events triggered by ADM binding to RAMP2 in inhibitory neurons, or in brain cells in general. Our study provides the first data on the cell-type specific expression of ADM in baseline and hypoxic conditions, which is one of its key novelties.

      While the signaling landscape of ADM in interneurons is largely unexplored, several studies in other (non-brain) cell types have demonstrated that ADM binding to RAMP2 can activate downstream cascades such as the cAMP/PKA/CREB pathway, PI3K/AKT, and ERK/MAPK, all of which are also known to be critical regulators of neuronal development and survival. These previously published data, along with our CREB-targeted findings in hypoxic interneurons, suggest ADM–RAMP2 signaling could influence multiple aspects of interneuron biology, but these remain to be evaluated in future studies.

      We agree with the reviewer that CREB has a wide range of transcriptional targets. We decided to focus on GABA as a target of CREB for two main reasons: (i) GABA signaling has been previously shown to play an important role in the migration of cortical interneurons, and (ii) a previous study by Birey et al. (Cell Stem Cell, 2022) demonstrated that CREB pathway activity is essential for regulating interneuron migration in assembloid models of Timothy syndrome, thus providing further evidence that dysregulation of CREB activity disrupts migration dynamics.

      While our study provides a first step toward uncovering the mechanisms of interneuron migration protection by ADM, we fully acknowledge that future work will be needed to delineate the full spectrum of ADM–RAMP2 downstream signaling events in inhibitory neurons and other brain cells.

      (5) Does hypoxia cause any changes to inhibitory neurogenesis (earlier stages than migration)? This might already be known, but was not discussed.

      We appreciate this question from the reviewer; however, this was not something that we focused on in this manuscript due to the already large amount of data included. A separate study focusing on neurogenesis defects and the molecular mechanisms of injury for that specific developmental process would be an important next step.

      (6) In the Discussion section, it might be worth detailing to the readers what the functional impact of delayed/reduced migration of inhibitory neurons into the cortex might result in, in terms of functional consequences for neural circuit development.

      We thank the Reviewer for the suggestion of detailing the functional impact of reduced inhibitory neuron migration. We will revise the manuscript by incorporating a paragraph about this in the Discussion section.

      Reviewer #3 (Recommendations for the authors):

      Most of the evidence presented is convincing in supporting the conclusions, and I have only minor suggestions for improvement:

      (1) The bulk RNA-seq was performed in hSOs only, which may not fully capture the phenotypes of migrating or migrated interneurons. It would be valuable, if feasible, to sort migrated cells from hSO-hCO assembloids and specifically examine their molecular mediators.

      We thank the reviewer for this suggestion. While it is likely that the cellular environment will have some influence on a subset of the molecular changes, based on all the data from the manuscript and our specific target, the RNA-sequencing on hCOs was sufficient to capture essential changes like ADM upregulation. An in-depth exploration of the differential responses of migrated versus non-migrated interneurons to hypoxia could be the focus of a different project.

      (2) In Figure 3, it is striking that cell-type heterogeneity dominates over hypoxia vs. control conditions. A joint embedding of hSO and hCO cells could provide further insight into molecular differences between migrated and non-migrated interneurons.

      We thank the reviewer for this observation and the opportunity to clarify. Since we manually separated the assembloids before the analyses, the hSO and hCO samples were processed separately, which is why they separate in this way. In the revision, we will add data about the expression of ADM and its receptors in the hCOs.

      (3) It would be helpful to expand the discussion on how closely the migration observed in hSO-hCO assembloids reflects in vivo conditions, and what environmental aspects are absent from this model. This would better frame the interpretation and translational relevance of the findings.

      We thank the Reviewer for bringing up this important point. Although the assembloid model offers the unique advantage of allowing the direct investigation of migration patterns of hypoxic interneurons, we agree it does not fully recapitulate the in vivo environment. While there are multiple aspects that cannot be recapitulated in vitro at this time (e.g. cellular complexity, vasculature, immune response, etc.), we are encouraged by the validation of our main findings in ex vivo developing human brain tissue, which strongly supports the validity of our findings for in vivo conditions.

      We will expand our discussion to include more details and the need to validate these findings using in vivo models, while also acknowledging that different species (e.g. rodents versus non-human primates versus humans) might have different responses to hypoxia.

      (4) The authors suggest that hypoxia is also associated with delayed interneuron maturation, yet the bulk RNA-seq data primarily reveal stress and hypoxia-related genes. A more detailed discussion of why genes linked to interneuron maturation and function were not strongly affected would clarify this point.

      We thank the Reviewer for the opportunity to clarify.

      The RNA-seq was performed during the acute stages of hypoxia/reoxygenation. We think a maturation phenotype might be difficult to capture at this point and would require analysis at later stages of in vitro assembloid maturation.

      Our speculation about a possible maturation defect is based on previous developmental biology studies showing that failure of interneurons to reach their final cortical location within a specified developmental window impairs their integration within the neuronal network, and thus leads to maturation defects and possible elimination by apoptosis.

      Since preterm infants suffer from countless hypoxic events over multiple months, we suggest these repetitive events are likely to induce cumulative delays in migration, inability of interneurons to reach their target in time, followed by abnormal integration within the excitatory network, and eventual elimination of some of these interneurons through apoptosis. However, the direct demonstration of this effect following a hypoxic insult would require prolonged in vivo experiments in rodents to follow the migration, network integration and apoptosis of interneurons; to our knowledge this experimental design is not technically feasible at this time.

      (5) Relatedly, while the focus on interneuron migration is well justified, acknowledging how hypoxia might also impact other aspects of cortical development (e.g., progenitor proliferation, neuronal maturation, or circuit integration) would place the findings in a broader developmental framework and strengthen their relevance.

      We appreciate the Reviewer’s suggestion to discuss the role of hypoxia on other processes during cortical development. In the revised manuscript, we will include citations about the effects of hypoxia on interneuron proliferation, maturation and circuit integration as available, and also expand to other cell types known to be affected.

      (6) Very minor: in Figure S3C and D, it was not stated what the colors mean (grey: control, yellow: hypoxia)

      Thank you for pointing out this error; we will correct it in our revision.

    1. This manuscript examines preprint review services and their role in the scholarly communications ecosystem.  It seems quite thorough to me. In Table 1 they list many peer-review services that I was unaware of e.g. SciRate and Sinai Immunology Review Project.

      To help elicit critical & confirmatory responses for this peer review report I am trialling Elsevier’s suggested “structured peer review” core questions, and treating this manuscript as a research article.

      Introduction

      1. Is the background and literature section up to date and appropriate for the topic?

        Yes.

      2. Are the primary (and secondary) objectives clearly stated at the end of the introduction?

        No. Instead the authors have chosen to put the two research questions on page 6 in the methods section. I wonder if they ought to be moved into the introduction – the research questions are not methods in themselves. Might it be better to state the research questions first and then detail the methods one uses to address those questions afterwards (as Elsevier’s structured template seems implicitly to prefer)?

      Methods

      1. Are the study methods (including theory/applicability/modelling) reported in sufficient detail to allow for their replicability or reproducibility?

        I note with approval that the version number of the software they used (ATLAS.ti) was given.

        I note with approval that the underlying data is publicly archived under CC BY at figshare.

        The Atlas.ti report data spreadsheet could do with some small improvement – the column headers are a little cryptic, e.g. “Nº ST” and “ST”, which I eventually deduced were Number of Schools of Thought and Schools of Thought (?)

        Is there a rawer form of the data that could be deposited with which to evidence the work done? The Atlas.ti report spreadsheet seemed like it was downstream output data from Atlas.ti. What was the rawer input data entered into Atlas.ti? Can this be archived somewhere in case researchers want to reanalyse it using other tools and methods?

        I note with disapproval that Atlas.ti is proprietary software which may hinder the reproducibility of this work. Nonetheless I acknowledge that Atlas.ti usage is somewhat ‘accepted’ in social sciences despite this issue.

        I think the qualitative text analysis is a little vague and/or under-described: “Using ATLAS.ti Windows (version 23.0.8.0), we carried out a qualitative analysis of text from the relevant sites, assigning codes covering what they do and why they have chosen to do it that way.” That’s not enough detail. Perhaps an example or two could be given? Was inter-rater reliability assessed when ‘assigning codes’? How do we know the ‘codes’ were assigned accurately?

      2. Are statistical analyses, controls, sampling mechanism, and statistical reporting (e.g., P-values, CIs, effect sizes) appropriate and well described?

        This is a descriptive study (and that’s fine) so there aren’t really any statistics on show in this manuscript other than simple ‘counts’ (of Schools of Thought). There are probably some statistical processes going on within the proprietary qualitative analysis of text done in ATLAS.ti, but it is under-described and so hard for me to evaluate.
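
        As one illustration of the inter-rater reliability check asked about under question 1, agreement between two coders can be summarised with Cohen's kappa. The sketch below is only an assumption about how such a check might look: the function name, the two coders, and the code labels are hypothetical and are not taken from the manuscript or from ATLAS.ti.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one code per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal code frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same six text passages.
coder_1 = ["quality", "quality", "efficiency", "equity", "efficiency", "quality"]
coder_2 = ["quality", "efficiency", "efficiency", "equity", "efficiency", "quality"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

        Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance; reporting such a figure alongside the coding scheme would let readers judge how consistently the codes were applied.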

      Results

      1. Is the results presentation, including the number of tables and figures, appropriate to best present the study findings?

        Yes. However, I think a canonical URL to each service should be given.  A URL is very useful for disambiguation, to confirm e.g. that the authors mean this Hypothesis (www.hypothes.is) and NOT this Hypothesis (www.hyp.io). I know exactly which Hypothesis is the one the authors are referring to but we cannot assume all readers are experts 😊

        Optional suggestion: I wonder if the authors couldn’t present the table data in a slightly more visual and/or compact way? It’s not very visually appealing in its current state. Purely as an optional suggestion, to make the table more compact one could recode the answers given in one or more of the columns 2, 3 and 4 in the table e.g. "all disciplines = ⬤ , biomedical and life sciences = ▲, social sciences =  ‡  , engineering and technology = † ". I note this would give more space in the table to print the URLs for each service that both reviewers have requested.

        ———————————————————————————————
        | Service name | Developed by | Scientific disciplines | Types of outputs |
        | Episciences | Other | ⬤ | blah blah blah. |
        | Faculty Opinions | Individual researcher | ▲ | blah blah blah. |
        | Red Team Market | Individual researcher | ‡ | blah blah blah. |
        ———————————————————————————————

        The "Types of outputs" column might even lend itself to mini-colour-pictograms (?) which could be more concise and more visually appealing? A table of text alone might be scientifically 'correct', but it is incredibly dull for readers, in my opinion.

      2. Are additional sub-analyses or statistical measures needed (e.g., reporting of CIs, effect sizes, sensitivity analyses)?

        No / Not applicable. 

      Discussion

      1. Is the interpretation of results and study conclusions supported by the data and the study design?

        Yes.

      2. Have the authors clearly emphasized the limitations of their study/theory/methods/argument?

        No. Perhaps a discussion of the linguistic/comprehension bias of the authors might be appropriate for this manuscript. What if there are ‘local’ or regional Chinese, Japanese, Indonesian or Arabic language preprint review services out there? Would this authorship team really be able to find them?

      Additional points:

      • Perhaps the points made in this manuscript about financial sustainability (p24) are a little too pessimistic. I get it, there is merit to this argument, but there is also some significant investment going on there if you know where to look. Perhaps it might be worth citing some recent investments, e.g. Gates -> PREreview (2024) https://content.prereview.org/prereview-welcomes-funding/ and Arcadia’s $4 million USD to COAR for the Notify Project, which supports a range of preprint review communities including Peer Community In, Episciences, PREreview and Harvard Library (source: https://coar-repositories.org/news-updates/coar-welcomes-significant-funding-for-the-notify-project/).

      • Although I note they are mentioned, I think more needs to be written about the similarity and overlap between ‘overlay journals’ and preprint review services. Are these arguably not just two different terms for kinda the same thing? If you have Peer Community In, which has its overlay component in the form of the Peer Community Journal, why not mention other overlay journals like Discrete Analysis and The Open Journal of Astrophysics? I think Peer Community In (& its PCJ) is the go-to example of the thinness of the line that separates (or doesn’t!) overlay journals and preprint review services. Some more exposition on this would be useful.

    2. Thank you very much for the opportunity to review the preprint titled “Preprint review services: Disrupting the scholarly communication landscape?” (https://doi.org/10.31235/osf.io/8c6xm) The authors review services that facilitate peer review of preprints, primarily in the STEM (science, technology, engineering, and math) disciplines. They examine how these services operate and their role within the scholarly publishing ecosystem. Additionally, the authors discuss the potential benefits of these preprint peer review services, placing them in the context of tensions in the broader peer review reform movement. The discussions are organized according to four “schools of thought” in peer review reform, as outlined by Waltman et al. (2023), which provides a useful framework for analyzing the services. In terms of methodology, I believe the authors were thorough in their search for preprint review services, especially given that a systematic search might be impractical.

      As I see it, the adoption of preprints and reforming peer review are key components of the move towards improving scholarly communication and open research. This article is a useful step along that journey, taking stock of current progress, with a discussion that illuminates possible paths forward. It is also well-structured and easy for me to follow. I believe it is a valuable contribution to the metaresearch literature.

      On a high level, I believe the authors have made a reasonable case that preprint review services might make peer review more transparent and rewarding for all involved. Looking forward, I would like to see metaresearch which gathers further evidence that these benefits are truly being realised.

      In this review, I will present some general points which merit further discussion or clarification to aid an uninitiated reader. Additionally, I raise one issue regarding how the authors framed the article and categorised preprint review services and the disciplines they serve. In my view, this problem does not fundamentally undermine the robust search, analyses, and discussion in this paper, but it risks putting off some researchers and constrains how broadly one should derive conclusions.

      General comments

      Some metaresearchers may be aware of preprints, but not all readers will be familiar with them. I suggest briefly defining what they are, how they work, and which types of research have benefited from preprints, similar to how “preprint review service” is clearly defined in the introduction.

      Regarding Waltman et al.’s (2023) “Equity & Inclusion” school of thought, does it specifically aim for “balanced” representation by different groups as stated in this article? There is an important difference between “balanced” versus “equitable” representation, and I would like to see it addressed in this text.

      Another analysis I would like to see is whether any of the 23 services reviewed present any evidence that their approach has improved research quality. For instance, the discussion on peer review efficiency and incentives states that there is currently “no hard evidence” that journals want to utilise reviews by Rapid Reviews: COVID-19, and that “not all journals are receptive” to partnerships. Are journals skeptical of whether preprint review services could improve research quality? Or might another dynamic be at work?

      The authors cite Nguyen et al. (2015) and Okuzaki et al. (2019), stating that peer review is often “overloaded”. I would like to see a clearer explanation of what “overloaded” means in this context so that a reader does not have to read the two cited papers.

      To the best of my understanding, one of the major sticking points in peer review reform is whether to anonymise reviewers and/or authors. Consequently, I appreciate the comprehensive discussion about this issue by the authors.

      However, I am only partially convinced by the statement that double anonymity is “essentially incompatible” with preprint review. For example, there may be, as yet not fully explored, ways to publish anonymous preprints with (a) a notice that it has been submitted to, or is undergoing, peer review; and (b) that the authors will be revealed once peer review has been performed (e.g. at least one review has been published). This would avoid the issue of publishing only after review is concluded as is the case for Hypothesis and Peer Community In.

      Additionally, the authors describe 13 services which aim to “balance transparency and protect reviewers’ interests”. This is a laudable goal, but I am concerned that framing this as a “balance” implies a binary choice, and that to have more of one, we must lose an equal amount of the other. Thinking only in terms of “balance” prevents creative, win-win solutions. Could a case be made for non-anonymity to be complemented by a reputation system for authors and reviewers? For example, major misconduct (e.g. retribution against a critical review) would be recorded in that system and dissuade bad actors. Something similar can already be seen in the reviewer evaluation system of CrowdPeer, which could plausibly be extended or modified to highlight misconduct.

      I also note that misconduct and abusive behaviour already occur even in fully or partially anonymised peer review, and they are not limited to the review of preprints. While I am not aware of existing literature on this topic, academics’ fears seem reasonable. For example, there are at least anecdotal testimonies that a reviewer would deliberately reject a paper to retard the progress of a rival research group, while taking the ideas of that paper and beating their competitors to winning a grant. Or, a junior researcher might refrain from giving a negative review out of fear that the senior researcher whose work they are reviewing might retaliate. These fears, real or not, seem to play a part in the debates about if and how peer review should (or should not) be anonymised. I would like to see an exploration of whether de-anonymisation will improve or worsen this behaviour and in what contexts. And if such studies exist, it would be good to discuss them in this paper.

      I found it interesting that almost all preprint review services claim to be complementary to, and not compete with, traditional journal-based peer review. The methodology described in this article cannot definitively explain what is going on, but I suspect there may be a connection between this aversion to competing with traditional journals, and (a) the skepticism of journals towards partnering with preprint review services and (b) the dearth of publisher-run options. I hypothesise that there is a power dynamic at play, where traditional publishers have a vested interest in maintaining the power they hold over scholarly communication, and that preprint review services stress their complementarity (instead of competitiveness) as a survival mechanism. This may be an avenue for further metaresearch.

      To understand preprints from which fields of research are actually present on the services categorised under “all disciplines,” I used the Random Integer Set Generator by the Random.org true random number service (https://www.random.org/integer-sets/) to select five services for closer examination: Hypothesis, Peeriodicals, PubPeer, Qeios, and Researchers One. Of those, I observed that Hypothesis is an open source web annotation service that allows commenting on and discussion of any web page on the Internet regardless of whether it is research or preprints. Hypothesis has a sub-project named TRiP (Transparent Review in Preprints), which is their preprint review service in collaboration with Cold Spring Harbor Laboratory. It is unclear to me why the authors listed Hypothesis as the service name in Table 1 (and elsewhere) instead of TRiP (or other similar sub-projects). In addition, Hypothesis seems to be framed as a generic web annotation service that is used by some as a preprint review tool. This seems fundamentally different from others who are explicitly set up as preprint review services. This difference seems noteworthy to me.
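
      For readers who want to repeat a selection like the one described above without Random.org, a seeded draw gives a comparable, reproducible sample. This is a minimal sketch under stated assumptions: the function name is hypothetical, the twelve entries are placeholders rather than the actual services listed under “all disciplines”, and Python's pseudo-random generator is swapped in for Random.org's true random source.

```python
import random


def pick_services(candidates, k=5, seed=None):
    """Draw k distinct services without replacement; a seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, k))


# Placeholder names standing in for the 12 services categorised under "all disciplines".
all_discipline_services = [f"service_{i}" for i in range(1, 13)]

selected = pick_services(all_discipline_services, k=5, seed=2024)
print(selected)
```

      Recording the seed alongside the list of candidates would let other reviewers reproduce exactly the same subset.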

      To aid readers, I also suggest including hyperlinks to the 23 services reviewed in this paper. My comments on disciplinary representation in these services are elaborated further below.

      One minor point of curiosity is that several services use an “automated tool” to select reviewers. It would be helpful to describe in this paper exactly what those tools are and how they work, or report situations where services do not explain it.

      Lastly, what did the authors mean by “software heritage” in section 6? Are they referring to the organisation named Software Heritage (https://www.softwareheritage.org/) or something else? It is not clear to me how preprint reviews would be deposited in this context.

      Respecting disciplinary and epistemic diversity

      In the abstract and elsewhere in the article, the authors acknowledge that preprints are gaining momentum “in some fields” as a way to share “scientific” findings. After reading this article, I agree that preprint review services may disrupt publishing for research communities where preprints are in the process of being adopted or already normalised. However, I am less convinced that such disruption is occurring, or could occur, for scholarly publishing more generally.

      I am particularly concerned about the casual conflation of “research” and “scientific research” in this article. Right from the start, it mentions how preprints allow sharing “new scientific findings” in the abstract, stating they “make scientific work available rapidly.” It also notes that preprints enable “scientific work to be accessed in a timely way not only by scientists, but also…” This framing implies that all “scholarly communication,” as mentioned in the title, is synonymous with “scientific communication.” Such language excludes researchers who do not typically identify their work as “scientific” research. Another example of this conflation appears in the caption for Figure 1, which outlines potential benefits of preprint review services. Here, “users” are defined as “scientists, policymakers, journalists, and citizens in general.” But what about researchers and scholars who do not see themselves as “scientists”?

      Similarly, the authors describe the 23 preprint review services using six categories, one of which is “scientific discipline”. One of those disciplines is called “humanities” in the text, and Table 1 lists it as a discipline for Science Open Reviewed. Do the authors consider “humanities” to be a “scientific” discipline? If so, I think that needs to be justified with very strong evidence.

      Additionally, Waltman et al.’s four schools of thought for peer review reform work well with the 23 services analysed. However, at least three out of the four are explicitly described as improving “scientific” research.

      Related to the above is how the five “scientific disciplines” are described as the “usual organisation” of the scholarly communication landscape. On what basis should they be considered “usual”? In this formulation, research in literature, history, music, philosophy, and many other subjects would all be lumped together into the “humanities”, which sit at the same hierarchical level as “biomedical and life sciences”, arguably a much more specific discipline. My point is not to argue for a specific organisation of research disciplines, but to highlight a key epistemic assumption underlying the whole paper that comes across as very STEM-centric (science, technology, engineering, and math).

      How might this part of the methodology affect the categories presented in Table 1? “Biomedical and life sciences” appear to be overrepresented compared to other “disciplines”. I’d like to see a discussion that examines this pattern, and considers why preprint review services (or maybe even preprints more generally) appear to cover mostly the biomedical or physical sciences.

      In addition, there are 12 services described as serving “all disciplines”. I believe this paper can be improved by at least a qualitative assessment of the diversity of disciplines actually represented on those services. Because it is reported that many of these services stress improving the “reproducibility” of research, I suspect most of them serve disciplines which rely on experimental science.

      I randomly selected five services for closer examination, as mentioned above. Of those, only Qeios has demonstrated an attempt to at least split “arts and humanities” into subfields. The others either don’t have such categories altogether, or have a clear focus on a few disciplines (e.g. life sciences for Hypothesis/TRiP). In all cases I studied, there is a heavy focus on STEM subjects, especially biology or medical research. However, they are all categorised by the authors as serving “all disciplines”.

      If preprint review services originate from, or mostly serve, a narrow range of STEM disciplines (especially experiment-based ones), it would be worth examining why that is the case, and whether preprints and reviews of them could (or could not) serve other disciplines and epistemologies.

      It is postulated that preprint review services might “disrupt the scholarly communication landscape in a more radical way”. Considering the problematic language I observed, what about fields of research where peer-reviewed journal publications are not the primary form of communication? Would preprint review services disrupt their scholarly communications?

      To be clear, my concern is not just the conflation of language in a linguistic sense but rather inequitable epistemic power. I worry that this conflation would (a) exclude, minoritise, and alienate researchers of diverse disciplines from engaging with metaresearch; and (b) blind us to a clear pattern in these 23 services, namely their strong focus on the life sciences and medical research, and to a discussion of why that might be the case. Critically, what message are we sending to, for example, a researcher of 18th century French poetry with the language and framing of this paper? I believe the way “disciplines” are currently presented here poses a real risk of devaluing and minoritising certain subject areas and ways of knowing. In its current form, I believe that while this paper is a very valuable contribution, one should not derive from it any conclusions which apply to scholarly publishing as a whole.

      The authors have demonstrated inclusive language elsewhere. For example, they have consciously avoided “peer” when discussing preprint review services, clearly contrasting them to “journal-based peer review”. Therefore, I respectfully suggest that similar sensitivity be adopted to avoid treating “scientific research” and “research” as the same thing. A discussion, or reference to existing works, on the disciplinary skew of preprints (and reviews of them) would also add to the intellectual rigour of this already excellent piece.

      Overall, I believe this paper is a valuable reflection on the state of preprints and services which review them. Addressing the points I raised, especially the use of more inclusive language with regard to disciplinary diversity, would further elevate its usefulness in the metaresearch discourse. Thank you again for the chance to review.

      Signed:

      Dr Pen-Yuan Hsing (ORCID ID: 0000-0002-5394-879X)

      University of Bristol, United Kingdom

      Data availability

      I have checked the associated dataset, but still suggest including hyperlinks to the 23 services analysed in the main text of this paper.